The goal is to retrieve the correct schema from different tables, but the Scala code below is not working.
spark.sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", true)
  .load("CERT_YHEIL_PROJECTS_PREP (1).xlsx")
  .printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
|-- PK\b!A{曧�\a�[Content_Types].xml ��(�ĕMO�@����^Q�����p����R���I��~igɿg��P�b��l��3���r�]�m\f�8�Ǣ����aވ���F�D���Q.h�P\NO�L��\t��耍h��w)Q���1A�7���"��s��^�9������1Q�!��0SKG��?ސdp(����Ϋ*%g�"&���s=;�Y�`k�2��[�7�6x����dk��U���g\f�r�1����n�-�q6�L�K��1eP[�.k�: string (nullable = true)
CodePudding user response:
Spark cannot read Excel files natively, so you have two options:
1. Use pandas to read the file and then convert it to a Spark DataFrame (example code below)
2. Convert it to CSV manually using the "Save As" option in Excel or any other online tool, then load it with the CSV reader
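If you go with the second option, the original snippet works once it points at the converted file. A minimal sketch, assuming the sheet was saved as `CERT_YHEIL_PROJECTS_PREP.csv` (the filename is an assumption):

```scala
// Read the converted CSV; the header row supplies column names and
// inferSchema samples the data to pick column types.
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("CERT_YHEIL_PROJECTS_PREP.csv")  // assumed path to the converted file

df.printSchema()
```

Note `spark.read` rather than `spark.sqlContext.read`: `SQLContext` has been deprecated in favor of `SparkSession` since Spark 2.0.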
Prerequisites:
# Install a library that pandas can use to read the Excel file.
# Note: xlrd 2.x only reads the legacy .xls format; for .xlsx use openpyxl.
pip install openpyxl
import pandas as pd

# Read the Excel file with pandas, then hand it to Spark
pdf_excel = pd.read_excel("CERT_YHEIL_PROJECTS_PREP (1).xlsx")
df = spark.createDataFrame(pdf_excel)
df.printSchema()
df.show()
CodePudding user response:
You are trying to load an Excel file with the CSV reader, which will not work: the reader treats the raw .xlsx zip bytes as text, which is why the inferred "schema" above is garbage. To load an Excel file into a DataFrame, use the spark-excel connector:
import org.apache.spark.sql._
import com.crealytics.spark.excel._
val spark: SparkSession = ???
val df = spark.read.excel(
  header = true,  // Required
  dataAddress = "'My Sheet'!B3:C35",  // Optional, default: "A1"
  treatEmptyValuesAsNulls = false,  // Optional, default: true
  setErrorCellsToFallbackValues = false,  // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
  usePlainNumberFormat = false,  // Optional, default: false. If true, format the cells without rounding and scientific notation
  inferSchema = false,  // Optional, default: false
  addColorColumns = true,  // Optional, default: false
  timestampFormat = "MM-dd-yyyy HH:mm:ss",  // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
  maxRowsInMemory = 20,  // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
  maxByteArraySize = 2147483647,  // Optional, default None. See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-
  tempFileThreshold = 10000000,  // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead
  excerptSize = 10,  // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
  workbookPassword = "pass"  // Optional, default None. Requires unlimited strength JCE for older JVMs
).schema(myCustomSchema)  // Optional, default: either inferred schema, or all columns as Strings
  .load("Worktime.xlsx")
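The connector ships separately from Spark, so it also has to be on the classpath. A minimal sketch for sbt — the version string below is a placeholder, check Maven Central for the latest release that matches your Spark and Scala versions:

```scala
// build.sbt — pull in the spark-excel connector from com.crealytics
// (version is an assumed placeholder; pick the release matching your Spark version)
libraryDependencies += "com.crealytics" %% "spark-excel" % "<spark-excel-version>"
```

Alternatively, add it at launch time with `spark-shell --packages com.crealytics:spark-excel_2.12:<version>` (use the `_2.13` artifact for Scala 2.13).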