The goal is to retrieve the correct schema from different tables, but the Scala code below is not working.
spark.sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", true)
  .load("CERT_YHEIL_PROJECTS_PREP (1).xlsx")
  .printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
|-- PK\b!A{曧�\a�[Content_Types].xml ��(�ĕMO�@����^Q�����p����R���I��~igɿg��P�b��l��3���r�]�m\f�8�Ǣ����aވ���F�D���Q.h�P\NO�L��\t��耍h��w)Q���1A�7���"��s��^�9������1Q�!��0SKG��?ސdp(����Ϋ*%g�"&���s=;�Y�`k�2��[�7�6x����dk��U���g\f�r�1����n�-�q6�L�K��1eP[�.k�: string (nullable = true)
CodePudding user response:
Spark cannot read Excel files natively, so you have two options:
1. Use pandas to read the file and then convert it to a Spark DataFrame (example code below)
2. Convert it to CSV manually using the "Save As" option in Excel or any other online tool, then load it with the CSV reader
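If you go with the second option, the original snippet works once it points at the converted file. A minimal sketch, assuming the sheet was saved as `CERT_YHEIL_PROJECTS_PREP.csv` (the filename is an assumption):

```scala
// Read the converted CSV; the header row supplies column names and
// inferSchema samples the data to pick column types.
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("CERT_YHEIL_PROJECTS_PREP.csv")  // assumed path to the converted file

df.printSchema()
```

Note `spark.read` rather than `spark.sqlContext.read`: `SQLContext` has been deprecated in favor of `SparkSession` since Spark 2.0.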
Prerequisites:
# Install a library that pandas can use to read the Excel file.
# Note: xlrd 2.x only reads the legacy .xls format; for .xlsx use openpyxl.
pip install openpyxl
import pandas as pd

# Read the Excel file with pandas, then hand it to Spark
pdf_excel = pd.read_excel("CERT_YHEIL_PROJECTS_PREP (1).xlsx")
df = spark.createDataFrame(pdf_excel)
df.printSchema()
df.show()
CodePudding user response:
You are trying to load an Excel file with the CSV reader, which will not work: the reader treats the raw .xlsx zip bytes as text, which is why the inferred "schema" above is garbage. To load an Excel file into a DataFrame, use the spark-excel connector:
import org.apache.spark.sql._
import com.crealytics.spark.excel._
val spark: SparkSession = ???
val df = spark.read.excel(
  header = true,  // Required
  dataAddress = "'My Sheet'!B3:C35",  // Optional, default: "A1"
  treatEmptyValuesAsNulls = false,  // Optional, default: true
  setErrorCellsToFallbackValues = false,  // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
  usePlainNumberFormat = false,  // Optional, default: false. If true, format the cells without rounding and scientific notation
  inferSchema = false,  // Optional, default: false
  addColorColumns = true,  // Optional, default: false
  timestampFormat = "MM-dd-yyyy HH:mm:ss",  // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
  maxRowsInMemory = 20,  // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
  maxByteArraySize = 2147483647,  // Optional, default None. See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-
  tempFileThreshold = 10000000,  // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead
  excerptSize = 10,  // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
  workbookPassword = "pass"  // Optional, default None. Requires unlimited strength JCE for older JVMs
).schema(myCustomSchema)  // Optional, default: either inferred schema, or all columns as Strings
  .load("Worktime.xlsx")
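The connector ships separately from Spark, so it also has to be on the classpath. A minimal sketch for sbt — the version string below is a placeholder, check Maven Central for the latest release that matches your Spark and Scala versions:

```scala
// build.sbt — pull in the spark-excel connector from com.crealytics
// (version is an assumed placeholder; pick the release matching your Spark version)
libraryDependencies += "com.crealytics" %% "spark-excel" % "<spark-excel-version>"
```

Alternatively, add it at launch time with `spark-shell --packages com.crealytics:spark-excel_2.12:<version>` (use the `_2.13` artifact for Scala 2.13).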