I have data in an Excel file (.xlsx). How can I read this Excel data and store it in a DataFrame in Spark?
CodePudding user response:
You could use the pandas API on Spark, which is now part of PySpark.
Here is the documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html
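A minimal sketch of how that could look (the path and sheet name below are placeholders you would replace with your own; reading .xlsx files this way typically also requires an Excel engine such as openpyxl to be available on the cluster):
<pre><code>import pyspark.pandas as ps

# Read the workbook into a pandas-on-Spark DataFrame.
# "/path/to/your/file.xlsx" and "Sheet1" are placeholders.
psdf = ps.read_excel("/path/to/your/file.xlsx", sheet_name="Sheet1")

# Convert to a regular Spark DataFrame if you need the Spark API.
sparkDF = psdf.to_spark()
</code></pre>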
CodePudding user response:
You should install the following two libraries on your Databricks cluster:
Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5
Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd
Then you will be able to read your Excel file as follows:
<pre><code>sparkDF = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \  # sheet name and start cell to read from
    .load(filePath)
</code></pre>
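Once loaded, you can check the result with the usual Spark DataFrame methods, for example:
<pre><code># Quick sanity checks on the loaded DataFrame
sparkDF.printSchema()   # inspect the inferred column types
sparkDF.show(5)         # preview the first rows
</code></pre>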