Export a Spark Dataframe (pyspark.pandas.Dataframe) to Excel file from Azure DataBricks

I'm struggling to export a pyspark.pandas.DataFrame to an Excel file.

I'm working in an Azure Databricks notebook with PySpark. My goal is to read a CSV file from one Azure Data Lake Storage (ADLS) container and store it as an Excel file in another ADLS container.

I'm running into difficulties with both performance and the available methods. pyspark.pandas.DataFrame has a built-in to_excel method, but with files larger than 50 MB the command ends with a time-out error after 1 hour (this seems to be a well-known problem).

Below is an example of the code. It ends by saving the file on DBFS (there are still problems integrating the to_excel method with Azure directly), and then I move the file to ADLS; a sketch of that move step follows the code.

import pyspark.pandas as ps

# Authenticate against the storage account with an account key
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", storage_account_key)

reference_path = f'abfss://{source_container_name}@{storage_account_name}.dfs.core.windows.net/{file_name}'

# Read the CSV from ADLS into a pyspark.pandas DataFrame
df = ps.read_csv(reference_path, index=None)

# Write the Excel file locally (this is the step that times out on large files)
df.to_excel(file_name, sheet_name='sheet')
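
The move step looks roughly like this (dest_container_name and the local driver path are placeholders, not names from my actual pipeline):

# Copy the Excel file written on the driver's local disk to the ADLS container.
# A bare relative filename typically lands under file:/databricks/driver/.
dbutils.fs.cp(
    f'file:/databricks/driver/{file_name}',
    f'abfss://{dest_container_name}@{storage_account_name}.dfs.core.windows.net/{file_name}'
)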

pyspark.pandas.DataFrame is the API Databricks recommends for working with DataFrames (it replaces Koalas), but I can't find any solution to my problem other than converting the DataFrame to a plain pandas one (sketched below).
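
For completeness, that pandas fallback would look roughly like the sketch below (it assumes the data fits in driver memory and that openpyxl is available on the cluster):

# Collect the distributed frame onto the driver as a plain pandas DataFrame.
# Warning: this only works if the whole dataset fits in driver memory.
pdf = df.to_pandas()

# Write the Excel file with pandas' own writer (openpyxl backend for .xlsx)
pdf.to_excel('output.xlsx', sheet_name='sheet', index=False)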

Can someone please help me?

Thanks in advance!

UPDATE

Some more information about the whole pipeline.

I have a Data Factory pipeline that reads data from Azure Synapse, processes it, and stores it as CSV files in ADLS. I need Databricks because Data Factory does not have a native Excel sink connector! I know I could use Azure Functions or Kubernetes instead, but I started with Databricks hoping it would be possible...

CodePudding user response:

Hm.. it looks like you are reading and writing the same file: file_name is used both in the input path and as the output name.

Can you change

df.to_excel(file_name, sheet_name='sheet')

to

df.to_excel("anotherfilename.xlsx", sheet_name='sheet')

CodePudding user response:

You should not convert a big Spark DataFrame to pandas because you probably won't be able to allocate that much memory. You can write it as a CSV instead, and the result will open in Excel:

df.to_csv(path=file_name, num_files=1)
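
Since ps.read_csv already reads straight from an abfss:// path in the question, the CSV can also be written straight back to the destination container. A rough sketch (dest_container_name is a placeholder):

# Write a single CSV directly to the destination ADLS container.
# Note: Spark writes a directory containing one part file, not a bare file.
output_path = f'abfss://{dest_container_name}@{storage_account_name}.dfs.core.windows.net/output_csv'
df.to_csv(path=output_path, num_files=1)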