Reading Excel files as Spark Dataframe from ADLS storage

We tried reading the Excel files in the following ways (a representative call is sketched after the list):

  • spark.read.format("com.databricks.spark.excel")
  • spark.read.format("com.crealytics.spark.excel")
  • spark.read.format("excel")

Received an error stating: "Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key"

(error screenshot)

Note:

  1. The Azure credentials were working fine when reading CSV files from the same ADLS account (a typical account-key configuration is sketched after these notes).
  2. We have installed the required libraries on the Databricks cluster: com.crealytics:spark-excel_2.12:3.1.2_0.16.5-pre0 and openpyxl.
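For reference, the account-key setting that the error refers to (fs.azure.account.key) is normally set on the Spark session roughly as follows; the storage account name, secret scope and key name are placeholders:

# Illustrative only: account-key authentication for ABFS at the session level.
# <storage_account>, <scope> and <key-name> are placeholders.
spark.conf.set(
    "fs.azure.account.key.<storage_account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<key-name>"))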

CodePudding user response:

You can use pandas to read the .xlsx file and then convert it to a Spark DataFrame.


from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("Test").getOrCreate()

# Read the sheet with pandas (openpyxl handles .xlsx files).
# Note: inferSchema is a Spark reader option, not a pandas argument, so it is omitted here.
pdf = pd.read_excel('excelfile.xlsx', sheet_name='sheetname', engine='openpyxl')

# Convert the pandas DataFrame to a Spark DataFrame
df = spark.createDataFrame(pdf)

df.show()
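Note that pandas does not authenticate against ADLS the way Spark does, so on Databricks the workbook is usually read from a DBFS path, for example through a mount (see the next answer); the mount point below is a placeholder:

# Assumes the container has already been mounted under /mnt/<mount-name>
pdf = pd.read_excel("/dbfs/mnt/<mount-name>/path/to/excelfile.xlsx",
                    sheet_name="sheetname", engine="openpyxl")
df = spark.createDataFrame(pdf)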

CodePudding user response:

Try this to access your ADLS storage:


# Service principal (OAuth) configuration for ABFS.
# client_id, client_secret and tenant_id must be defined beforehand,
# e.g. read from a secret scope with dbutils.secrets.get().
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": f"{client_id}",
           "fs.azure.account.oauth2.client.secret": f"{client_secret}",
           "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"}


# Mount the given container under /mnt/<storage_account_name>/<container_name>.
# storage_account_name must also be defined beforehand.
def mount_adls(container_name):
    dbutils.fs.mount(
      source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
      mount_point = f"/mnt/{storage_account_name}/{container_name}",
      extra_configs = configs)
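A minimal sketch of how the mount can then be used to read the workbook; the container name, file path and option values are placeholders, and the spark-excel library is assumed to be installed as noted in the question:

# Illustrative usage: mount once, then read the Excel file from the mount point.
mount_adls("<container>")

df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .load(f"/mnt/{storage_account_name}/<container>/path/to/file.xlsx"))
df.show()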
