No FileSystem for scheme "az" error when trying to read csv from ADLS Gen2 using PySpark-CodePudding

import pandas as pd
import pyspark.pandas as ps

I am trying to use the pyspark pandas api to compare performance between two similar scripts (one using pandas and one using pyspark through the pandas interface). However, I have trouble importing my data in pyspark from our ADLS Gen 2 storage.

When I run the following code it works as expected:

df_pandas = pd.read_csv(f"az://container/path/to/file.csv",sep=';', dtype=str)

However when I run the same using the pyspark pandas api:

df_spark = ps.read_csv(f"az://container/path/to/file.csv",sep=';', dtype=str)

However, when I run this the following error gets thrown:

Py4JJavaError: An error occurred while calling o1840.load.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "az"

I have looked online and found others with similar problems using AWS but I'm not sure how to solve it for Azure. I tried replacing az with abfs but I then get the error:

An error occurred while calling o1852.load.
: abfs://container/path/to/file.csv has invalid authority.

I'm running these from Azure Synapse notebooks btw.

CodePudding user response：

I reproduce same in environment.I got this output.

Reading csv files from ADLS Gen2.

Code:

import pandas 
df = pandas.read_csv('abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<file_path>', storage_options = {'account_key' : 'account_key_value'})

Output:

enter image description here

For more information refer this link1 and link2.