I have a private Azure Storage account and, using PySpark locally, I would like to read a blob. Here is the setup:
access_key = <storage-account-access-key>
spark = SparkSession.builder.master('local').appName('app').getOrCreate()
spark.conf.set("fs.azure.account.key.<storage-account-name>.blob.core.windows.net", access_key)
sc = spark.sparkContext
sc._conf.setAll([("fs.azure.account.key.<storage-account-name>.blob.core.windows.net", access_key)])
csv_raw = sc.textFile('wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/dir')
print(csv_raw.collect())
Why am I using spark.sparkContext.textFile() rather than spark.read.load()? I need to read the data in as an RDD to do some cleaning/parsing before converting it into a dataframe with a schema. The odd thing is that I can read the data in as a dataframe using spark.read.load(), so the setup in the SparkSession is correct. As shown in the code above, I also manually set the config on the sparkContext to make sure it had this parameter before calling textFile(). However, I still get an authentication error when using spark.sparkContext.textFile():
"org.apache.hadoop.fs.azure.AzureException: No credentials found for account ... in the configuration, and its container ... is not accessible using anonymous credentials."
Please assume that all jar files (hadoop-azure-3.3.0.jar, azure-storage-8.6.5.jar) are loaded correctly with spark-submit, and note that I am using Spark version 3.1.1.
Thank you, in advance!
CodePudding user response:
For the RDD API you need to provide the Hadoop configuration; what you're setting now is used only for the DataFrame/Dataset API (see the Databricks docs as a reference).
So instead of fs.azure.account.key.<storage-account-name>.blob.core.windows.net you need to add the spark.hadoop prefix to it: spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net
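A minimal sketch of how that could look, reusing the placeholder names from the question (either set the prefixed property when building the session so it gets copied into the Hadoop configuration, or write the key directly into the Hadoop configuration of an existing context):
from pyspark.sql import SparkSession

access_key = "<storage-account-access-key>"

# Option 1: pass the key with the spark.hadoop prefix at session build time,
# so it is copied into the Hadoop configuration that the RDD API reads.
spark = (
    SparkSession.builder
    .master("local")
    .appName("app")
    .config("spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net", access_key)
    .getOrCreate()
)

# Option 2: set the key on the Hadoop configuration of an already-running context.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    access_key,
)

csv_raw = spark.sparkContext.textFile(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/dir"
)
print(csv_raw.collect())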