I have a private Azure Storage account and, using PySpark locally, I would like to read a blob. Here is the setup:
access_key = <storage-account-access-key>
spark = SparkSession.builder.master('local').appName('app').getOrCreate()
spark.conf.set("fs.azure.account.key.<storage-account-name>.blob.core.windows.net", access_key)
sc = spark.sparkContext
sc._conf.setAll([("fs.azure.account.key.<storage-account-name>.blob.core.windows.net", access_key)])
csv_raw = sc.textFile('wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/dir')
print(csv_raw.collect())
Why am I using spark.sparkContext.textFile() rather than spark.read.load()? I need to read the data in as an RDD to do some cleaning/parsing before converting it into a dataframe with a schema. The odd thing is that I can read the data in as a dataframe using spark.read.load(), so the setup in the SparkSession is correct. As shown in the code above, I also manually set the config on the sparkContext to make sure it had this parameter before calling textFile(). However, I still get an authentication error when using spark.sparkContext.textFile():
"org.apache.hadoop.fs.azure.AzureException: No credentials found for account ... in the configuration, and its container ... is not accessible using anonymous credentials."
Please assume that all jar files (hadoop-azure-3.3.0.jar, azure-storage-8.6.5.jar) are loaded correctly with spark-submit, and note that I am using Spark version 3.1.1.
Thank you, in advance!
CodePudding user response:
For the RDD API you need to provide the Hadoop configuration; what you're setting now is used only for the DataFrame/Dataset API (see the Databricks docs as a reference).
So instead of fs.azure.account.key.<storage-account-name>.blob.core.windows.net you need to add the spark.hadoop prefix to it: spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net
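A minimal sketch of how that could look, reusing the placeholder names from the question (either set the prefixed property when building the session so it gets copied into the Hadoop configuration, or write the key directly into the Hadoop configuration of an existing context):
from pyspark.sql import SparkSession

access_key = "<storage-account-access-key>"

# Option 1: pass the key with the spark.hadoop prefix at session build time,
# so it is copied into the Hadoop configuration that the RDD API reads.
spark = (
    SparkSession.builder
    .master("local")
    .appName("app")
    .config("spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net", access_key)
    .getOrCreate()
)

# Option 2: set the key on the Hadoop configuration of an already-running context.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    access_key,
)

csv_raw = spark.sparkContext.textFile(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/dir"
)
print(csv_raw.collect())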