I'm trying to learn Spark, Databricks & Azure.
I'm trying to access ADLS Gen2 from Databricks using PySpark. I believe it should be super simple, but I can't find the proper way, and each attempt fails with the following error:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I already have ADLS Gen2 up and running, and a SAS URI to access it.
What I have tried so far (based on this link: https://docs.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then, to read the data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files}/")
CodePudding user response:
Your configuration is incorrect. The first parameter should be set to just the value SAS, and the second to the name of a Scala/Java class that will return the SAS token - you can't put a URI with the SAS information in it there directly; you need a token provider implementation.
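A minimal sketch of that setup, assuming a recent Databricks runtime where Hadoop's built-in FixedSASTokenProvider is available (on older runtimes you would have to implement your own SASTokenProvider class); note that SAS_TOKEN here is only the query-string part of your SAS URI, not the full URL:

# Hypothetical names; SAS_TOKEN is the "sv=...&sig=..." query string of the SAS URI
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", "SAS")
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(f"fs.azure.sas.fixed.token.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", SAS_TOKEN)

# Read through the ABFS driver: abfss scheme against the dfs endpoint
sd_xxx = spark.read.parquet(f"abfss://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files}/")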
If you want to use wasbs, that is the protocol for accessing Azure Blob Storage. Although it can be used to access ADLS Gen2 (not recommended, though), you need to use blob.core.windows.net instead of dfs.core.windows.net, and also set the correct Spark property for Azure Blob access.
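For completeness, a minimal sketch of that legacy WASB route, assuming the same hypothetical names as above; the per-container fs.azure.sas key is the property the WASB driver reads the token from:

# Legacy Blob Storage access with a SAS token (not recommended for ADLS Gen2)
spark.conf.set(f"fs.azure.sas.{CONTAINER_NAME}.{STORAGE_ACCOUNT_NAME}.blob.core.windows.net", SAS_TOKEN)

# Note: wasbs scheme with the blob endpoint, not the dfs endpoint
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.blob.core.windows.net/{proper_path_to_files}/")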
CodePudding user response:
The more common procedure to follow is described here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
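A minimal sketch of that service-principal flow, assuming you have registered an Azure AD application and stored its credentials in a Databricks secret scope (the scope and key names below are hypothetical):

# Hypothetical secret scope and key names
client_id = dbutils.secrets.get(scope="my-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="my-scope", key="sp-tenant-id")

# OAuth 2.0 client-credentials configuration for the ABFS driver
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

sd_xxx = spark.read.parquet(f"abfss://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files}/")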