Reading from ADLS from a Synapse Notebook with account key authentication and the ABFS driver


I am trying to read a file from ADLS Gen2 in Synapse and want to authenticate with the account key.

According to the docs, the following should work, but it doesn't in Synapse:

spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)

I want to use the ABFS driver as the docs suggest:

Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net.

What does not work:

  • When I use PySpark with ABFS and execute in a Synapse Notebook, I get a 403 error: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission."

What works:

  • When I use PySpark with WASBS and execute in a Synapse Notebook, it works.
  • When I use PySpark with ABFS and execute locally from PyCharm, it works.
  • When I use Python/DataLakeServiceClient in Synapse, it works.
  • When I use Python/DataLakeServiceClient locally from PyCharm, it works.

It is definitely not a problem of missing permissions but a problem with Synapse. Am I missing some configuration? Any help is appreciated. I'd rather not use the WASB API, since (according to this post) ABFS should be used for ADLS Gen2.

Each code sample uses the following variables:

adls_account_key = "<myaccountkey>"
adls_container_name = "<mycontainername>"
adls_account_name = "<myaccountname>"
filepath = "/Data/Contacts"

Synapse PySpark ABFS code: (crashes)

spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)    
base_path = f"abfs://{adls_container_name}@{adls_account_name}.dfs.core.windows.net"
df = spark.read.parquet(base_path   filepath)
df.show(10, False)

Synapse PySpark WASBS code: (works)

spark.conf.set(f"fs.azure.account.key.{adls_account_name}.blob.core.windows.net", adls_account_key)    
base_path = f"wasbs://{adls_container_name}@{adls_account_name}.blob.core.windows.net"
df = spark.read.parquet(base_path   filepath)
df.show(10, False)

Python/DataLakeServiceClient code (identical in Synapse and locally): (works)

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url=f"https://{adls_account_name}.dfs.core.windows.net",
    credential=adls_account_key,
)
file_client = service_client.get_file_client(
    file_system=adls_container_name, file_path=filepath
)
file_content = file_client.download_file().readall()

Local PySpark ABFS code (includes building a Spark session, but otherwise exactly the same code): (works)

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1') \
    .getOrCreate()

spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)

base_path = f"abfs://{adls_container_name}@{adls_account_name}.dfs.core.windows.net"    
df = spark.read.parquet(base_path   filepath)
df.show(10, False)

CodePudding user response:

This post was helpful.

Apparently Synapse restricts the ABFS driver so that it cannot authenticate with the account key. Instead, Synapse only allows authentication via linked services or a service principal. This explains why I can access ADLS with ABFS from my local PyCharm, but not from within a Synapse Notebook.
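
For anyone who needs the service-principal route, here is a minimal sketch using the standard hadoop-azure OAuth configs; tenant_id, client_id, and client_secret are hypothetical placeholders (e.g. pulled from Key Vault), not values from this post:

# Minimal sketch of the service-principal (OAuth) route for ABFS.
# tenant_id, client_id, and client_secret are hypothetical placeholders.
spark.conf.set(f"fs.azure.account.auth.type.{adls_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{adls_account_name}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{adls_account_name}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{adls_account_name}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{adls_account_name}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")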

This basically locks any code utilizing ABFS into the Synapse infrastructure: it can no longer be executed locally unless you write the authentication twice (once locally with the account key, once via Synapse's required service-principal route); a possible workaround is sketched below.
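
As a hypothetical way to keep a single code path, a small wrapper could choose the auth route at runtime. This is only a sketch, assuming you set an environment flag yourself; RUNNING_IN_SYNAPSE is not a built-in marker:

import os

def configure_adls_auth(spark, account_name, account_key, synapse_auth):
    # Hypothetical helper: pick the auth route based on where the code runs.
    # RUNNING_IN_SYNAPSE is a flag you would set yourself in the Synapse
    # environment; it is not something Synapse provides.
    if os.environ.get("RUNNING_IN_SYNAPSE"):
        # Inside Synapse: apply the service-principal configs sketched above.
        synapse_auth(spark, account_name)
    else:
        # Locally: plain account-key authentication works fine.
        spark.conf.set(
            f"fs.azure.account.key.{account_name}.dfs.core.windows.net",
            account_key,
        )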

Needless to say, this is beyond stupid. I think I will use WASB instead, at least until Microsoft decides to let people use the account key for authentication in Synapse.

CodePudding user response:

You are receiving this error due to a lack of permissions. When creating the Synapse workspace, the portal notes that additional user access roles need to be assigned. You must be assigned the Storage Blob Data Contributor role on the storage account in order to access ADLS from the workspace.


Here are the steps to grant permissions to the managed identity in a Synapse workspace.
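
For illustration only, a role assignment of this kind can be created with the Azure CLI; every value below is a placeholder:

az role assignment create \
    --assignee "<user-or-managed-identity-object-id>" \
    --role "Storage Blob Data Contributor" \
    --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<myaccountname>"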
