Adding jars to the great_expectations' spark session

Time:07-08

Setup:

  • My data is on Azure ADLS Gen2
  • I want to use the great_expectations package to test my data quality.
  • I am using the InferredAssetAzureDataConnector data_connector to create my data source (this works, I can see my files on the ADLS during creation).
  • I'm trying to create a test suite with the auto-profiler going through the data.

Specifically, I am wondering how to add jars to the config of the spark session that great_expectations uses when it runs the auto-profiler to create a test suite.

The process fails because I need to add the org.apache.hadoop:hadoop-azure:3.3.1 jar to the spark session in order for the spark job to be able to access & profile the data on ADLS.

Any help on how to do this in the context of the great_expectations package is appreciated.

The error message:


Great Expectations will create a notebook, containing code cells that select from 
available columns in your dataset and generate expectations about them to demonstrate 
some examples of assertions you can make about your data.

When you run this notebook, Great Expectations will store these 
expectations in a new Expectation Suite "adls_test_suite_tmp" here:

  file://C:\Coding\...\great_expectations\expectations/adls_suite_tmp.json

Would you like to proceed? [Y/n]: Y

WARN FileStreamSink: Assume no metadata directory. 
    Error while looking for metadata directory in the path: 
    wasbs://<adls-container>@<adls-account>.blob.core.windows.net/test/myfile.csv

java.lang.RuntimeException: java.lang.ClassNotFoundException: 
    Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found

Answer:

I semi-solved it by adding the packages to the spark-defaults.conf file, but I'm really unhappy with this dirty workaround, because every spark job started on the system will now pull in these jar packages. If anyone has a better solution, please share.

spark.jars.packages                 com.microsoft.azure:azure-storage:8.6.6,org.apache.hadoop:hadoop-azure:3.3.1
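
A possible per-datasource alternative, sketched under assumptions: in the V3 (Batch Request) API, the SparkDFExecutionEngine section of a datasource config can, as far as I know, take a spark_config mapping that is passed on to the spark session. If that holds for your version, the same Maven coordinates could live in the datasource instead of in spark-defaults.conf. The datasource name "adls_datasource" and the helper function below are hypothetical, and the data_connectors section is left out on purpose; merge the execution_engine part into your existing config.

```python
# Maven coordinates copied from the spark-defaults.conf line above.
AZURE_PACKAGES = [
    "com.microsoft.azure:azure-storage:8.6.6",
    "org.apache.hadoop:hadoop-azure:3.3.1",
]


def build_datasource_config(name, packages):
    """Build a datasource config dict whose Spark engine carries its own jar list.

    Assumption: SparkDFExecutionEngine accepts a `spark_config` mapping.
    The data_connectors section is omitted here; keep the one you already have.
    """
    return {
        "name": name,  # hypothetical datasource name
        "class_name": "Datasource",
        "execution_engine": {
            "class_name": "SparkDFExecutionEngine",
            # spark.jars.packages expects a comma-separated list of coordinates.
            "spark_config": {"spark.jars.packages": ",".join(packages)},
        },
    }


config = build_datasource_config("adls_datasource", AZURE_PACKAGES)
print(config["execution_engine"]["spark_config"]["spark.jars.packages"])
```

Note that spark.jars.packages is resolved when the spark session starts, so this can only work if no session is already running in the same process when great_expectations creates its own.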
