Error importing PyDeequ package on databricks

Time:12-28

I want to run some data quality tests, and for that I intend to use PyDeequ in a Databricks notebook. Keep in mind that I'm very new to Databricks and Spark.

First, I created a cluster with the runtime version "10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)" and added the environment variable SPARK_VERSION=3.2, as described in the project's GitHub repository.
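To confirm that the variable is actually visible to the notebook's Python process, a check along these lines may help. This is only a sketch: the helper name spark_minor_version is hypothetical and is not part of PyDeequ.

```python
import os


def spark_minor_version(env=os.environ):
    """Return 'major.minor' from SPARK_VERSION, or None if it is unset.

    Hypothetical helper for illustration; not a PyDeequ function.
    """
    value = env.get("SPARK_VERSION")
    if not value:
        return None
    # Keep only the first two components, e.g. "3.2.1" -> "3.2".
    return ".".join(value.split(".")[:2])


# In a notebook cell this shows what the Python process sees.
print(spark_minor_version())
```

If this prints None, the environment variable has not reached the Python process.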

Since the available PyPI package is not up to date, I tried installing the package as a notebook-scoped library with the following commands:

%pip install numpy==1.22
%pip install git+https://github.com/awslabs/python-deequ.git

(The first line is only there to prevent a conflict between numpy versions.)

Then, when running import pydeequ, I get:

IndexError                                Traceback (most recent call last)
<command-3386600260354339> in <module>
----> 1 import pydeequ

/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    165             # Import the desired module. If you’re seeing this while debugging a failed import,
    166             # look at preceding stack frames for relevant error information.
--> 167             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    168 
    169             is_root_import = thread_local._nest_level == 1

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/__init__.py in <module>
     19 from pydeequ.analyzers import AnalysisRunner
     20 from pydeequ.checks import Check, CheckLevel
---> 21 from pydeequ.configs import DEEQU_MAVEN_COORD
     22 from pydeequ.profiles import ColumnProfilerRunner
     23 

/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    165             # Import the desired module. If you’re seeing this while debugging a failed import,
    166             # look at preceding stack frames for relevant error information.
--> 167             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    168 
    169             is_root_import = thread_local._nest_level == 1

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in <module>
     35 
     36 
---> 37 DEEQU_MAVEN_COORD = _get_deequ_maven_config()
     38 IS_DEEQU_V1 = re.search("com\.amazon\.deequ\:deequ\:1.*", DEEQU_MAVEN_COORD) is not None

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_deequ_maven_config()
     26 
     27 def _get_deequ_maven_config():
---> 28     spark_version = _get_spark_version()
     29     try:
     30         return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_spark_version()
     21     ]
     22     output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
---> 23     spark_version = output.stdout.decode().split("\n")[-2]
     24     return spark_version
     25 

IndexError: list index out of range

Can you please help me find the reason for this, or suggest an alternative way to get the library without using PyPI?

Thanks in advance!

CodePudding user response:

I had assumed I wouldn't need to add the Deequ library itself. Apparently, all I had to do was add it via its Maven coordinates, and that solved the problem.
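For reference, the traceback ends at the expression output.stdout.decode().split("\n")[-2] in pydeequ/configs.py. The sketch below reproduces only that indexing step, on the assumption that the subprocess wrote nothing to stdout (the traceback does not show the command or its output). The function names are hypothetical, and the "safe" variant is an illustration, not PyDeequ's code.

```python
def second_to_last_line(stdout: bytes) -> str:
    # Same expression as line 23 of configs.py in the traceback above.
    return stdout.decode().split("\n")[-2]


def safe_second_to_last_line(stdout: bytes, default: str = "") -> str:
    # Hypothetical defensive variant: fall back to a default instead of
    # raising when there are fewer than two elements after splitting.
    lines = stdout.decode().split("\n")
    return lines[-2] if len(lines) >= 2 else default


# Output ending in a newline splits into ["version 3.2.1", ""],
# so index -2 is the last real line.
print(second_to_last_line(b"version 3.2.1\n"))

# Empty output splits into [""], which has no index -2.
try:
    second_to_last_line(b"")
except IndexError as exc:
    print("IndexError:", exc)
```

So an IndexError at that line suggests the version-detection command produced fewer than two lines of output in this environment.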
