I want to run some data quality tests, and for that I intend to use PyDeequ in a Databricks notebook. Keep in mind that I'm very new to Databricks and Spark.
First, I created a cluster with the Runtime version "10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)" and added the environment variable SPARK_VERSION=3.2, as described in the project's GitHub repository.
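Whether the variable is actually visible to the notebook's Python process can be checked with something like the following. This is only a minimal sketch: `read_spark_version` is a hypothetical helper name of my own, not part of PyDeequ or Databricks.

```python
import os


def read_spark_version(env=os.environ):
    """Return the SPARK_VERSION value, or None if it is unset or blank.

    Hypothetical helper for illustration only; `env` is any mapping,
    which makes the function easy to try with a plain dict.
    """
    value = env.get("SPARK_VERSION", "").strip()
    return value or None


# Prints the configured value, or None if the variable did not reach
# the Python process.
print(read_spark_version())
```

If this prints None, the variable was not picked up by the Python process that runs the notebook.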
Since the available PyPI package is not up to date, I tried installing the package as a notebook-scoped library with the following commands:
%pip install numpy==1.22
%pip install git+https://github.com/awslabs/python-deequ.git
(The first line is only to prevent a conflict on the numpy versions.)
Then, when doing
import pydeequ
I get
IndexError Traceback (most recent call last)
<command-3386600260354339> in <module>
----> 1 import pydeequ
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
165 # Import the desired module. If you’re seeing this while debugging a failed import,
166 # look at preceding stack frames for relevant error information.
--> 167 original_result = python_builtin_import(name, globals, locals, fromlist, level)
168
169 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/__init__.py in <module>
19 from pydeequ.analyzers import AnalysisRunner
20 from pydeequ.checks import Check, CheckLevel
---> 21 from pydeequ.configs import DEEQU_MAVEN_COORD
22 from pydeequ.profiles import ColumnProfilerRunner
23
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
165 # Import the desired module. If you’re seeing this while debugging a failed import,
166 # look at preceding stack frames for relevant error information.
--> 167 original_result = python_builtin_import(name, globals, locals, fromlist, level)
168
169 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in <module>
35
36
---> 37 DEEQU_MAVEN_COORD = _get_deequ_maven_config()
38 IS_DEEQU_V1 = re.search("com\.amazon\.deequ\:deequ\:1.*", DEEQU_MAVEN_COORD) is not None
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_deequ_maven_config()
26
27 def _get_deequ_maven_config():
---> 28 spark_version = _get_spark_version()
29 try:
30 return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5ccb9322-9b7e-4caf-b370-843c10304472/lib/python3.8/site-packages/pydeequ/configs.py in _get_spark_version()
21 ]
22 output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
---> 23 spark_version = output.stdout.decode().split("\n")[-2]
24 return spark_version
25
IndexError: list index out of range
Can you please help me find the reason for this, or an alternative way to install the library without PyPI?
Thanks in advance!
CodePudding user response:
I had assumed I wouldn't need to add the Deequ library itself. It turns out that all I had to do was add it via its Maven coordinates, and that solved the problem.
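As for the traceback itself: the expression on line 23 of pydeequ/configs.py, `output.stdout.decode().split("\n")[-2]`, raises this IndexError whenever the subprocess output contains no newline at all, for example when stdout is empty. Here is a minimal sketch of that behaviour. The byte strings are made up for illustration (the real subprocess output is not shown in the question), and `second_to_last_line` is a hypothetical name wrapping the same expression.

```python
def second_to_last_line(stdout_bytes):
    # Same expression as line 23 of pydeequ/configs.py in the traceback,
    # wrapped in a hypothetical helper so it can be tried in isolation.
    return stdout_bytes.decode().split("\n")[-2]


# Made-up output that ends with a newline: the version is the last
# non-empty line, so index -2 picks it up.
print(second_to_last_line(b"some banner\n3.2.1\n"))  # 3.2.1

# Empty output: split("\n") gives [""], a one-element list, so index -2
# does not exist.
try:
    second_to_last_line(b"")
except IndexError as exc:
    print("IndexError:", exc)  # IndexError: list index out of range
```

So the error means the version-detection command produced no usable output, not that the import statement was wrong.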