How to include external python modules with pyspark


I'm new to Python and I'm trying to launch my PySpark project on Spark on AWS EMR. The project is stored on AWS S3 and contains several Python files, laid out like this:

/folder1
 - main.py
/utils
 - utils1.py
 - utils2.py

I use the following command:

spark-submit --py-files s3://bucket/utils s3://bucket/folder1/main.py

But I get the error:

Traceback (most recent call last):
  File "/mnt/tmp/spark-1e38eb59-3ddd-4deb-8529-eace7465b6ce/main.py", line 15, in <module>
    from utils.utils1 import foo
ModuleNotFoundError: No module named 'utils'

What do I have to fix in my command? I know that I can pack my project into a zip file, but for now I need to do it without packing; still, I'd be grateful if you could show me both solutions.

UPD:

The EMR cluster's controller log shows that the launch command looks like this:

hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.1 --driver-memory 100G --conf spark.driver.maxResultSize=100G --conf spark.hadoop.fs.s3.maxRetries=20 --conf spark.sql.sources.partitionOverwriteMode=dynamic --py-files s3://bucket/dir1/dir2/utils.zip --master yarn s3://bucket/dir1/dir2/dir3/main.py --args

But now I get the following error:

java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/cluster-id/dir1/dir2/utils.zip does not exist

What's wrong?

CodePudding user response:

Although it's not recommended (see the second part of this answer for a better option), if you do not want to zip your files, you can pass the individual utils files to --py-files in comma-separated form, instead of passing the utils folder, before the actual file:

'Args': ['spark-submit',
                '--py-files',
                '{your_s3_path_here}/utils/utils1.py,{your_s3_path_here}/utils/utils2.py',
                'main.py']
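Equivalently, as a plain spark-submit command (the bucket paths here follow the layout from the question and are assumptions):

spark-submit --py-files s3://bucket/utils/utils1.py,s3://bucket/utils/utils2.py s3://bucket/folder1/main.py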

Better option: zip the utils folder

You can zip utils and include it like this:

To do so, create an empty __init__.py file at the root level of utils (i.e. utils/__init__.py).

From the directory containing utils, make a zip of it (for example utils.zip).
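For example, from the parent directory of utils (the S3 destination matches the path in your controller log; adjust as needed):

zip -r utils.zip utils
aws s3 cp utils.zip s3://bucket/dir1/dir2/utils.zip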

For submission, you can add this zip as

'Args': ['spark-submit',
                '--py-files',
                '{your_s3_path_here}/utils.zip',
                'main.py']

This assumes you have __init__.py, utils1.py, and utils2.py inside utils.zip.
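For context, an 'Args' fragment like the one above usually sits inside an EMR step definition submitted through boto3. Here is a minimal sketch; the cluster id, step name, and S3 paths are placeholders:

import boto3

emr = boto3.client('emr')  # assumes AWS credentials and region are configured

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXX',  # placeholder cluster id
    Steps=[{
        'Name': 'run main.py with zipped utils',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--py-files',
                     's3://bucket/dir1/dir2/utils.zip',
                     's3://bucket/dir1/dir2/dir3/main.py'],
        },
    }],
)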

Note: You might also need to add this zip to the SparkContext with sc.addPyFile("utils.zip") before the following imports.
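For instance, near the top of main.py (a sketch; the S3 path is taken from your controller log and is otherwise an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("main").getOrCreate()
# addPyFile accepts local paths as well as HDFS/S3 URIs on EMR
spark.sparkContext.addPyFile("s3://bucket/dir1/dir2/utils.zip")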

You can now import them as:

from utils.utils1 import *
from utils.utils2 import *