I'm new in python and trying to launch my pyspark project on spark on AWS EMR. The project is disposed on AWS S3 and has several python files, like this:
/folder1
- main.py
/utils
- utils1.py
- utils2.py
I use the following command:
spark-submit --py-files s3://bucket/utils s3://bucket/folder1/main.py
But I get the error:
Traceback (most recent call last):
File "/mnt/tmp/spark-1e38eb59-3ddd-4deb-8529-eace7465b6ce/main.py", line 15, in <module>
from utils.utils1 import foo
ModuleNotFoundError: No module named 'utils'
What I have to fix in my command? I know that I can pack my project in zip file, but now I need to do it without packing, however I'll be grateful if you tell me both solutions.
UPD:
EMR cluster's controller log says, that launching command looks like this:
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.1 --driver-memory 100G --conf spark.driver.maxResultSize=100G --conf spark.hadoop.fs.s3.maxRetries=20 --conf spark.sql.sources.partitionOverwriteMode=dynamic --py-files s3://bucket/dir1/dir2/utils.zip --master yarn s3://bucket/dir1/dir2/dir3/main.py --args
But now I have the following error:
java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/cluster-id/dir1/dir2/utils.zip does not exist
What's wrong?
CodePudding user response:
Although not recommended (see the complete answer for a better option), but if you do not want to zip files. Instead of providing utils
folder, you can provide individual utils-* files with py-files with comma-separated syntax before the actual file as
'Args': ['spark-submit',
'--py-files',
'{your_s3_path_here}/utils/utils1.py,{your_s3_path_here}/utils/utils1.py',
'main.py']
}
Better to zip utils folder
You can zip
utils and include like this
To do so, make empty __init__.py
file at root level in utils
like utils/__init__.py )
From outside this directory, make a zip of it (for example utils.zip
)
For submission, you can add this zip as
'Args': ['spark-submit',
'--py-files',
'{your_s3_path_here}/utils.zip',
'main.py'
}
Considering your have __init__.py
, utils1.py
, utils2.py
in utils.zip
Note: You might also need to add this zip to sc with sc.addPyFile("utils.zip")
before following imports
You can now use them as
from utils.utils1 import *
from utils.utils2 import *