I'd like to run a simple PySpark program on Kubernetes. This is just a skeleton PySpark program that doesn't do anything yet, but I'm having trouble with the basics. Once I get this working, I will add more. Here is simpleapp.py:
from pyspark.sql import SparkSession
print("simple pyspark app starting.")
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
print("created spark session")
spark.stop()
print("done")
I submit it using spark-submit from a local Spark 3.3.0 distribution:
~/opt/spark/current/bin/spark-submit \
--master k8s://<redacted>:443 \
--deploy-mode cluster \
--name pyspark-test \
--packages "org.apache.hadoop:hadoop-aws:3.3.2" \
--conf spark.executor.instances=1 \
--conf "spark.kubernetes.container.image=apache/spark-py:v3.3.0" \
--conf "spark.kubernetes.file.upload.path=s3a://pyspark-test" \
--conf "spark.hadoop.fs.s3a.access.key=<redacted>" \
--conf "spark.hadoop.fs.s3a.secret.key=<redacted>" \
--conf spark.kubernetes.namespace=pysparktest \
./simpleapp.py
I see that simpleapp.py gets uploaded to S3 and a pod is started on Kubernetes in the pysparktest namespace, but then it errors. When I look at the logs, I see this:
kubectl -n pysparktest logs --tail=200 -lspark-app-name=pyspark-test
id -u
myuid=185
id -g
mygid=0
set +e
getent passwd 185
uidentry=
set -e
'[' -z '' ']'
'[' -w /etc/passwd ']'
echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
'[' -z /usr/local/openjdk-11 ']'
SPARK_CLASSPATH=':/opt/spark/jars/*'
env
grep SPARK_JAVA_OPT_
sort -t_ -k4 -n
sed 's/[^=]*=\(.*\)/\1/g'
readarray -t SPARK_EXECUTOR_JAVA_OPTS
'[' -n '' ']'
'[' -z ']'
'[' -z ']'
'[' -n '' ']'
'[' -z ']'
'[' -z x ']'
SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
case "$1" in
shift 1
CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.124.35.246 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner s3a://pyspark-test/spark-upload-4c10a3df-14c5-46a9-a940-30548a7f586f/simpleapp.py
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-29bd23dc-1c5d-4550-ac2f-47ddbeb45f8e;1.0
confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-29bd23dc-1c5d-4550-ac2f-47ddbeb45f8e-1.0.xml (No such file or directory)
at java.base/java.io.FileOutputStream.open0(Native Method)
at java.base/java.io.FileOutputStream.open(Unknown Source)
at java.base/java.io.FileOutputStream.<init>(Unknown Source)
at java.base/java.io.FileOutputStream.<init>(Unknown Source)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:71)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:63)
at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:553)
at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:183)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:259)
at org.apache.ivy.Ivy.resolve(Ivy.java:522)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1454)
at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
This error doesn't make much sense to me. Ivy is a Java dependency tool, and I'm just trying to submit a super simple PySpark script.
Is there a simpler way that I can run a PySpark app on Kubernetes? Should I build a custom Docker image containing my simpleapp.py rather than using the official image apache/spark-py:v3.3.0?
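For what it's worth, this is roughly the custom-image route I had in mind, in case that turns out to be the better approach. It's an untested sketch: the registry name is a placeholder, and I'm assuming the script can live in the image's /opt/spark/work-dir and then be referenced with a local:// URI instead of being uploaded to S3.
# Build a derived image that bakes the script into the official PySpark image
cat > Dockerfile <<'EOF'
FROM apache/spark-py:v3.3.0
COPY simpleapp.py /opt/spark/work-dir/simpleapp.py
EOF
docker build -t <my-registry>/pyspark-test:v1 .
docker push <my-registry>/pyspark-test:v1
If that works, I assume the submit command would point spark.kubernetes.container.image at <my-registry>/pyspark-test:v1 and use local:///opt/spark/work-dir/simpleapp.py as the application file, so spark.kubernetes.file.upload.path and the S3 upload would no longer be needed.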
CodePudding user response:
I needed to add this configuration setting to the spark-submit command:
--conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp" \
The driver pod runs as a non-root user whose home directory (/opt/spark) is not writable, so when the in-cluster spark-submit resolves --packages, Ivy cannot create its default cache under /opt/spark/.ivy2; pointing ivy.cache.dir and ivy.home at /tmp works around that. This is easy to miss, but it is in the documentation: https://spark.apache.org/docs/latest/running-on-kubernetes.html
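With that option added, the full command from the question looks like this (the redacted values are still placeholders):
~/opt/spark/current/bin/spark-submit \
--master k8s://<redacted>:443 \
--deploy-mode cluster \
--name pyspark-test \
--packages "org.apache.hadoop:hadoop-aws:3.3.2" \
--conf spark.executor.instances=1 \
--conf "spark.kubernetes.container.image=apache/spark-py:v3.3.0" \
--conf "spark.kubernetes.file.upload.path=s3a://pyspark-test" \
--conf "spark.hadoop.fs.s3a.access.key=<redacted>" \
--conf "spark.hadoop.fs.s3a.secret.key=<redacted>" \
--conf spark.kubernetes.namespace=pysparktest \
--conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp" \
./simpleapp.py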