Pyspark running in docker container cannot write file

Time:01-24

I have a Docker container running PySpark, Hadoop, and all the required dependencies. I am using spark-submit to query data in MinIO, and I want to write the output DataFrame to a file. Reading works, but writing does not. If I run Python in that container and create a file at the same path, it works. Am I missing some Spark configuration?

This is the error I get:


File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save 
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__ 
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.save
: java.net.ConnectException: Call From 10d3463d04ce/10.0.1.132 to localhost:9000 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Relevant code:

from pyspark.sql import SparkSession

# AWS_SECRET_ACCESS_KEY, AWS_S3_ENDPOINT, query and output_path are defined elsewhere
spark = SparkSession.builder.getOrCreate()
spark_context = spark.sparkContext

# Configure the S3A connector to talk to MinIO
hadoop_conf = spark_context._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', 'minio')
hadoop_conf.set('fs.s3a.secret.key', AWS_SECRET_ACCESS_KEY)
hadoop_conf.set('fs.s3a.path.style.access', 'true')
hadoop_conf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoop_conf.set('fs.s3a.endpoint', AWS_S3_ENDPOINT)
hadoop_conf.set('fs.s3a.connection.ssl.enabled', 'false')

df = spark.sql(query)
df.show()  # this works perfectly fine
df.coalesce(1).write.format('json').save(output_path)  # here I get the error

CodePudding user response:

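The solution was to prepend file:// to output_path. A path without a scheme is resolved against Hadoop's default filesystem (fs.defaultFS), which, judging by the error, points to HDFS at localhost:9000, where nothing is listening inside the container. The file:// prefix makes Spark write to the local filesystem instead.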
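A minimal sketch of that fix, assuming a POSIX path inside the container. The helper name to_local_uri and the /tmp/output path are hypothetical and only for illustration; they are not part of PySpark or the original question.

```python
import os
from urllib.parse import urlparse


def to_local_uri(path):
    """Return path with an explicit file:// scheme unless it already has a scheme."""
    if urlparse(path).scheme:
        # Already qualified (e.g. s3a://..., hdfs://..., file://...), leave untouched
        return path
    # Make the path absolute, then prefix it so Hadoop uses the local filesystem
    return "file://" + os.path.abspath(path)


# Hypothetical output location, used only for illustration
output_path = to_local_uri("/tmp/output")

# With the DataFrame from the question, the write would then be:
# df.coalesce(1).write.format('json').save(output_path)
```

Paths that already carry a scheme pass through unchanged, so the same helper can be used whether the job writes locally or to s3a://.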
