I have a Docker container running PySpark, Hadoop and all the required dependencies. I am using spark-submit to query MinIO, and I want to write the output DataFrame to a file. Reading the file works, but writing does not. If I execute Python in that container and try to create a file at the same path, it works. Am I missing some Spark configuration?
This is the error I get:
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.save
: java.net.ConnectException: Call From 10d3463d04ce/10.0.1.132 to localhost:9000 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Relevant code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# MinIO (S3-compatible) connection settings
hadoop_conf.set('fs.s3a.access.key', 'minio')
hadoop_conf.set('fs.s3a.secret.key', AWS_SECRET_ACCESS_KEY)
hadoop_conf.set('fs.s3a.path.style.access', 'true')
hadoop_conf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoop_conf.set('fs.s3a.endpoint', AWS_S3_ENDPOINT)
hadoop_conf.set('fs.s3a.connection.ssl.enabled', 'false')

df = spark.sql(query)
df.show()  # this works perfectly fine
df.coalesce(1).write.format('json').save(output_path)  # here I get the error
CodePudding user response:
The solution was to prepend file:// to output_path. Without a scheme, Spark resolves the path against the default Hadoop filesystem (fs.defaultFS, which in this setup points at localhost:9000), hence the connection refused error; the file:// prefix forces the write to the container's local filesystem instead.
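
A minimal sketch of the fix, keeping the variable names from the question and assuming output_path is a plain local path inside the container:

# Prepending the file:// scheme makes Spark write to the container's local
# filesystem instead of resolving the path against fs.defaultFS.
df.coalesce(1).write.format('json').save('file://' + output_path)

If the intent were to keep the output on MinIO instead, an explicit s3a:// path would route the write through the S3A filesystem configured in the question.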