I have problems reading files into data frames when running Spark on Docker.
Here's my docker-compose.yml:
version: '2'

services:
  spark:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
It's the basic definition file provided with the Bitnami Spark Docker image, with port 7077 added.
When I run this simple script, which doesn't read anything from the disk, it works:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.master("spark://localhost:7077").appName("test").getOrCreate()
    d = [
        [1, 1],
        [2, 2],
        [3, 3],
    ]
    df = spark.createDataFrame(d)
    df.show()
    spark.stop()

if __name__ == "__main__":
    main()
Output is as expected:
+---+---+
| _1| _2|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
+---+---+
From this I assume that the issue is not with the Spark cluster. However, when I try to read files from the local drive, it doesn't work:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.master("spark://localhost:7077").appName("test").getOrCreate()
    employees = spark.read.csv('./data/employees.csv', header=True)
    salaries = spark.read.csv('./data/salaries.csv', header=True)
    employees.show()
    salaries.show()
    spark.stop()

if __name__ == "__main__":
    main()
I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o27.csv. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (192.168.112.2 executor 0): java.io.FileNotFoundException: File file:/Users/UserName/Projects/spark/test/data/employees.csv does not exist
The file is there. When I run the script with the local PySpark library, defining the Spark session like this: spark = SparkSession.builder.appName("test").getOrCreate(), it works.
Should I somehow add the data directory as a volume to the container? I've tried that as well, but I haven't gotten it to work.
Any advice?
CodePudding user response:
It looks like you're starting up some Docker containers with docker-compose but not mounting any volumes. It makes sense that Spark does not find those files in that case, since they do not exist within the containers.
Imagine your container is a different physical machine from the one you're running your Spark script on. How would it be able to find those files? Well, you could, for example, put a USB stick with the necessary data into that other computer.
For your containers to be able to access these files, you'll need to mount a volume on them. Loosely speaking, this is a bit like putting a USB stick into that other machine.
You can do that by using the volumes keyword in your docker-compose.yml:
version: '2'

services:
  spark:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./:/mounted-data
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ./:/mounted-data
Notice the ./:/mounted-data bit. The pattern is path-on-your-machine:path-on-container. So this will mount your local . path (where your data is located) to /mounted-data within your containers. Note that I added this to both your spark and spark-worker services; I'm not familiar with the Bitnami setup, so it might be enough to only add that volume to the spark-worker service.
Now that the data is available on the container, you just need to point to it properly in your code. You should be able to read the data like so within your larger Spark script:
employees = spark.read.csv('/mounted-data/data/employees.csv', header=True)
salaries = spark.read.csv('/mounted-data/data/salaries.csv', header=True)
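Putting it together, your failing script from above would then look something like this. This is just a sketch, assuming the ./:/mounted-data mount from the compose file above and that you still connect to the master at spark://localhost:7077:

from pyspark.sql import SparkSession

def main():
    # Same dockerized Spark master as before, exposed on localhost:7077
    spark = SparkSession.builder.master("spark://localhost:7077").appName("test").getOrCreate()

    # Paths now point at the volume mounted into the containers (see docker-compose.yml above)
    employees = spark.read.csv('/mounted-data/data/employees.csv', header=True)
    salaries = spark.read.csv('/mounted-data/data/salaries.csv', header=True)

    employees.show()
    salaries.show()
    spark.stop()

if __name__ == "__main__":
    main()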
If something went wrong here, try the following:
- Go inside your container using the following command: docker exec -it container-name bash
- cd to your mounted data folder. If you used the example above, that would be cd /mounted-data
  - If that does not work, something went wrong while mounting the volume.
- Have a look at what is in there by running ls -al
  - If that does not work, you might have permission problems on your volume, which is discussed in this SO post.
I hope this helps! :)