I built a cluster using docker-compose, with one service running Jupyter Lab and two others running Apache Spark (a master and a worker). Here is my docker-compose.yaml:
version: '3'
services:
  jupyter-base-notebook:
    image: docker.io/jupyter/pyspark-notebook
    ports:
      - 8888:8888
    volumes:
      - ./data:/home/jovyan/work
  spark:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
The services seem to be working: I opened Jupyter Lab in my browser and connected to Apache Spark with the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, regexp_replace
import os
spark = SparkSession.builder.master('spark://2833c5f3ee45:7077').getOrCreate()
The connection was successful, as shown by the SparkSession summary below:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.2.1
Master: spark://2833c5f3ee45:7077
AppName: pyspark-shell
However, when I try to load any file from the mounted volume, I get the following error:
df = spark.read.csv('adult.csv', sep=',', header=True, inferSchema=True, encoding='ISO-8859-1')
File file:/home/jovyan/work/adult.csv does not exist
The confusing part is that when I check the path and the files from the notebook, everything looks fine:
print(os.getcwd()) # /home/jovyan/work
print(os.listdir()) # ['.ipynb_checkpoints', 'Python_AP.ipynb', 'Datasets', 'adult.csv']
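Even reading the file directly with plain Python works (a quick sanity check along these lines, since adult.csv shows up in the listing):
# Plain Python read inside the notebook container succeeds
with open('adult.csv', encoding='ISO-8859-1') as f:
    print(f.readline())  # prints the CSV header line just fine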
What am I missing? I'm relatively new to Docker and I don't understand what is going wrong. Thanks in advance.
CodePudding user response:
TL;DR: I updated my docker-compose file so Spark can now find the file, and I changed the read path as well. Below are the new docker-compose.yaml and the explanations.
version: '3'
services:
  jupyter-base-notebook:
    image: docker.io/jupyter/pyspark-notebook
    ports:
      - 8888:8888
    volumes:
      - ./data:/home/jovyan/work:rw
    networks:
      - spark-network
    user: root
    environment:
      - GRANT_SUDO=yes
      - JUPYTER_TOKEN=tad
      - SPARK_MASTER=spark://spark:7077
  spark:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    networks:
      - spark-network
    volumes:
      - ./data:/home/jovyan/work:rw
  spark-worker:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - spark-network
    volumes:
      - ./data:/home/jovyan/work:rw
networks:
  spark-network:
    driver: bridge
Here are the improvements made:
- Shared the volume with all containers, at the same path, and made it explicitly read-write. Spark reads are executed by the workers rather than by the notebook, so the workers need the file at the same path (see the sketch after this list).
- Granted root access to the Jupyter Lab user so it can make any changes it needs.
- Set the SPARK_MASTER environment variable on the jupyter-base-notebook container so it can reach the Spark master by its service name.
- Added a common network for all the containers to ensure they can communicate with each other.
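This also explains why the original check with os.getcwd() and os.listdir() was misleading: plain Python calls run inside the notebook container, while spark.read is carried out by the Spark workers inside their own containers. A minimal sketch of the asymmetry under the original compose file (where only the notebook service mounted ./data):
import os
# Runs in the notebook container, where ./data is mounted: prints True
print('adult.csv' in os.listdir('/home/jovyan/work'))
# Handled by the worker containers, where the path was not mounted,
# so it fails with "File file:/home/jovyan/work/adult.csv does not exist"
df = spark.read.csv('file:///home/jovyan/work/adult.csv', header=True)
With the shared volume in place, every container sees the same file at the same path and the read succeeds.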
Finally, I used the absolute path (with the file:// scheme) to read the file, as follows:
file = 'file:///home/jovyan/work/adult.csv'
df = spark.read.csv(file, sep=',', header=True, inferSchema=True, encoding='ISO-8859-1')
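Putting it all together, here is a minimal end-to-end sketch of the notebook code after the change (the service name spark and the shared volume come from the compose file above; the app name is just an arbitrary label for this example):
from pyspark.sql import SparkSession
# The service name resolves on the shared compose network,
# so the hard-coded container ID is no longer needed.
spark = (SparkSession.builder
         .master('spark://spark:7077')
         .appName('adult-csv-demo')
         .getOrCreate())
# Absolute file:// path, mounted at the same location in every container
df = spark.read.csv('file:///home/jovyan/work/adult.csv',
                    sep=',', header=True, inferSchema=True, encoding='ISO-8859-1')
df.printSchema()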