I built a cluster using docker-compose, with one service running Jupyter Lab and two others running Apache Spark (a master and a worker). Here is my docker-compose.yaml:
version: '3'
services:
  jupyter-base-notebook:
    image: docker.io/jupyter/pyspark-notebook
    ports:
      - 8888:8888
    volumes:
      - ./data:/home/jovyan/work
  spark:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
The services seem to be working: I opened Jupyter Lab in my browser and connected to Apache Spark with the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, regexp_replace
import os
spark = SparkSession.builder.master('spark://2833c5f3ee45:7077').getOrCreate()
The connection was successful, as shown by the SparkSession summary below:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.2.1
Master: spark://2833c5f3ee45:7077
AppName: pyspark-shell
However, when I try to load any file from the mounted volume, I get the following error:
df = spark.read.csv('adult.csv', sep=',', header=True, inferSchema=True, encoding='ISO-8859-1')
File file:/home/jovyan/work/adult.csv does not exist
The confusing part is that when I check the path and the files from the notebook, everything looks fine:
print(os.getcwd()) # /home/jovyan/work
print(os.listdir()) # ['.ipynb_checkpoints', 'Python_AP.ipynb', 'Datasets', 'adult.csv']
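Even reading the file directly with plain Python works (a quick sanity check along these lines, since adult.csv shows up in the listing):
# Plain Python read inside the notebook container succeeds
with open('adult.csv', encoding='ISO-8859-1') as f:
    print(f.readline())  # prints the CSV header line just fine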
What am I missing? I'm relatively new to Docker and I don't understand what is going wrong. Thanks in advance.
CodePudding user response:
TL;DR: I updated my docker-compose file so Spark can now find the file, and I changed the read path as well. Below are the new docker-compose.yaml and the explanations.
version: '3'
services:
  jupyter-base-notebook:
    image: docker.io/jupyter/pyspark-notebook
    ports:
      - 8888:8888
    volumes:
      - ./data:/home/jovyan/work:rw
    networks:
      - spark-network
    user: root
    environment:
      - GRANT_SUDO=yes
      - JUPYTER_TOKEN=tad
      - SPARK_MASTER=spark://spark:7077
  spark:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    networks:
      - spark-network
    volumes:
      - ./data:/home/jovyan/work:rw
  spark-worker:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - spark-network
    volumes:
      - ./data:/home/jovyan/work:rw
networks:
  spark-network:
    driver: bridge
Here are the improvements made:
- Shared the volume with all containers, at the same path, and made it explicitly read-write. Spark reads are executed by the workers rather than by the notebook, so the workers need the file at the same path (see the sketch after this list).
- Granted root access to the Jupyter Lab user so it can make any changes it needs.
- Set the SPARK_MASTER environment variable on the jupyter-base-notebook container so it can reach the Spark master by its service name.
- Added a common network for all the containers to ensure they can communicate with each other.
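This also explains why the original check with os.getcwd() and os.listdir() was misleading: plain Python calls run inside the notebook container, while spark.read is carried out by the Spark workers inside their own containers. A minimal sketch of the asymmetry under the original compose file (where only the notebook service mounted ./data):
import os
# Runs in the notebook container, where ./data is mounted: prints True
print('adult.csv' in os.listdir('/home/jovyan/work'))
# Handled by the worker containers, where the path was not mounted,
# so it fails with "File file:/home/jovyan/work/adult.csv does not exist"
df = spark.read.csv('file:///home/jovyan/work/adult.csv', header=True)
With the shared volume in place, every container sees the same file at the same path and the read succeeds.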
Finally, I used the absolute path (with the file:// scheme) to read the file, as follows:
file = 'file:///home/jovyan/work/adult.csv'
df = spark.read.csv(file, sep=',', header=True, inferSchema=True, encoding='ISO-8859-1')
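Putting it all together, here is a minimal end-to-end sketch of the notebook code after the change (the service name spark and the shared volume come from the compose file above; the app name is just an arbitrary label for this example):
from pyspark.sql import SparkSession
# The service name resolves on the shared compose network,
# so the hard-coded container ID is no longer needed.
spark = (SparkSession.builder
         .master('spark://spark:7077')
         .appName('adult-csv-demo')
         .getOrCreate())
# Absolute file:// path, mounted at the same location in every container
df = spark.read.csv('file:///home/jovyan/work/adult.csv',
                    sep=',', header=True, inferSchema=True, encoding='ISO-8859-1')
df.printSchema()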