I have problems reading files into data frames when running Spark on Docker.
Here's my docker-compose.yml:
version: '2'

services:
  spark:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
It's the basic definition file provided with the Bitnami Spark Docker image, with port 7077 added.
When I run this simple script, which doesn't read anything from the disk, it works:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.master("spark://localhost:7077").appName("test").getOrCreate()
    d = [
        [1, 1],
        [2, 2],
        [3, 3],
    ]
    df = spark.createDataFrame(d)
    df.show()
    spark.stop()

if __name__ == "__main__":
    main()
Output is as expected:
+---+---+
| _1| _2|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
+---+---+
From this I assume that the issue is not with the Spark cluster. However, when I try to read files from the local drive, it doesn't work:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.master("spark://localhost:7077").appName("test").getOrCreate()
    employees = spark.read.csv('./data/employees.csv', header=True)
    salaries = spark.read.csv('./data/salaries.csv', header=True)
    employees.show()
    salaries.show()
    spark.stop()

if __name__ == "__main__":
    main()
I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o27.csv. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (192.168.112.2 executor 0): java.io.FileNotFoundException: File file:/Users/UserName/Projects/spark/test/data/employees.csv does not exist
The file is there. When I run the script with the local PySpark library, defining the Spark session like this: spark = SparkSession.builder.appName("test").getOrCreate(), it works.
Should I somehow add the data directory as a volume to the container? I've tried that as well, but I haven't gotten it to work.
Any advice?
CodePudding user response:
It looks like you're starting up some Docker containers with docker-compose but not mounting any volumes. It makes sense that Spark does not find those files in that case, since they do not exist within the containers.
Imagine your container is a different physical machine from the one you're running your Spark script on. How would it be able to find those files? Well, you could, for example, put a USB stick with the necessary data into that other computer.
For your containers to be able to access these files, you'll need to mount a volume on them. Loosely speaking, this is a bit like putting a USB stick into that other machine.
You can do that by using the volumes keyword in your docker-compose.yml:
version: '2'

services:
  spark:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./:/mounted-data
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ./:/mounted-data
Notice the ./:/mounted-data bit. The pattern is path-on-your-machine:path-on-container. So this will mount your local . path (where your data is located) to /mounted-data within your containers. Note that I added this to both your spark and spark-worker services; I'm not familiar with the Bitnami setup, so it might be enough to only add that volume to the spark-worker service.
Now that the data is available on the container, you just need to point to it properly in your code. You should be able to read the data like so within your larger Spark script:
employees = spark.read.csv('/mounted-data/data/employees.csv', header=True)
salaries = spark.read.csv('/mounted-data/data/salaries.csv', header=True)
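Putting it together, your failing script from above would then look something like this. This is just a sketch, assuming the ./:/mounted-data mount from the compose file above and that you still connect to the master at spark://localhost:7077:

from pyspark.sql import SparkSession

def main():
    # Same dockerized Spark master as before, exposed on localhost:7077
    spark = SparkSession.builder.master("spark://localhost:7077").appName("test").getOrCreate()

    # Paths now point at the volume mounted into the containers (see docker-compose.yml above)
    employees = spark.read.csv('/mounted-data/data/employees.csv', header=True)
    salaries = spark.read.csv('/mounted-data/data/salaries.csv', header=True)

    employees.show()
    salaries.show()
    spark.stop()

if __name__ == "__main__":
    main()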
If something went wrong here, try the following:
- Go inside your container using the following command: docker exec -it container-name bash
- cd to your mounted data folder. If you used the example above, that would be cd /mounted-data
  - If that does not work, something went wrong while mounting the volume.
- Have a look at what is in there by running ls -al
  - If that does not work, you might have permission problems on your volume, which is discussed in this SO post.
I hope this helps! :)