Unable to run docker image with findspark.init


I've created a Docker image of a program that calls findspark.init(). The program runs fine on my local machine, but when I run the image with docker run -p 5000:5000 imgname:latest, I get the following error:

Traceback (most recent call last):
  File "app.py", line 37, in <module>
    findspark.init()
  File "/usr/local/lib/python3.8/site-packages/findspark.py", line 129, in init
    spark_home = find()
  File "/usr/local/lib/python3.8/site-packages/findspark.py", line 35, in find
    raise ValueError(
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).

Can anyone suggest a way around this problem? When I try to run the program without the findspark call, I get other Spark-related errors. This is my Dockerfile:

#Use python as base image
FROM python:3.8

#Use working dir app
WORKDIR /app

#Copy contents of current dir to /app
ADD . /app

#Install required packages
RUN pip3 install --upgrade pip
RUN pip3 install -r requirements.txt

#Open port 5000
EXPOSE 5000

#Set environment variable
ENV NAME analytic

#Run python program
CMD python app.py

Here is the part of the code where the container fails:

    ### multiple lines of importing libraries and then    
    # Spark imports
    import findspark
    findspark.init()
    
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    from pyspark.sql import functions as F

The requirements.txt file can be seen on this link.

CodePudding user response:

Spark requires Java even if you're only running PySpark, so you need to install Java in your image. In addition, if you're still using findspark, you can set the SPARK_HOME environment variable to the pip-installed pyspark directory as well:

RUN apt-get update && apt-get install -y default-jre
ENV SPARK_HOME /usr/local/lib/python3.8/site-packages/pyspark
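
If you prefer not to rely on the environment variable alone, findspark.init() also accepts the Spark home path as an argument. A minimal sketch, assuming pyspark was installed by pip into the python:3.8 image (the fallback path below is that assumption spelled out, not something from the original project):

import os
import findspark

# Prefer SPARK_HOME if it is set; otherwise fall back to the pip-installed
# pyspark location (an assumption for the python:3.8 base image).
findspark.init(os.environ.get(
    "SPARK_HOME",
    "/usr/local/lib/python3.8/site-packages/pyspark",
))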

To summarize, your Dockerfile should look like this:

#Use python as base image
FROM python:3.8

RUN apt-get update && apt-get install -y default-jre

#Use working dir app
WORKDIR /app

#Copy contents of current dir to /app
ADD . /app

#Install required packages
RUN pip3 install --upgrade pip
RUN pip3 install -r requirements.txt

#Open port 5000
EXPOSE 5000

#Set environment variable
ENV NAME analytic
ENV SPARK_HOME /usr/local/lib/python3.8/site-packages/pyspark

#Run python program
CMD python app.py
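
After rebuilding the image with docker build -t imgname:latest ., you can sanity-check that Java and SPARK_HOME are actually visible inside the container before starting the full app. A rough sketch of such a check, where check_spark.py is a hypothetical helper script (not part of the original project) that you could run with docker run --rm imgname:latest python check_spark.py:

# check_spark.py -- hypothetical diagnostic script, not part of the original app.
import os
import shutil

# Both values should be non-empty after the Dockerfile changes above.
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
print("java found at:", shutil.which("java"))

import findspark
findspark.init()  # should no longer raise ValueError

from pyspark.sql import SparkSession

# Start a small local Spark session as a smoke test and print its version.
spark = SparkSession.builder.master("local[1]").appName("spark-check").getOrCreate()
print("Spark version:", spark.version)
spark.stop()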