I've created a Docker image of a program that calls findspark.init(). The program runs fine on my local machine, but when I run the image with docker run -p 5000:5000 imgname:latest, I get the following error:
Traceback (most recent call last):
  File "app.py", line 37, in <module>
    findspark.init()
  File "/usr/local/lib/python3.8/site-packages/findspark.py", line 129, in init
    spark_home = find()
  File "/usr/local/lib/python3.8/site-packages/findspark.py", line 35, in find
    raise ValueError(
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
Can anyone suggest a way around this problem? When I run the program without the findspark call, I get other Spark-related errors. This is my Dockerfile:
# Use Python as the base image
FROM python:3.8
# Use /app as the working directory
WORKDIR /app
# Copy the contents of the current directory to /app
ADD . /app
# Install required packages
RUN pip3 install --upgrade pip
RUN pip3 install -r requirements.txt
# Open port 5000
EXPOSE 5000
# Set environment variable
ENV NAME analytic
# Run the Python program
CMD python app.py
Here is the part of the code where the container fails:
### multiple lines of library imports, and then
# Spark imports
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import functions as F
The requirements.txt file can be seen at this link.
CodePudding user response:
Spark requires Java even if you're only running PySpark, so you need to install Java in your image. In addition, if you're still using findspark, you can specify the SPARK_HOME directory as well:
RUN apt-get update && apt-get install -y default-jre
ENV SPARK_HOME /usr/local/lib/python3.8/site-packages/pyspark
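As an alternative to setting the environment variable, findspark.init() also accepts the Spark home path as an argument, so you could pass it directly in code; a minimal sketch, assuming the same pip-installed pyspark location used above:

# Sketch: point findspark at Spark explicitly instead of relying on SPARK_HOME.
# The path assumes pyspark was installed with pip under Python 3.8, as in this image.
import findspark
findspark.init("/usr/local/lib/python3.8/site-packages/pyspark")

import pyspark  # Spark can now be located and imported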
To summarize, your Dockerfile should look like:
# Use Python as the base image
FROM python:3.8
RUN apt-get update && apt-get install -y default-jre
# Use /app as the working directory
WORKDIR /app
# Copy the contents of the current directory to /app
ADD . /app
# Install required packages
RUN pip3 install --upgrade pip
RUN pip3 install -r requirements.txt
# Open port 5000
EXPOSE 5000
# Set environment variables
ENV NAME analytic
ENV SPARK_HOME /usr/local/lib/python3.8/site-packages/pyspark
# Run the Python program
CMD python app.py
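Once the image is rebuilt, you can sanity-check that Java and SPARK_HOME are picked up with a minimal smoke test (a sketch; the file name, app name, and test data are just placeholders):

# smoke_test.py - verify findspark locates Spark inside the container
import findspark
findspark.init()  # uses the SPARK_HOME set in the Dockerfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
# A tiny DataFrame round-trip proves the JVM side is working
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()
spark.stop()

Run it with docker run imgname:latest python smoke_test.py (assuming the file is copied into /app along with the rest of the code).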