My Flask application uses pandas and PySpark.
I created a Dockerfile based on a Docker pandas image:
FROM amancevice/pandas
RUN mkdir /app
ADD . /app
WORKDIR /app
EXPOSE 5000
RUN pip install -r requirements.txt
CMD ["python", "app.py"]
In requirements.txt I have:
flask
pymysql
sqlalchemy
passlib
hdfs
Werkzeug
pandas
pyspark
The place where I'm using PySpark is this function (it is just an example to verify that it works):
from pyspark.sql import SparkSession

@app.route('/home/search', methods=["GET", "POST"])
def search_tab():
    if 'loggedin' in session:
        user_id = 'user' + str(session['id'])
        if request.method == 'POST':
            checkboxData = request.form.getlist("checkboxData")
            for cd in checkboxData:
                if cd.endswith(".csv"):
                    data_hdfs(user_id, cd)
                else:
                    print("xml")
        return render_template("search.html", id=session['id'])
    return render_template('login.html')
def data_hdfs(user_id, cd):
    # create the Spark session
    warehouse_location = 'hdfs://hdfs-nn:9000/flask_platform'
    spark = SparkSession \
        .builder \
        .master("local[2]") \
        .appName("read csv") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .getOrCreate()
    # read the CSV from HDFS and write it back as a single file
    raw_data = spark.read.options(header='True', delimiter=';').csv("hdfs://hdfs-nn:9000" + cd)
    raw_data.repartition(1).write.format('csv') \
        .option('header', True) \
        .mode('overwrite') \
        .option('sep', ';') \
        .save("hdfs://hdfs-nn:9000/flask_platform/" + user_id + "/staging_area/mapped_files/mapped_file_4.csv")
    return spark.stop()
But when I try to run the PySpark code, I get this error:
JAVA_HOME is not set
172.20.0.1 - - [15/Apr/2022 11:58:16] "POST /home/search HTTP/1.1" 500 -
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2095, in __call__
return self.wsgi_app(environ, start_response)
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2080, in wsgi_app
response = self.handle_exception(e)
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2077, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1525, in full_dis
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1523, in full_dis
rv = self.dispatch_request()
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1509, in dispatch
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/app/app.py", line 243, in search_tab
data_hdfs(user_id, cd)
File "/app/app.py", line 255, in data_hdfs
spark = SparkSession \
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/session.py", line 228, in
sc = SparkContext.getOrCreate(sparkConf)
File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 392, in get
SparkContext(conf=conf or SparkConf())
File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 144, in __i
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 339, in _en
SparkContext._gateway = gateway or launch_gateway(conf)
File "/usr/local/lib/python3.9/site-packages/pyspark/java_gateway.py", line 108, i
raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
Is it possible to use a Docker image that has both PySpark and pandas installed? If so, where can I find it? I need both in my project. Thanks
CodePudding user response:
pyspark (i.e. Spark) requires Java, which doesn't seem to be installed in your image.
You can try something like:
FROM amancevice/pandas
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
        openjdk-11-jre-headless \
 && apt-get autoremove -yqq --purge \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
ADD . /app
EXPOSE 5000
CMD ["python", "app.py"]
Note that I also moved your requirements.txt installation before adding the rest of your code (copying only that file first). This saves you build time, because Docker can reuse the cached pip layer when only your application code changes.
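If you want a quick sanity check that Java is actually visible to PySpark inside the rebuilt container, a minimal sketch could look like this (an assumption on my part, not part of your app; run it with python inside the container built from the Dockerfile above):
# Smoke test: confirm JAVA_HOME is set and a local SparkSession starts.
import os
import subprocess

from pyspark.sql import SparkSession

print("JAVA_HOME =", os.environ.get("JAVA_HOME"))  # expect /usr/lib/jvm/java-11-openjdk-amd64
subprocess.run(["java", "-version"], check=True)   # prints the OpenJDK version

spark = SparkSession.builder.master("local[2]").appName("smoke-test").getOrCreate()
print("Spark version:", spark.version)
spark.stop()
If this runs without the "Java gateway process exited" error, your Flask route should work as well.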