Is it possible to use a docker image that has both pyspark and pandas installed?

Time:04-16

My flask application uses pandas and pyspark.

I created a Dockerfile where it uses a docker Pandas image:

FROM amancevice/pandas
RUN mkdir /app
ADD . /app
WORKDIR /app
EXPOSE 5000
RUN pip install -r requirements.txt
CMD ["python", "app.py"]

In requirements.txt I have:

flask
pymysql
sqlalchemy
passlib
hdfs
Werkzeug
pandas
pyspark

Where I'm using pyspark is in this function (it was just an example to verify that it works):

from pyspark.sql import SparkSession

@app.route('/home/search', methods=["GET", "POST"])
def search_tab():
    if 'loggedin' in session:
        user_id = 'user' + str(session['id'])

        if request.method == 'POST':
            checkboxData = request.form.getlist("checkboxData")

            for cd in checkboxData:
                if cd.endswith(".csv"):
                    data_hdfs(user_id, cd)
                else:
                    print("xml")

            return render_template("search.html", id=session['id'])
    return render_template('login.html')


def data_hdfs(user_id, cd):
    #spark session
    warehouse_location ='hdfs://hdfs-nn:9000/flask_platform'

    spark = SparkSession \
        .builder \
        .master("local[2]") \
        .appName("read csv") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .getOrCreate()

    raw_data = spark.read.options(header='True', delimiter=';').csv("hdfs://hdfs-nn:9000" + cd)

    raw_data.repartition(1).write.format('csv') \
        .option('header', True) \
        .mode('overwrite') \
        .option('sep', ';') \
        .save("hdfs://hdfs-nn:9000/flask_platform/" + user_id + "/staging_area/mapped_files/mapped_file_4.csv")

    spark.stop()
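As a side note, the concatenated HDFS paths above are easier to get right with f-strings, since a dropped `+` silently becomes a syntax error. A small sketch using the same path as the snippet above (`user1` is just an illustrative id):

```python
def staging_path(user_id: str) -> str:
    # Build the HDFS output path used in data_hdfs() with an f-string
    # instead of '+' concatenation, which is easy to drop by accident.
    return (
        f"hdfs://hdfs-nn:9000/flask_platform/{user_id}"
        "/staging_area/mapped_files/mapped_file_4.csv"
    )

print(staging_path("user1"))
# hdfs://hdfs-nn:9000/flask_platform/user1/staging_area/mapped_files/mapped_file_4.csv
```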

But when I try to run the pyspark code, I get this error:

JAVA_HOME is not set
172.20.0.1 - - [15/Apr/2022 11:58:16] "POST /home/search HTTP/1.1" 500 -
Traceback (most recent call last):
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2095, in __call__
     return self.wsgi_app(environ, start_response)
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2080, in wsgi_app
     response = self.handle_exception(e)
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2077, in wsgi_app
     response = self.full_dispatch_request()
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1525, in full_dis
     rv = self.handle_user_exception(e)
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1523, in full_dis
     rv = self.dispatch_request()
   File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1509, in dispatch
     return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
   File "/app/app.py", line 243, in search_tab
     data_hdfs(user_id, cd)
   File "/app/app.py", line 255, in data_hdfs
     spark = SparkSession \
   File "/usr/local/lib/python3.9/site-packages/pyspark/sql/session.py", line 228, in
     sc = SparkContext.getOrCreate(sparkConf)
   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 392, in get
     SparkContext(conf=conf or SparkConf())
   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 144, in __i
     SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 339, in _en
     SparkContext._gateway = gateway or launch_gateway(conf)
   File "/usr/local/lib/python3.9/site-packages/pyspark/java_gateway.py", line 108, i
     raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number

Is it possible to use a docker image that has both pyspark and pandas installed? If so, where can I find it? Because I need to use both in my project. Thanks

CodePudding user response:

pyspark (aka Spark) requires Java, which doesn't seem to be installed in your image.

You can try something like:

FROM amancevice/pandas

RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         openjdk-11-jre-headless \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

RUN mkdir /app
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
ADD . /app
EXPOSE 5000
CMD ["python", "app.py"]

Note that I also copy requirements.txt and install the dependencies before adding the rest of your code. This way Docker can cache the dependency layer, so rebuilds that only change application code skip the pip install.
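If you want the app to fail fast with a clearer message than the Java gateway error, you could add a small startup check before creating the SparkSession. A sketch, assuming only that a usable JVM means either a valid JAVA_HOME or a `java` binary on PATH:

```python
import os
import shutil

def java_available() -> bool:
    # pyspark spawns a JVM through the `java` binary: Spark looks at
    # JAVA_HOME first and falls back to `java` on PATH.
    java_home = os.environ.get("JAVA_HOME")
    if java_home and os.path.isfile(os.path.join(java_home, "bin", "java")):
        return True
    return shutil.which("java") is not None

# Call this early in app.py, before SparkSession.builder.getOrCreate(), e.g.:
# if not java_available():
#     raise SystemExit("Java not found: pyspark needs a JVM (set JAVA_HOME).")
```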
