How to add jar files to $SPARK_HOME/jars correctly?


I have used this command and it works fine:

spark = SparkSession.builder.appName('Apptest')\
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.5').getOrCreate()

But I'd like to download the jar file and always start with:

spark = SparkSession.builder.appName('Apptest').getOrCreate()

How can I do it? I have tried:

  1. Move to the SPARK_HOME jars directory:

    cd /de/spark-2.4.6-bin-hadoop2.7/jars

  2. Download the jar file:

    curl https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.3.5/mongo-spark-connector_2.11-2.3.5.jar --output mongo-spark-connector_2.11-2.3.5.jar

But Spark doesn't see it. I get the following error:

Py4JJavaError: An error occurred while calling o66.save.
: java.lang.NoClassDefFoundError: com/mongodb/ConnectionString

I know there is the ./spark-shell --jars option, but I am using a Jupyter notebook. Is there some step missing?

CodePudding user response:

Since you're building the SparkSession inside a Jupyter notebook, unfortunately you have to use .config('spark.jars.packages', '...') to add the jars you want when you create the Spark object.

Instead, if you want the jar to be available by default when you launch the notebook, I would recommend creating a custom kernel, so that every time you open a new notebook you don't even need to create the Spark session yourself. If you're using Anaconda, you can check the docs: https://docs.anaconda.com/ae-notebooks/admin-guide/install/config/custom-pyspark-kernel/
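
For illustration only (the Python path and display name below are placeholders, and the full recipe, including a startup script that creates the spark object for you, is in the Anaconda docs linked above), a custom kernel boils down to a kernel.json that sets PYSPARK_SUBMIT_ARGS for every notebook started with that kernel, roughly like:

{
  "display_name": "PySpark (mongo connector)",
  "language": "python",
  "argv": ["/opt/anaconda3/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/de/spark-2.4.6-bin-hadoop2.7",
    "PYSPARK_SUBMIT_ARGS": "--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.5 pyspark-shell"
  }
}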

CodePudding user response:

What I was looking for is .config("spark.jars",".."):

spark = SparkSession.builder.appName('Test')\
    .config("spark.jars", "/root/mongo-spark-connector_2.11-2.3.5.jar,/root/mongo-java-driver-3.12.5.jar") \
    .getOrCreate()

Or:

import os
os.environ["PYSPARK_SUBMIT_ARGS"]="--jars /root/mongo-spark-connector_2.11-2.3.5.jar,/root/mongo-java-driver-3.12.5.jar pyspark-shell"
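
For this second approach the order matters: PYSPARK_SUBMIT_ARGS is only read when PySpark launches the JVM, so it has to be set before the first SparkSession is created. A minimal sketch of a notebook cell, reusing the same paths as above:

import os

# PYSPARK_SUBMIT_ARGS is read when PySpark launches the JVM gateway,
# so it must be set before the first SparkSession/SparkContext exists.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /root/mongo-spark-connector_2.11-2.3.5.jar,"
    "/root/mongo-java-driver-3.12.5.jar pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()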

Also, it seems that simply putting the jar files in $SPARK_HOME/jars works fine as well; in my case the question's example was only missing the dependency mongo-java-driver-3.12.5.jar. Unlike spark.jars.packages, which resolves transitive dependencies from Maven automatically, copying jars by hand means you have to supply every dependency yourself. After downloading all dependencies into $SPARK_HOME/jars I was able to run with just:

spark = SparkSession.builder.appName('Test').getOrCreate()

I found the dependencies listed at: https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector_2.11/2.3.5
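
For completeness, a sketch of the $SPARK_HOME/jars route from the question with the missing driver added; the java-driver URL is assumed to follow the standard Maven Central layout for org.mongodb:mongo-java-driver:3.12.5:

cd /de/spark-2.4.6-bin-hadoop2.7/jars

curl https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.3.5/mongo-spark-connector_2.11-2.3.5.jar --output mongo-spark-connector_2.11-2.3.5.jar
curl https://repo1.maven.org/maven2/org/mongodb/mongo-java-driver/3.12.5/mongo-java-driver-3.12.5.jar --output mongo-java-driver-3.12.5.jar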
