Understanding the jars in pyspark


I'm new to Spark and my understanding is this:

  1. Jars are like bundles of compiled Java code files.
  2. Each library I install that internally uses Spark (or PySpark) has its own jar files that need to be available to both the driver and the executors so that they can execute the package API calls the user interacts with. These jar files are like the backend code for those API calls.

Questions:

  1. Why are these jar files needed? Why could it not have sufficed to have all the code in Python? (I guess the answer is that Spark was originally written in Scala, where it distributes its dependencies as jars. So, to avoid rebuilding that mountain of code, the Python libraries just call that Java code from the Python interpreter through some kind of converter/bridge. Please tell me if I have understood this right.)
  2. You specify the locations of these jar files while creating the Spark context via spark.driver.extraClassPath and spark.executor.extraClassPath. I guess these parameters are outdated, though. What is the current way to specify the location of these jar files? (A rough sketch of the kind of configuration I mean is shown after this list.)
  3. Where do I find these jars for each library that I install? For example, synapseml. What is the general idea of where the jar files for a package are located? Why don't the libraries make it clear where their specific jar files are going to be?
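
For context, here is roughly the configuration style I am referring to in question 2 (an illustrative sketch only; the jar path is made up):

from pyspark.sql import SparkSession

# Illustrative only: a made-up local jar path added to the driver and executor
# classpaths. Note that extraClassPath does not copy the jar anywhere; the file
# must already exist at that path on every node, and depending on the deploy
# mode it may need to be set before the driver JVM starts (e.g. in
# spark-defaults.conf or on the spark-submit command line).
spark = (
    SparkSession.builder
    .appName("ExtraClassPathExample")
    .config("spark.driver.extraClassPath", "/opt/jars/some-library.jar")
    .config("spark.executor.extraClassPath", "/opt/jars/some-library.jar")
    .getOrCreate()
)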

I understand I might not be making much sense here; what I have written above is partly just my hunch about how it must be working.

So, can you please help me understand this whole business with jars and how to find and specify them?

CodePudding user response:

Each library I install that internally uses Spark (or PySpark) has its own jar files

Can you tell us which library you are trying to install?

Yes, external libraries can ship jars even if you are writing your code in Python.

Why?

These libraries typically implement their logic as UDFs (User-Defined Functions). Spark runs code in the Java runtime. If these UDFs were written in Python, there would be a lot of serialization and deserialization overhead, because the data has to be converted into something the Python process can read.

Java and Scala UDFs are usually faster, which is why some libraries ship their core logic as jars.
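
As a rough illustration (not from the original answer, just a sketch of the general point), compare a Python UDF, which ships every row to a Python worker process and back, with an equivalent built-in function that stays inside the JVM:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("UdfComparison").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is pickled, sent to a Python worker process,
# transformed there, and serialized back to the JVM.
to_upper_py = F.udf(lambda s: s.upper(), StringType())
df.withColumn("upper_py", to_upper_py("name")).show()

# Built-in (JVM) function: the same work stays in the Java runtime,
# with no per-row serialization round trip.
df.withColumn("upper_jvm", F.upper("name")).show()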

Why could it not have sufficed to have all the code in Python?

Same reason: Scala/Java UDFs are faster than Python UDFs.

What is the current way to specify the location of these jar files?

You can use the spark.jars.packages property. Spark resolves the Maven coordinates and makes the jars available to both the driver and the executors.
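
If you already have the jar file on disk (or at a URL) rather than a Maven coordinate, the spark.jars property takes a comma-separated list of jar paths, and Spark distributes those files as well. A minimal sketch with a made-up path:

from pyspark.sql import SparkSession

# spark.jars takes jar file paths/URLs directly; the path below is made up.
spark = (
    SparkSession.builder
    .appName("LocalJarExample")
    .config("spark.jars", "/opt/jars/my-library.jar")
    .getOrCreate()
)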

Where do I find these jars for each library that I install? For example, synapseml. What is the general idea of where the jar files for a package are located?

https://github.com/microsoft/SynapseML#python

They mention there which package is required, i.e. com.microsoft.azure:synapseml_2.12:0.9.4 (a Maven coordinate that Spark resolves to the actual jars).

from pyspark.sql import SparkSession

# spark.jars.packages takes the Maven coordinate; Spark downloads the jar
# (and its dependencies) from the extra repository and makes it available
# to the driver and executors.
spark = SparkSession.builder.appName("MyApp") \
            .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
            .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
            .getOrCreate()

# The Python package is a wrapper that calls into that jar on the JVM side.
import synapse.ml

Can you try the above snippet?
