I am trying to set up a local dev environment in Docker with PySpark and Delta Lake.
I have gone through the version compatibility between Delta Lake and Spark here.
I have the following in my Pipfile (using pipenv):
pyspark = {version = "==3.2.2", index = "artifactory-pypi"}
delta-spark = {version = "==2.0.0", index = "artifactory-pypi"}
pytest = {version = "==7.1.2", index = "artifactory-pypi"}
pytest-cov = {version = "==3.0.0", index = "artifactory-pypi"}
...other packages
The artifactory-pypi index is a mirror of PyPI.
I have gone through this and am trying to set up a Python project for unit testing. The code already has this:
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

_builder = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark: SparkSession = configure_spark_with_delta_pip(_builder).getOrCreate()
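For context, my unit tests get hold of this session roughly like this (simplified; the module path and fixture name are just illustrative, not my exact layout):

import pytest
from pyspark.sql import SparkSession

# Hypothetical module that runs the builder code shown above at import time
from myproject.spark_session import spark


@pytest.fixture(scope="session")
def spark_session() -> SparkSession:
    # Reuse the module-level session created via configure_spark_with_delta_pip(...)
    return spark


def test_can_create_dataframe(spark_session):
    df = spark_session.createDataFrame([(1, "a")], ["id", "value"])
    assert df.count() == 1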
When I try to run my unit tests using pytest, it always fails at
configure_spark_with_delta_pip(_builder).getOrCreate()
with an error that it cannot connect to the Maven repo to download
delta-core_2.12;2.0.0
I am not very well versed in Java, but I have done some digging and found that within the
/usr/local/lib/python3.9/site-packages/pyspark/jars/
folder there is an ivy-2.4.0.jar
file which apparently resolves which jars are needed and tries to reach out to their Maven coordinates. This connection is refused as I am behind a corporate proxy.
I have tried setting
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl
According to this SO post, Maven does not honor the http_proxy environment variable, and the answer suggested that the OP set the proxy in some Maven configuration files. But there the OP was using a Maven image and thus already had those config files. I do not have such files or folders, as I am just using a Python image; it is just that the Python packages (pyspark) download the jars behind the scenes.
I have also tried looking at the Spark runtime configuration properties, especially spark.jars.repositories,
to see if I could set that from my docker-compose.yml,
but even that didn't work.
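For reference, this is the kind of setting I mean, sketched here as a builder config rather than docker-compose; the repository URL is just a placeholder for an internal Artifactory Maven remote, not a real endpoint:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

_builder = (
    SparkSession.builder.master("local[1]")
    # Ask Ivy to also search an internal mirror in addition to Maven Central
    # (placeholder URL, for illustration only).
    .config(
        "spark.jars.repositories",
        "https://artifactory.mycorp.example/artifactory/maven-remote",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(_builder).getOrCreate()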
How can I get this to work? Can someone suggest either of the following?
- Is it possible to route this download through an org Artifactory? If so, where do I configure that? E.g. for all my Python packages I am already using a PyPI mirror.
- Alternatively, I can download the jars manually, but how and where do I copy them, and which environment variables do I need to set to make it work (e.g. PATH)? A rough sketch of what I have in mind is below this list.
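To illustrate that second option, this is only a guess on my part; the delta-core jar name follows from the Ivy coordinates in the stack trace, and delta-storage is my assumption about an additional jar delta-core needs:

# Hypothetical sketch: copy the Delta jars into PySpark's jars folder and skip the Ivy download.
# Assumes the jars were fetched manually, e.g. from an internal Artifactory Maven remote.
import shutil

pyspark_jars = "/usr/local/lib/python3.9/site-packages/pyspark/jars/"
for jar in ["delta-core_2.12-2.0.0.jar", "delta-storage-2.0.0.jar"]:
    shutil.copy(f"/tmp/{jar}", pyspark_jars)

# With the jars already on the classpath, the session could presumably be built without
# configure_spark_with_delta_pip, keeping only the Delta SQL extension settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)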
Btw, here is the full stack trace
:: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-63bca6fa-4932-4c6d-b13f-4a339629fc26;1.0
confs: [default]
:: resolution report :: resolve 84422ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: io.delta#delta-core_2.12;2.0.0
==== local-m2-cache: tried
file:/root/.m2/repository/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
file:/root/.m2/repository/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
==== local-ivy-cache: tried
/root/.ivy2/local/io.delta/delta-core_2.12/2.0.0/ivys/ivy.xml
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
/root/.ivy2/local/io.delta/delta-core_2.12/2.0.0/jars/delta-core_2.12.jar
==== central: tried
https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
==== spark-packages: tried
https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: io.delta#delta-core_2.12;2.0.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar (java.net.ConnectException: Connection refused (Connection refused))
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: io.delta#delta-core_2.12;2.0.0: not found]
If there are any questions, I can try to elaborate.
Answer:
Thanks to the S.O. community, I have been able to solve this, basically by combining information from multiple S.O. posts and answers.
So, this is how:
- As I mentioned in my question, I already had the proxy environment variables set:
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl
- Then, as mentioned in this post (especially the answer by Thomas Decaux), the JVM can be told to pick up the system proxy settings via -Djava.net.useSystemProxies=true.
- Then this post about where to find the spark-defaults.conf file when you install pyspark through pip.
- Then this post about setting the relevant environment variables.
So, combining all of them, this is how I did it in my Dockerfile:
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl

# pip-installed pyspark ships without a conf/ directory, so create one inside the package
WORKDIR /usr/local/lib/python3.9/site-packages/pyspark
RUN mkdir conf
WORKDIR /usr/local/lib/python3.9/site-packages/pyspark/conf

# Tell the driver JVM (and therefore Ivy) to pick up the system proxy settings
RUN echo "spark.driver.extraJavaOptions=-Djava.net.useSystemProxies=true" > spark-defaults.conf

ENV SPARK_HOME=/usr/local/lib/python3.9/site-packages/pyspark
ENV PYSPARK_PYTHON=python3
Other app-specific steps follow after this.
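For completeness, a quick sanity check along these lines can confirm that the jars now resolve and Delta works; the /tmp path is just an example:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

_builder = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(_builder).getOrCreate()

# Write and read back a tiny Delta table to confirm the delta-core jar was resolved.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-smoke-test")
print(spark.read.format("delta").load("/tmp/delta-smoke-test").count())  # expect 5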