I am testing simple PySpark functions locally in JupyterLab using Python 3.9.10 in a conda-forge environment on Windows 10.
Setup:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test")
sc = SparkContext(conf=conf)  # Create the SparkContext
# Verify SparkContext
print(sc)
# Print the Spark version of SparkContext
print("The version of Spark Context in the PySpark shell is:", sc.version)
# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is:", sc.pythonVer)
# Print Master of SparkContext: URL of the cluster, or "local" string to run in local mode
print("The master of Spark Context in the PySpark shell is:", sc.master)
# Print name of SparkContext
print("The appName of Spark Context is:", sc.appName)
Output:
<SparkContext master=local[4] appName=PySpark_Test>
The version of Spark Context in the PySpark shell is: 3.2.1
The Python version of Spark Context in the PySpark shell is: 3.9
The master of Spark Context in the PySpark shell is: local[4]
The appName of Spark Context is: PySpark_Test
# Create a Python list of numbers from 1 to 10
numb = [*range(1, 11)]
# Load the list into PySpark
numbRDD = sc.parallelize(numb)
numbRDD.collect()
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
type(numbRDD)
Output:
pyspark.rdd.RDD
cubedRDD = numbRDD.map(lambda x: x**3)
type(cubedRDD)
Output:
pyspark.rdd.PipelinedRDD
Everything above works fine.
PySpark crashes when I run almost any action on this very small cubedRDD.
These actions will crash:
- cubedRDD.collect()
- cubedRDD.first()
- cubedRDD.take(2)
- cubedRDD.count()
Error excerpt:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 30) (10.0.0.236 executor driver): org.apache.spark.SparkException: Python worker failed to connect back. ...
Caused by: java.net.SocketTimeoutException: Accept timed out ...
Caused by: org.apache.spark.SparkException: Python worker failed to connect back. ...
What is going wrong here?
CodePudding user response:
I solved the problem by setting the PYSPARK_PYTHON environment variable to point to the python.exe in my conda-forge environment (Miniconda).
PYSPARK_PYTHON = C:\Users\Me\Miniconda3\envs\spark39\python.exe
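If you would rather not set a system-wide environment variable, the same thing can be done from inside the notebook before the SparkContext is created. This is only a minimal sketch, assuming the notebook kernel is already running the interpreter from the conda environment you want the workers to use (PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the documented variables; using sys.executable here is my own shortcut):
import os
import sys

# Point the Python workers (and the driver) at the interpreter running this notebook.
# Must be set before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test")
sc = SparkContext(conf=conf)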
I finally found the page in the PySpark documentation that describes that environment variable. No other PySpark tutorial or set of installation instructions I had found mentioned it.
I also turned off the Windows App Execution Aliases for python.exe and python3.exe (listed under "App Installer"), since my only Python installation is through Miniconda.
Everything is running fine with jdk1.8.0_321 installed.