PySpark 3.2.1 - basic actions crashing on very small RDDs


I am testing simple PySpark functions locally in JupyterLab using Python 3.9.10 in a conda-forge environment on Windows 10.

Setup:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test") 
sc = SparkContext(conf=conf) # Default SparkContext

# Verify SparkContext
print(sc)

# Print the Spark version of SparkContext
print("The version of Spark Context in the PySpark shell is:", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is:", sc.pythonVer)

# Print Master of SparkContext: URL of the cluster, or the "local" string to run in local mode
print("The master of Spark Context in the PySpark shell is:", sc.master)

# Print name of SparkContext
print("The appName of Spark Context is:", sc.appName)

Output:

<SparkContext master=local[4] appName=PySpark_Test>
The version of Spark Context in the PySpark shell is: 3.2.1
The Python version of Spark Context in the PySpark shell is: 3.9
The master of Spark Context in the PySpark shell is: local[4]
The appName of Spark Context is: PySpark_Test

# Create a Python list of numbers from 1 to 10
numb = [*range(1, 11)]

# Load the list into PySpark  
numbRDD = sc.parallelize(numb)

numbRDD.collect()

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

type(numbRDD)

Output:

pyspark.rdd.RDD

cubedRDD = numbRDD.map(lambda x: x**3)
type(cubedRDD)

Output:

pyspark.rdd.PipelinedRDD

Everything above works fine.

I get a crash when I run almost any action on this very small cubedRDD.

These actions will crash:

  • cubedRDD.collect()
  • cubedRDD.first()
  • cubedRDD.take(2)
  • cubedRDD.count()

Error excerpt:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 30) (10.0.0.236 executor driver): org.apache.spark.SparkException: Python worker failed to connect back. ...

Caused by: java.net.SocketTimeoutException: Accept timed out ...

Caused by: org.apache.spark.SparkException: Python worker failed to connect back. ...

What is going wrong here?

CodePudding user response:

I solved the problem by setting the PYSPARK_PYTHON environment variable to point to the python.exe in my conda-forge environment (Miniconda).

PYSPARK_PYTHON = C:\Users\Me\Miniconda3\envs\spark39\python.exe
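If you would rather not set a system-wide environment variable, the same interpreter can be pointed to from inside the notebook before the SparkContext is created. A minimal sketch (the path is the one from my environment; substitute your own):

import os

# Both variables must be set before SparkContext is constructed.
# PYSPARK_PYTHON is the interpreter launched for the worker processes;
# PYSPARK_DRIVER_PYTHON is the interpreter used on the driver side.
os.environ["PYSPARK_PYTHON"] = r"C:\Users\Me\Miniconda3\envs\spark39\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\Me\Miniconda3\envs\spark39\python.exe"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test")
sc = SparkContext(conf=conf)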

I finally found the page in the PySpark docs that describes that environment variable. None of the other PySpark tutorials or installation guides I had come across mentioned it.

I also turned off the Windows App Execution Aliases for python.exe and python3.exe (under "App Installer"), since the only Python I have installed is via Miniconda.

Everything is running fine with jdk1.8.0_321 installed.
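A quick way to confirm that the workers are picking up the intended interpreter is to ask each task for its sys.executable. This is only a sanity-check sketch, assuming the context has been recreated after setting the variable:

import sys

# Each partition reports the Python executable its worker process is running;
# every entry should be the Miniconda python.exe configured above.
print(sc.parallelize(range(4), 4).map(lambda _: sys.executable).distinct().collect())

# The actions that crashed before now return normally.
print(cubedRDD.collect())  # [1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]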
