This is part of new coursework I am doing. I am trying to install pyspark and intend to use pyspark.pandas. I check the installation with these imports:
import pandas as pd
import numpy as np
import pyspark.pandas as ps
But when I run the imports, I see the error below.
ImportError: cannot import name 'print_exec' from 'pyspark.cloudpickle' (C:\Users\smith\Anaconda3\lib\site-packages\pyspark\cloudpickle\__init__.py)
The pyspark version I am using is 3.1.3. I am not sure, but I may have set the paths incorrectly. Is there a way I can verify the paths? If this could be some other issue, please let me know.
Thanks
CodePudding user response:
The pandas API on Spark (pyspark.pandas) is available only in PySpark 3.2 and above.
To upgrade PySpark to its latest release, execute the following command:
!pip install --upgrade pyspark
Remove the "!" if you're not executing the command in a Jupyter notebook.
After restarting your kernel, import pyspark.pandas as ps should work.
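If the upgrade does not seem to take effect, one possible cause is that pip installed into a different environment than the one your kernel uses. A minimal sketch, assuming you want pip tied to the running interpreter; the helper name pip_upgrade_command is made up for illustration:

```python
import sys


def pip_upgrade_command(package="pyspark"):
    """Build a pip command bound to the interpreter running this code.

    Hypothetical helper: invoking pip as "<python> -m pip" installs into
    the same environment that this Python imports from.
    """
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]


# Print the command instead of running it, so nothing is installed by accident.
print(pip_upgrade_command())
```

The printed list can be passed to subprocess.run, or typed into a terminal as a single command.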
Note
You can also check the PySpark version Python is importing like so:
import pyspark
print(pyspark.__version__)
# 3.3.0
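To answer the question about verifying paths, here is a rough sketch that reports which pyspark installation the current interpreter would pick up and whether it is new enough for pyspark.pandas. It does not import pyspark itself, so it also works when the import is broken. The function names are illustrative, and the 3.2 threshold is the one stated above:

```python
import importlib.metadata
import importlib.util


def supports_pandas_api(version):
    """Return True if a PySpark version string is 3.2 or newer."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (3, 2)


def describe_pyspark_install():
    """Report which PySpark installation (if any) this interpreter would import."""
    # find_spec locates the package on sys.path without executing it.
    spec = importlib.util.find_spec("pyspark")
    if spec is None:
        return "pyspark is not importable from this interpreter"
    try:
        version = importlib.metadata.version("pyspark")
    except importlib.metadata.PackageNotFoundError:
        return f"pyspark found at {spec.origin}, but no package metadata"
    status = "supports" if supports_pandas_api(version) else "does not support"
    return f"pyspark {version} at {spec.origin} {status} pyspark.pandas"


print(describe_pyspark_install())
```

If the reported path is not the Anaconda environment you expect, the imports are probably resolving to a different installation.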
Update
I've had a look at the history of changes made to broadcast.py (which I believe is where the import is failing), and it seems the location of print_exec was changed from pyspark.cloudpickle to pyspark.util. Upgrading should solve the issue.
Older versions of the broadcast.py module import print_exec from pyspark.cloudpickle.
Newer versions import it from pyspark.util.