Import error with "import pyspark.pandas"

This is part of new coursework I am doing. I am trying to install pyspark and intend to use pyspark.pandas. I check that the packages import like this:

import pandas as pd
import numpy as np
import pyspark.pandas as ps

But when I run the imports, I get the error below.

ImportError: cannot import name 'print_exec' from 'pyspark.cloudpickle' (C:\Users\smith\Anaconda3\lib\site-packages\pyspark\cloudpickle\__init__.py)

The PySpark version I am using is 3.1.3. I am not sure, but I may have set the paths incorrectly. Is there a way to verify the paths? If this is some other issue, please let me know.

Thanks

CodePudding user response：

The pandas API on Spark (pyspark.pandas) is available only in PySpark 3.2 and above.
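
As a minimal sketch, that version requirement can be checked before attempting the import. The helper name supports_pandas_api is made up for illustration. The installed version is read through the standard library's importlib.metadata, so the check works even when importing pyspark.pandas fails.

```python
from importlib import metadata


def supports_pandas_api(version: str) -> bool:
    """Return True if this PySpark version string is 3.2 or later."""
    # Only major and minor matter here, e.g. "3.1.3" -> (3, 1).
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (3, 2)


try:
    # Reads the installed distribution's metadata without importing pyspark.
    installed = metadata.version("pyspark")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("pyspark is not installed in this environment")
elif supports_pandas_api(installed):
    print(f"pyspark {installed}: new enough for pyspark.pandas")
else:
    print(f"pyspark {installed}: too old for pyspark.pandas, upgrade to 3.2+")
```

With the 3.1.3 install from the question, this should take the last branch.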

To upgrade PySpark to its latest release, execute the following command:

!pip install --upgrade pyspark

Remove the "!" if you're not executing the command in a Jupyter notebook.

After you restart your kernel, import pyspark.pandas as ps should work.

Note

You can also check the PySpark version Python is importing like so:

import pyspark

print(pyspark.__version__)
# 3.3.0
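
On the question of verifying paths, here is a small sketch that prints which interpreter is running and where it would load pyspark from. It uses only the standard library and does not import pyspark itself. The variable name pyspark_location is just for illustration.

```python
import importlib.util
import sys

# The interpreter running this code. pip must install into this same environment.
print("Python executable:", sys.executable)

# Locate the pyspark package on the import path without importing it.
spec = importlib.util.find_spec("pyspark")
pyspark_location = spec.origin if spec is not None else None

if pyspark_location is None:
    print("pyspark is not importable from this interpreter")
else:
    # For a regular package this is the path to its __init__.py.
    print("pyspark would be loaded from:", pyspark_location)
```

If the printed location belongs to a different environment than the one pip upgraded, running the upgrade as python -m pip install --upgrade pyspark with the interpreter shown above targets the right one.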

Update

I've had a look at the history of changes made to broadcast.py (which I believe is where the import is failing), and it seems the location of print_exec was changed from pyspark.cloudpickle to pyspark.util. Upgrading should solve the issue.

Older version of broadcast.py module:

https://github.com/apache/spark/blob/75ea89ad94ca76646e4697cf98c78d14c6e2695f/python/pyspark/broadcast.py#L24

Newer versions:

https://github.com/apache/spark/blob/8f744783531d4f62abdf82643b5eb34d54a2820b/python/pyspark/broadcast.py#L42
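
To see where print_exec lives in your own install, a rough sketch like the one below scans the installed pyspark source files for a top-level definition. The helper modules_defining is hypothetical and does a plain text search, so treat the result as a hint rather than proof.

```python
import importlib.util
from pathlib import Path
from typing import List


def modules_defining(package_dir: Path, function_name: str) -> List[str]:
    """List .py files under package_dir that define function_name at top level."""
    needle = f"def {function_name}("
    hits = []
    for path in sorted(package_dir.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # A top-level definition starts at column 0.
        if any(line.startswith(needle) for line in text.splitlines()):
            hits.append(str(path.relative_to(package_dir)))
    return hits


# Locate the installed package without importing it.
spec = importlib.util.find_spec("pyspark")
if spec is None or spec.origin is None:
    print("pyspark is not importable from this interpreter")
else:
    print(modules_defining(Path(spec.origin).parent, "print_exec"))
```

If the file that shows up does not match the module that broadcast.py imports from, the installed files are inconsistent with each other and a clean reinstall or upgrade is the likely fix.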