Import error with "import pyspark.pandas "


This is part of new coursework I am doing. I am trying to install PySpark, and I intend to use pyspark.pandas. To check my installation, I run the following imports:

import pandas as pd

import numpy as np

import pyspark.pandas as ps

But when I run these imports, I see the error below.

ImportError: cannot import name 'print_exec' from 'pyspark.cloudpickle' (C:\Users\smith\Anaconda3\lib\site-packages\pyspark\cloudpickle\__init__.py)

The PySpark version I am using is 3.1.3. I am not sure; I could have set the paths incorrectly here. Is there a way I can verify the paths? Or if this could be some other issue, please let me know.

Thanks

CodePudding user response:

The pandas API on Spark (pyspark.pandas) is only available in PySpark 3.2 and above.

To upgrade PySpark to its latest release, execute the following command:

!pip install --upgrade pyspark

Remove the "!" if you're not executing the command in a Jupyter Notebook.

After restarting your kernel, import pyspark.pandas as ps should work.
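If you want to guard the import rather than let it fail, you can compare the installed version against the 3.2 minimum first. This is a sketch using a hypothetical helper (supports_pandas_api is not part of PySpark); it only parses the version string, so it runs even without PySpark installed:

```python
def supports_pandas_api(version: str) -> bool:
    """Return True if this PySpark version string ships pyspark.pandas (3.2+)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (3, 2)

print(supports_pandas_api("3.1.3"))  # the asker's version -> False
print(supports_pandas_api("3.3.0"))  # -> True
```

In real code you would pass pyspark.__version__ and only run import pyspark.pandas as ps when the check returns True.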

Note

You can also check the PySpark version Python is importing like so:

import pyspark

print(pyspark.__version__)
# 3.3.0
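Since the question also asks how to verify paths: a quick way to confirm that pip upgraded the same environment your notebook uses is to print the interpreter path and where PySpark resolves from. This is a standard-library-only sketch, so it runs whether or not PySpark is installed:

```python
import importlib.util
import sys

# The interpreter the notebook kernel is actually running.
print(sys.executable)

# Where (if anywhere) this interpreter would import pyspark from.
spec = importlib.util.find_spec("pyspark")
print(spec.origin if spec else "pyspark not found on this interpreter")
```

If the printed origin points at a different environment than the one you upgraded (for example, a different Anaconda env), that mismatch would explain the stale import.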

Update

I've had a look at the history of changes made to broadcast.py (which I believe is where the import is failing), and it seems the location of print_exec was moved from pyspark.cloudpickle to pyspark.util. Upgrading should solve the issue.

Older version of broadcast.py module:

https://github.com/apache/spark/blob/75ea89ad94ca76646e4697cf98c78d14c6e2695f/python/pyspark/broadcast.py#L24

Newer versions:

https://github.com/apache/spark/blob/8f744783531d4f62abdf82643b5eb34d54a2820b/python/pyspark/broadcast.py#L42
