How does pickle work with a DataFrame internally?

Time:11-08

I wonder how the pickle module saves and loads objects. I saved a DataFrame object to a file on disk:

import pandas as pd
import pickle

df = pd.read_excel(r".\test.xlsx")

with open("o.pkl", "wb") as file:
    pickle.dump(df, file)

Then I uninstalled pandas and tried to load the DataFrame from the file, but I get the error "Exception has occurred: ModuleNotFoundError No module named 'pandas'":

import pickle

with open("o.pkl", "rb") as file:
    e = pickle.load(file)

My question is: does the pickle module somehow use pandas when loading a DataFrame? If so, how is it done?

CodePudding user response:

Pickle by default will go and import the class.

In this case, if you do not have pandas installed when you run the second snippet, it won't work by default (see below for more info on that default behaviour).
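As a rough illustration of why an import is needed, the sketch below inspects a pickle byte stream with the standard-library pickletools module. It uses fractions.Fraction instead of a DataFrame so that it runs without pandas; that substitution is an assumption made only for the example.

```python
import pickle
import pickletools
from fractions import Fraction

# Pickle a standard-library object so the sketch runs without pandas.
payload = pickle.dumps(Fraction(1, 2))

# Walk the opcodes and keep every text argument stored in the stream.
text_args = [arg for _op, arg, _pos in pickletools.genops(payload)
             if isinstance(arg, str)]

# The module and class names are stored as text; the class's code is not.
names_module = any("fractions" in arg for arg in text_args)
names_class = any("Fraction" in arg for arg in text_args)
print(text_args)
print(names_module, names_class)
```

The stream names the module and the class, which is what pickle later uses to import the class when loading.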

Quick primer on pickling

Essentially, everything in Python is an instance of a class, in some shape or form.

When you make a DataFrame, such as when you use pandas.read_excel, you create an instance of the DataFrame class. To create that instance you need:

  • the class definition (containing information about methods and attributes)
  • something that creates the instance from some input data

You can create instances of a class normally by directly instantiating the class, or by using another method/function. Example:

# This makes a string, '12345' by directly invoking the str constructor
s = str(12345)

# This makes a list by using the split method of the string
l = s.split('3')

Pickle works just the same. When you unpickle, you need the class definition as well as the function which transforms some input data (your .pkl file) into the instance.
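The failure from the question can be reproduced without uninstalling anything. In this sketch, Frame and the module name fake_frames are hypothetical stand-ins for DataFrame and pandas; the module is registered by hand and then removed to simulate an uninstalled package.

```python
import pickle
import sys
import types

class Frame:
    """Hypothetical stand-in for a class that lives in a third-party package."""
    def __init__(self, rows):
        self.rows = rows

# Pretend Frame was defined in an installed module called "fake_frames".
Frame.__module__ = "fake_frames"
Frame.__qualname__ = "Frame"
module = types.ModuleType("fake_frames")
module.Frame = Frame
sys.modules["fake_frames"] = module

payload = pickle.dumps(Frame([1, 2, 3]))

# Simulate uninstalling the package.
del sys.modules["fake_frames"]

try:
    pickle.loads(payload)
    error = None
except ModuleNotFoundError as exc:
    error = str(exc)

print(error)
```

Loading fails at the import step, before any instance data is restored.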

The class definition is not stored in the pickled data. The pickle only records which class to use (its module and name) together with the instance data, so neither the class's own code nor any of the supporting code around it travels with the file.

This means that even if you override the default behaviour, while you might be able to make an object that holds the DataFrame's data, it won't work as a DataFrame because you're missing pandas. When you try to invoke a method on it, Python will look for code that lives in the pandas package, which was never captured in the pickle, and the call will fail.

Can I override the default behaviour for unpickling?

Yes, you can override the import behaviour by using a custom unpickler. That's described in the official Python documentation for pickle, under "Restricting Globals".
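A minimal sketch of such a custom unpickler follows, assuming you are willing to fall back to an empty substitute class when the real one cannot be imported. Frame, fake_frames, Placeholder and LenientUnpickler are all hypothetical names invented for this example; only pickle.Unpickler and its find_class method come from the standard library.

```python
import io
import pickle
import sys
import types

class Frame:
    """Hypothetical class standing in for one from a third-party package."""
    def __init__(self, rows):
        self.rows = rows

    def total(self):
        return sum(self.rows)

# Register Frame under a made-up module name, pickle it, then "uninstall" it.
Frame.__module__ = "fake_frames"
Frame.__qualname__ = "Frame"
module = types.ModuleType("fake_frames")
module.Frame = Frame
sys.modules["fake_frames"] = module
payload = pickle.dumps(Frame([1, 2, 3]))
del sys.modules["fake_frames"]

class Placeholder:
    """Empty substitute used when the real class cannot be imported."""

class LenientUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Try the normal import first; fall back to the substitute on failure.
        try:
            return super().find_class(module, name)
        except ImportError:
            return Placeholder

restored = LenientUnpickler(io.BytesIO(payload)).load()
print(type(restored).__name__, restored.rows)
print(hasattr(restored, "total"))
```

The data comes back, but the object is a Placeholder with none of the original methods, which is why this trick would not give you a usable DataFrame without pandas.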

CodePudding user response:

I've run into a similar thing before where it needed a specific pandas version, but I didn't investigate. Running across your post here, I read some of the documentation and came across this line:

When a class instance is unpickled, its __init__() method is usually not invoked. The default behaviour first creates an uninitialized instance and then restores the saved attributes.

https://docs.python.org/3.8/library/pickle.html#pickle-inst

So to unpickle an arbitrary class instance, pickle has to be able to access the class itself, because it needs the class to create the uninitialized instance. If the class isn't present, it can't do that.
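The default behaviour quoted above can be imitated by hand. The sketch below is not what pickle literally executes (pickle stores a reference to the class by name rather than the class object), and Point, save and restore are hypothetical names; it only illustrates that the class is required while __init__ is not called.

```python
class Point:
    init_calls = 0

    def __init__(self, x, y):
        Point.init_calls += 1
        self.x = x
        self.y = y

def save(obj):
    # What gets recorded: the class and a copy of the instance attributes.
    return obj.__class__, dict(obj.__dict__)

def restore(cls, attributes):
    # Create an uninitialized instance, then put the attributes back.
    obj = cls.__new__(cls)
    obj.__dict__.update(attributes)
    return obj

original = Point(1, 2)
cls, attributes = save(original)
restored = restore(cls, attributes)
print(restored.x, restored.y, Point.init_calls)
```

The counter stays at one because only the original construction ran __init__; restore still cannot work without cls being available.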

That same page also says:

Similarly, when class instances are pickled, their class’s code and data are not pickled along with them. Only the instance data are pickled.

If I make a pandas DataFrame, I can access df.__class__ which will return pandas.core.frame.DataFrame

Putting this all together on that page, here's what I think happens:

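  • Pickling df saves the instance data together with a reference to its class (the module and class name that df.__class__ reports)
  • Unpickling imports that module and looks up the class so it can create the instance and restore its state (for example via __setstate__)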
  • If the module containing this class definition can't be found: error!
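One way to check the first bullet is to look at what an ordinary object reports through __reduce_ex__, the hook pickle consults when serializing it. Table is a hypothetical class used only for illustration, and the tuple positions read below reflect CPython's default behaviour for plain classes, so treat this as a sketch rather than a guarantee.

```python
import pickle

class Table:
    """Hypothetical class used only for illustration."""
    def __init__(self, rows):
        self.rows = rows

obj = Table([1, 2, 3])

# Ask the object how it wants to be pickled.
reduced = obj.__reduce_ex__(pickle.DEFAULT_PROTOCOL)

recorded_class = reduced[1][0]   # the class to re-create the instance from
recorded_state = reduced[2]      # the instance attributes to restore
print(recorded_class.__module__, recorded_class.__qualname__)
print(recorded_state)
```

Only the class identity and the attribute values appear here; none of the class's methods are part of what gets saved.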

Short answer: the pickle saves which class the object belongs to, and loading it imports pandas to find that class.
