Additional functionality of a datetime subclass is lost when inserted into a pandas DataFrame

Time:09-27

I want to extend the datetime class with some additional functionality. Following the guidance available (e.g., here and here), I have prepared the following class:

import datetime 

class CPyTime(datetime.datetime):
    def __new__(cls, year, month=1, day=1):
        return super().__new__(cls, year, month, day)

    # Additional constructors
    @classmethod
    def from_my_own_date(cls, year_string=None, month_string=None):
        year = int(year_string)
        month = int(month_string)
        obj = cls(year, month)
        assert isinstance(obj, cls), "{}: wrong object type returned".format(CPyTime.from_my_own_date.__name__)
        return obj

    @property
    def year_plus_month(self):
        return self.year + self.month

The class seems to work fine by itself, as shown in the following code snippet:

>>> my_date = CPyTime(2021, 10)
>>> my_date_custom = CPyTime.from_my_own_date("2021", "12")
>>> print(f"{my_date}, {my_date.year_plus_month}")
2021-10-01 00:00:00, 2031
>>> print(f"{my_date_custom}, {my_date_custom.year_plus_month}")
2021-12-01 00:00:00, 2033

>>> type(my_date)
<class '__main__.CPyTime'>

The problem is that when the class is used inside a pandas DataFrame, pandas seems to automatically convert the CPyTime objects to Timestamp, so the additional functionality of CPyTime is lost. The following code snippet shows the problem:

import pandas as pd 
pdf = pd.DataFrame(data=[[2021, 1], [2021, 2], [2021, 3], [2021, 4]], columns=["Year", "Month"])
pdf["OwnDate"] = pdf.apply(lambda row: CPyTime(row["Year"], row["Month"]), axis=1)

Then, the dataframe is created and contains the new column "OwnDate":

>>> pdf
   Year  Month    OwnDate
0  2021      1 2021-01-01
1  2021      2 2021-02-01
2  2021      3 2021-03-01
3  2021      4 2021-04-01

However, the data type of the "OwnDate" column is datetime64[ns], and the additional functionality of CPyTime is not available:

>>> pdf.dtypes
Year                int64
Month               int64
OwnDate    datetime64[ns]
dtype: object

>>> pdf["OwnDate"][0].year_plus_month

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.1\plugins\python-ce\helpers\pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
AttributeError: 'Timestamp' object has no attribute 'year_plus_month'

Can anyone please help to sort out this problem? Is it possible to use a derived datetime class in a pandas dataframe without actually losing the additional functionalities of the derived class?

CodePudding user response:

One option is to build the Series from a list and set its dtype explicitly to object, which stops pandas from inferring datetime64[ns] for the column:

import pandas as pd 
pdf = pd.DataFrame(data=[[2021, 1], [2021, 2], [2021, 3], [2021, 4]], columns=["Year", "Month"])

pdf["OwnDate"] = pd.Series(
    [CPyTime(row["Year"], row["Month"]) for _, row in pdf.iterrows()],
    dtype="object",
)

print(pdf.dtypes)
# Year        int64
# Month       int64
# OwnDate    object
# dtype: object

print(pdf["OwnDate"][0].year_plus_month)
# 2022
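
A variant of the same idea, as a minimal sketch rather than a definitive implementation: it builds the objects with zip instead of iterrows and then checks that every element in the column is still a CPyTime. The names dates, all_custom and sums are only illustrative, and the class is repeated (with a valid default month) so the snippet runs on its own.

```python
import datetime

import pandas as pd


class CPyTime(datetime.datetime):
    """Same subclass as in the question, repeated so the sketch is self-contained."""

    def __new__(cls, year, month=1, day=1):
        return super().__new__(cls, year, month, day)

    @property
    def year_plus_month(self):
        return self.year + self.month


pdf = pd.DataFrame({"Year": [2021, 2021, 2021, 2021], "Month": [1, 2, 3, 4]})

# Build one CPyTime per row; int() turns NumPy integers into plain Python ints.
dates = [CPyTime(int(y), int(m)) for y, m in zip(pdf["Year"], pdf["Month"])]

# dtype=object asks pandas to store the Python objects unchanged.
pdf["OwnDate"] = pd.Series(dates, index=pdf.index, dtype=object)

# Check that the subclass survived and that the extra property is usable.
all_custom = all(isinstance(d, CPyTime) for d in pdf["OwnDate"])
sums = [d.year_plus_month for d in pdf["OwnDate"]]
print(all_custom, sums)
```

Keep in mind that an object column gives up the vectorised datetime features (for example the .dt accessor), so this is a trade-off rather than a free fix.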

See also "How to prevent Pandas from converting datetimes to datetime64".
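
As for why the conversion happens in the first place: a likely explanation (an assumption about how pandas infers column types, not something confirmed above) is that CPyTime instances still pass an isinstance check against datetime.datetime, so they look like ordinary datetimes during inference. The check itself can be reproduced with the standard library alone; is_datetime and is_exact are illustrative names.

```python
import datetime


class CPyTime(datetime.datetime):
    """Reduced version of the subclass from the question."""

    def __new__(cls, year, month=1, day=1):
        return super().__new__(cls, year, month, day)

    @property
    def year_plus_month(self):
        return self.year + self.month


d = CPyTime(2021, 10)

# A subclass instance is also an instance of the base class ...
is_datetime = isinstance(d, datetime.datetime)

# ... even though its exact type is not datetime.datetime.
is_exact = type(d) is datetime.datetime

print(is_datetime, is_exact, d.year_plus_month)
```

Code that only asks "is this a datetime?" cannot tell the subclass apart, which is why forcing dtype=object is needed to keep the original objects.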
