Home > Enterprise >  Pandas DataFrame with dates far in the future
Pandas DataFrame with dates far in the future

Time:08-26

I need to store some dates which are beyond the year-2262-limit. Since I only need dates and no times I thought I could get away with using datetime.date, which did work fine until I realised that later in the code there is some forced coercion of data types and I am unable to get around that. Example:

import pandas as pd
import datetime
data = pd.DataFrame({"col1": ["2022-12-31", "9999-12-31"]})
data["col2"] = data["col1"].apply(lambda x: datetime.date.fromisoformat(x))

# .... company code ....

# forced coercion
data.astype(dtype={"col1": str, "col2": datetime.date})

Among others, I have tried to supply the following types for the coercion of col2, to no avail.

  • datetime.date -> dtype '<class 'datetime.date'>' not understood
  • np.datetime64 -> Out of bounds nanosecond timestamp: 9999-12-31 00:00:00

I have also tried to convert with data["col2"].dt.to_pydatetime(), which yields a "Can only use .dt accessor with datetimelike values" error.

I suppose what baffles me the most is that the astype() documentation says that one can supply any numpy or Python data type, and my understanding is that datetime.date is a Python data type. So why wouldn't astype() know it?

CodePudding user response:

Yeah, it's not exactly true that you can supply any valid Python type (and I don't think the docs claim this). I guess what is meant is that you can supply both numpy types as well as the extended dtypes added by Pandas (or other libraries), listed in this table.

In your case you might want to use a Period, which is one of the suggestions in the docs for this case. So it would be something like:

def convert_to_date(x):
    y, m, d = [int(n) for n in x.split('-')]
    return pd.Period(year=y, month=m, day=d, freq="D")

data = pd.DataFrame({"col1": ["2022-12-31", "9999-12-31"]})
data["col2"] = data.col1.apply(convert_to_date)

data.astype(dtype={"col1": str, "col2": 'period[D]'})

This way you can use vectorised operations (ie it's not an object dtype like you had with datetime.date), but you can also store dates outside the bounds of pandas datetime64[ns] dtype.

  • Related