Pandas to_datetime inconsistent conversions

Time:10-04

Here's an MRE to demonstrate the issue:

import pandas as pd

a = pd.DataFrame({'date': ['2021-01-01 00:00:00+1:00', '2021-01-01 00:00:01+1:00']})
b = pd.DataFrame({'date': ['2021-01-01 00:00:00+1:00', '2021-01-01 00:00:01+2:00']})

a['date'] = pd.to_datetime(a.date)
b['date'] = pd.to_datetime(b.date)
a.date.iloc[-1]
# gives Timestamp('2021-01-01 00:00:01+0100', tz='pytz.FixedOffset(60)')

b.date.iloc[-1]
# datetime.datetime(2021, 1, 1, 0, 0, 1, tzinfo=tzoffset(None, 7200))

So a, whose datetime strings all share the same UTC offset, is converted to Timestamp objects, while b, whose datetime strings carry two different offsets, is converted to datetime.datetime objects.

This is a problem because I want to use the .dt accessor on a DataFrame like b to convert timezones, but this (apparent) bug is stopping me.

CodePudding user response:

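The issue is most likely that a['date'] can be stored as a proper datetime array behind the scenes, because every value has the same offset. The values in b['date'] have different offsets, so they have to be stored as an object column of individual Python datetime objects, and the .dt accessor is not available on such a column.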
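A minimal sketch of that storage difference. It builds the two columns directly from Python datetime objects rather than going through pd.to_datetime, because how to_datetime treats mixed offsets without utc=True can differ between pandas versions. The names same_offset, mixed_offset and dt_accessor_works are illustrative only.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

plus_one = timezone(timedelta(hours=1))
plus_two = timezone(timedelta(hours=2))

# Same offset everywhere: pandas can use one timezone-aware datetime dtype.
same_offset = pd.Series([
    datetime(2021, 1, 1, 0, 0, 0, tzinfo=plus_one),
    datetime(2021, 1, 1, 0, 0, 1, tzinfo=plus_one),
])

# Two different offsets: no single datetime dtype fits, so the dtype is object.
mixed_offset = pd.Series([
    datetime(2021, 1, 1, 0, 0, 0, tzinfo=plus_one),
    datetime(2021, 1, 1, 0, 0, 1, tzinfo=plus_two),
])

print(same_offset.dtype)
print(mixed_offset.dtype)

# The .dt accessor only exists for datetime-like dtypes.
try:
    mixed_offset.dt.tz_convert("UTC")
    dt_accessor_works = True
except AttributeError:
    dt_accessor_works = False
print(dt_accessor_works)
```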

One way to go about this is to use the utc=True argument:

pd.to_datetime(b.date, utc=True)

https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
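Once the column is in UTC, the .dt accessor can be used to convert it to another timezone. A small sketch, assuming zero-padded offsets (+01:00, +02:00) in the input strings and a fixed +01:00 offset as the target; an IANA name such as 'Europe/Berlin' could be passed to tz_convert instead. The 'local' column name is made up for the example.

```python
from datetime import timedelta, timezone

import pandas as pd

# Same data as b in the question, with zero-padded offsets (an assumption).
b = pd.DataFrame(
    {"date": ["2021-01-01 00:00:00+01:00", "2021-01-01 00:00:01+02:00"]}
)

# Normalise every value to UTC so the column gets a single datetime dtype.
b["date"] = pd.to_datetime(b["date"], utc=True)

# .dt now works; convert to a fixed +01:00 offset as an example target.
target = timezone(timedelta(hours=1))
b["local"] = b["date"].dt.tz_convert(target)
print(b)
```

Note that the original per-row offsets are not kept after utc=True; only the instant in time is preserved.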

Here is some test code:

a['date'] = pd.to_datetime(a.date, utc=True)
b['date'] = pd.to_datetime(b.date, utc=True)
aval = a.date.iloc[-1]
print(aval)
print(type(aval))
print(a['date'].dt)
print(a['date'])

bval = b.date.iloc[-1]
print(bval)
print(type(bval))
print(b['date'].dt)
print(b['date'])

Output:

2020-12-31 23:00:01+00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<pandas.core.indexes.accessors.DatetimeProperties object at 0x7f4aa41f3c40>
0   2020-12-31 23:00:00+00:00
1   2020-12-31 23:00:01+00:00
Name: date, dtype: datetime64[ns, UTC]
2020-12-31 22:00:01+00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<pandas.core.indexes.accessors.DatetimeProperties object at 0x7f4aa41f3b20>
0   2020-12-31 23:00:00+00:00
1   2020-12-31 22:00:01+00:00
Name: date, dtype: datetime64[ns, UTC]

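But do make sure that the conversion to UTC is done properly, i.e., that each timestamp is shifted by its own offset. In the output above, 00:00:01 at +1:00 becomes 23:00:01 UTC and 00:00:01 at +2:00 becomes 22:00:01 UTC, which looks right, but I didn't check beyond that.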
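One way to do that check is to compare the pandas result with an independent parse from the standard library. This is only a sketch, again assuming zero-padded offsets so that datetime.fromisoformat accepts the strings; raw, parsed, expected and matches are names chosen for the example.

```python
from datetime import datetime, timezone

import pandas as pd

raw = ["2021-01-01 00:00:00+01:00", "2021-01-01 00:00:01+02:00"]

# pandas: parse and normalise to UTC in one step.
parsed = pd.to_datetime(pd.Series(raw), utc=True)

# Reference: parse each string with the standard library, then shift to UTC.
expected = [datetime.fromisoformat(s).astimezone(timezone.utc) for s in raw]

# Compare element by element; each entry is True when both agree.
matches = [p == e for p, e in zip(parsed, expected)]
print(matches)
```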
