Home > Enterprise >  Pandas mixed date column with Excel dates, floats, int, string dates - convert to datetime
Pandas mixed date column with Excel dates, floats, int, string dates - convert to datetime

Time:02-23

I am trying to convert a column with a real mix of date formats. I have tried a few things on SO but still not got a working solution. I have tried changing column to 'string', also tried converting the floats in int.

data

    date
1   43076.0
2   43077
3   07 Dec 2017
4   2021-12-22 00:00:00

code to try and fix the Excel dates and '07 Dec 2017' style

d = ['43076.0', '43077', '07 Dec 2017', '2021-12-22 00:00:00']
df = pd.DataFrame(d, columns=['date'])

date1 = pd.to_datetime(df['date'], errors='coerce', format='%d %a %Y')
date2 = pd.to_datetime(df['date'], errors='coerce', unit='D', origin='1899-12-30')
frame_clean[col] = date2.fillna(date1)

error

Name: StartDate, Length: 16189, dtype: object' is not compatible with origin='1899-12-30'; it must be numeric with a unit specified

I like this solution rather than using apply as to slow. But I am struggling to get it working.

Edit

Breaking down @FObersteiner solution for better understanding.

convert the simple dates

df['datetime'] = pd.to_datetime(df['date'], errors='coerce')

0          NaT
1          NaT
2   2018-12-07
3   2021-12-22

isolate the numeric rows

m = pd.to_numeric(df['date'], errors='coerce').notna()
m

0     True
1     True
2    False
3    False

convert numeric rows to floats

df['date'][m].astype(float)

0    43080.0
1    43077.0

convert numeric rows to floats and then dt objects

pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')

0   2017-12-11
1   2017-12-08

pull it alltogether and bring back the simple date rows

df.loc[m, 'datetime'] = pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')
print(df)

CodePudding user response:

For given example, use a mask to convert numeric and non-numeric data separately:

import pandas as pd

df = pd.DataFrame({'date':['43076.0', '43077', '07 Dec 2017', '2021-12-22 00:00:00']})

df['datetime'] = pd.to_datetime(df['date'], errors='coerce')

m = pd.to_numeric(df['date'], errors='coerce').notna()
df.loc[m, 'datetime'] = pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')

print(df)
                  date   datetime
0              43076.0 2017-12-07
1                43077 2017-12-08
2          07 Dec 2017 2017-12-07
3  2021-12-22 00:00:00 2021-12-22
  • Related