I am trying to convert a column with a real mix of date formats. I have tried a few things on SO but still not got a working solution. I have tried changing column to 'string', also tried converting the floats in int.
data
date
1 43076.0
2 43077
3 07 Dec 2017
4 2021-12-22 00:00:00
code to try and fix the Excel dates and '07 Dec 2017' style
d = ['43076.0', '43077', '07 Dec 2017', '2021-12-22 00:00:00']
df = pd.DataFrame(d, columns=['date'])
date1 = pd.to_datetime(df['date'], errors='coerce', format='%d %a %Y')
date2 = pd.to_datetime(df['date'], errors='coerce', unit='D', origin='1899-12-30')
frame_clean[col] = date2.fillna(date1)
error
Name: StartDate, Length: 16189, dtype: object' is not compatible with origin='1899-12-30'; it must be numeric with a unit specified
I like this solution rather than using apply as to slow. But I am struggling to get it working.
Edit
Breaking down @FObersteiner solution for better understanding.
convert the simple dates
df['datetime'] = pd.to_datetime(df['date'], errors='coerce')
0 NaT
1 NaT
2 2018-12-07
3 2021-12-22
isolate the numeric rows
m = pd.to_numeric(df['date'], errors='coerce').notna()
m
0 True
1 True
2 False
3 False
convert numeric rows to floats
df['date'][m].astype(float)
0 43080.0
1 43077.0
convert numeric rows to floats and then dt objects
pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')
0 2017-12-11
1 2017-12-08
pull it alltogether and bring back the simple date rows
df.loc[m, 'datetime'] = pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')
print(df)
CodePudding user response:
For given example, use a mask to convert numeric and non-numeric data separately:
import pandas as pd
df = pd.DataFrame({'date':['43076.0', '43077', '07 Dec 2017', '2021-12-22 00:00:00']})
df['datetime'] = pd.to_datetime(df['date'], errors='coerce')
m = pd.to_numeric(df['date'], errors='coerce').notna()
df.loc[m, 'datetime'] = pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')
print(df)
date datetime
0 43076.0 2017-12-07
1 43077 2017-12-08
2 07 Dec 2017 2017-12-07
3 2021-12-22 00:00:00 2021-12-22