I'm working on learning to clean datasets and am encountering multiple date formats. There appears to be only two formats datetimes were stored as. The below code works fine, but feels clunky. Is there a better (or faster) way to handle multiple formats in a DataFrame? If it were only one format, it would be doable in just one line and I like avoiding try/except statements if I can.
For reference, the data goes from 1963 to 2010, so declaring the date_parsed column as 2023 seemed safe to me. Thanks!
data['date_parsed'] = pd.to_datetime('01/01/2023', format='%m/%d/%Y')
for i in range(len(data.Date)):
try:
data['date_parsed'][i] = pd.to_datetime(data.Date[i], format='%m/%d/%Y')
except:
data['date_parsed'][i] = pd.to_datetime(data.Date[i], format='%Y-%m-%dT%H:%M:%S.%fZ')
CodePudding user response:
Use Series.fillna
for replace missing values from non matched values in to_datetime
by parameter errors='coerce'
:
d1 = pd.to_datetime(data.Date, format='%m/%d/%Y', errors='coerce')
d2 = pd.to_datetime(data.Date, format='%Y-%m-%dT%H:%M:%S.%fZ', errors='coerce')
data['date_parsed'] = d1.fillna(d2)
CodePudding user response:
You can use functools.reduce
with a list comprehension of pandas.to_datetime
and your various formats, to handle any number of formats:
from functools import reduce
formats = ['%m/%d/%Y', '%Y-%m-%dT%H:%M:%S.%fZ']
data['date_parsed'] = reduce(lambda a,b: a.fillna(b),
[pd.to_datetime(data['Date'],
format=fmt, errors='coerce',
utc=fmt.endswith('Z'))
for fmt in formats])
Note that if you have a Z
in the format, you need utc=True
.
Example:
Date date_parsed
0 01/01/2023 2023-01-01 00:00:00
1 2023-01-01T12:00:00.123456Z 2023-01-01 12:00:00.123456 00:00
CodePudding user response:
Basically, you can rely on infer_datetime_format
parameter:
infer_datetime_format: bool, default False
If
True
and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
df = pd.DataFrame({'date_parsed': ['01/01/2023', '2023-01-01T14:53:30.200Z']})
pd.to_datetime(df['date_parsed'], infer_datetime_format=True)
0 2023-01-01 00:00:00
1 2023-01-01 14:53:30.200000 00:00