Is there a more elegant way to catch multiple date parsing formats in python-CodePudding

I'm working on learning to clean datasets and am encountering multiple date formats. There appears to be only two formats datetimes were stored as. The below code works fine, but feels clunky. Is there a better (or faster) way to handle multiple formats in a DataFrame? If it were only one format, it would be doable in just one line and I like avoiding try/except statements if I can.

For reference, the data goes from 1963 to 2010, so declaring the date_parsed column as 2023 seemed safe to me. Thanks!

data['date_parsed'] = pd.to_datetime('01/01/2023', format='%m/%d/%Y')

for i in range(len(data.Date)):

try:
    data['date_parsed'][i] = pd.to_datetime(data.Date[i], format='%m/%d/%Y')

except:
    data['date_parsed'][i] = pd.to_datetime(data.Date[i], format='%Y-%m-%dT%H:%M:%S.%fZ')

CodePudding user response：

Use Series.fillna for replace missing values from non matched values in to_datetime by parameter errors='coerce':

d1 = pd.to_datetime(data.Date, format='%m/%d/%Y', errors='coerce')
d2 = pd.to_datetime(data.Date, format='%Y-%m-%dT%H:%M:%S.%fZ', errors='coerce')

data['date_parsed'] = d1.fillna(d2)

CodePudding user response：

You can use functools.reduce with a list comprehension of pandas.to_datetime and your various formats, to handle any number of formats:

from functools import reduce

formats = ['%m/%d/%Y', '%Y-%m-%dT%H:%M:%S.%fZ']

data['date_parsed'] = reduce(lambda a,b: a.fillna(b),
                             [pd.to_datetime(data['Date'],
                                             format=fmt, errors='coerce',
                                             utc=fmt.endswith('Z'))
                              for fmt in formats])

Note that if you have a Z in the format, you need utc=True.

Example:

                          Date                       date_parsed
0                   01/01/2023               2023-01-01 00:00:00
1  2023-01-01T12:00:00.123456Z  2023-01-01 12:00:00.123456 00:00

CodePudding user response：

Basically, you can rely on infer_datetime_format parameter:

infer_datetime_format: bool, default False

If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

df = pd.DataFrame({'date_parsed': ['01/01/2023', '2023-01-01T14:53:30.200Z']})
pd.to_datetime(df['date_parsed'], infer_datetime_format=True)

0                 2023-01-01 00:00:00
1    2023-01-01 14:53:30.200000 00:00