Simplified huge df with date column of inconsistent string formatting containing errors:
df_length = 10000
df = pd.DataFrame({
"to_ignore": np.random.randint(1, 500, df_length),
"date": np.random.choice(["11 Nov 2018", "Feb 2019", "2021-11-02", "asdf"], df_length),
})
We need to convert date
col to datetime but can't find a solution that doesn't drop data or processes within a usable time. Tried formatting successively with errors='ignore'
:
df['date'] = pd.to_datetime(df['date'], format='%b %Y', errors='ignore')
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='ignore')
But with erroneous strings ("asdf") the col seems unaffected. Trying formats successively with errors='coerce'
obviously drops data.
We tried dateparser, df['date'] = df['date'].apply(lambda x: dateparser.parse(x))
, which kinda worked except it sometimes got date of month wrong (2019-02-02 should be 2019-02-01):
to_ignore date
0 115 2019-02-02
1 285 NaT
...
This is also prohibitively slow (play with df_length
).
What's a good way to do this?
CodePudding user response:
Figured it out. df['date'] = pd.to_datetime(df['date'], errors='coerce')
is performant and captures common formats. My question assumed this wasn't the case due to a formatting mistake I've corrected to help others avoid confusion.
If you need to capture dates in complex strings you can create a function to use dateparser.parse()
as needed when matching regex expressions:
def date_process(x):
if bool(re.search("^\D\D\D \d\d\d\d$", x)):
return dt.datetime.strptime(x, "%b %Y")
elif bool(re.search("^\d\d \D\D\D \d\d\d\d$", x)):
return dt.datetime.strptime(x, "%d %b %Y")
elif bool(re.search("^\d\d\d\d-\d\d-\d\d$", x)):
return dt.datetime.strptime(x, "%Y-%m-%d")
else:
return dateparser.parse(x)
df['date'] = df['date'].apply(date_process)