Efficiently processing dates from strings of inconsistent date formatting in pandas dataframes?-CodePudding

Simplified huge df with date column of inconsistent string formatting containing errors:

df_length = 10000
df = pd.DataFrame({
        "to_ignore": np.random.randint(1, 500, df_length),
        "date": np.random.choice(["11 Nov 2018", "Feb 2019", "2021-11-02", "asdf"], df_length),
    })

We need to convert date col to datetime but can't find a solution that doesn't drop data or processes within a usable time. Tried formatting successively with errors='ignore':

df['date'] = pd.to_datetime(df['date'], format='%b %Y', errors='ignore')
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='ignore')

But with erroneous strings ("asdf") the col seems unaffected. Trying formats successively with errors='coerce' obviously drops data.

We tried dateparser, df['date'] = df['date'].apply(lambda x: dateparser.parse(x)), which kinda worked except it sometimes got date of month wrong (2019-02-02 should be 2019-02-01):

     to_ignore       date
0          115 2019-02-02
1          285        NaT
...

This is also prohibitively slow (play with df_length).

What's a good way to do this?

CodePudding user response：

Figured it out. df['date'] = pd.to_datetime(df['date'], errors='coerce') is performant and captures common formats. My question assumed this wasn't the case due to a formatting mistake I've corrected to help others avoid confusion.

If you need to capture dates in complex strings you can create a function to use dateparser.parse() as needed when matching regex expressions:

def date_process(x):
    if bool(re.search("^\D\D\D \d\d\d\d$", x)):
        return dt.datetime.strptime(x, "%b %Y")
    elif bool(re.search("^\d\d \D\D\D \d\d\d\d$", x)):
        return dt.datetime.strptime(x, "%d %b %Y")
    elif bool(re.search("^\d\d\d\d-\d\d-\d\d$", x)):
        return dt.datetime.strptime(x, "%Y-%m-%d")
    else:
        return dateparser.parse(x)

df['date'] = df['date'].apply(date_process)