I've read about coercing errors and eyeballing data like this, but I was wondering if there's a better way to automate the cleanup:
I have a dataset that is regularly output by a system and then manually modified by some folks. Unfortunately, the modifications tend to be inconsistent with the original system outputs, so I end up manually standardizing the formatting before loading the data into a pandas dataframe. Is there a smarter way to do this -- i.e., something I'm not aware of in pandas or elsewhere that could clean this up for me?
Sample dataframe column with its messy data types:
START_DTTIME
-----
2021-11-01 17:10:00
2021-11-01 17:12:00
2021-11-01 17:15:00
11/3/2021
11/4/21
UNKNOWN
UNK
2021-10-04 14:01:20
10-5-21
10-30-2021
???
2021-10-05 14:03:21
The dataset is usually pretty manageable in size (no more than 100 records daily), so if absolutely necessary I could just write a function that loops through each record and checks for all the common variations (there are only so many ways one could type in a date, right?)... but that's a last resort; I wanted to check whether there's a "smarter" way before I do something wildly inefficient. :-)
If it helps, I only care about the DATE; the time is actually extraneous info produced by the system, but as you can observe in the non-standardized formatting, the manual inputs only consist of the date.
Thank you!
CodePudding user response:
Dataframe:
import pandas as pd

df = pd.DataFrame({'START_DTTIME': ['2021-11-01 17:10:00', '11/3/2021', '11/4/21', 'UNKNOWN', '10-30-2021', '???']})
Convert the column to datetime, coercing errors so that unparseable values become NaT, then keep only the rows that are not NaT:
df[pd.to_datetime(df['START_DTTIME'], errors='coerce').notna()]
START_DTTIME
0 2021-11-01 17:10:00
1 11/3/2021
2 11/4/21
4 10-30-2021
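Since you mentioned you only care about the date, here is a minimal sketch of the same approach that also strips the time portion. The parsed, mask, clean, and START_DATE names are just for illustration, and note that on pandas 2.x you may need to pass format='mixed' to pd.to_datetime when a column contains several date formats:

import pandas as pd

df = pd.DataFrame({'START_DTTIME': ['2021-11-01 17:10:00', '11/3/2021', '11/4/21', 'UNKNOWN', '10-30-2021', '???']})

# Parse everything in one pass; unparseable entries become NaT
# (on pandas 2.x, mixed formats may require format='mixed')
parsed = pd.to_datetime(df['START_DTTIME'], errors='coerce')

# Keep only the rows that parsed, and reduce each timestamp to its date
mask = parsed.notna()
clean = df.loc[mask].copy()
clean['START_DATE'] = parsed.loc[mask].dt.date

If you would rather keep a proper datetime dtype, parsed.loc[mask].dt.normalize() zeroes out the time instead of converting to Python date objects.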