I've read about coercing errors and eyeballing data like this, but I was wondering if there's a better way to automate the cleanup:
I have a dataset that is regularly output by a system and then manually modified by some folks. Unfortunately, the modifications tend to be inconsistent with the original system outputs, so I end up manually standardizing the formatting before loading the data into a pandas dataframe. Is there a smarter way to do this -- i.e., something I'm not aware of in pandas or elsewhere that could clean this up for me?
Sample dataframe column with its messy data types:
START_DTTIME
-----
2021-11-01 17:10:00
2021-11-01 17:12:00
2021-11-01 17:15:00
11/3/2021
11/4/21
UNKNOWN
UNK
2021-10-04 14:01:20
10-5-21
10-30-2021
???
2021-10-05 14:03:21
The dataset is usually pretty manageable in size (no more than 100 records daily), so if absolutely necessary I could just write a function that loops through each record and checks for all the common variations (there are only so many ways one could type in a date, right?)... but that's a last resort; I wanted to check whether there's a "smarter" way before I do something wildly inefficient. :-)
If it helps, I only care about the DATE; the time is actually extraneous info produced by the system, but as you can observe in the non-standardized formatting, the manual inputs only consist of the date.
Thank you!
CodePudding user response:
Dataframe:
import pandas as pd

df = pd.DataFrame({'START_DTTIME': ['2021-11-01 17:10:00', '11/3/2021', '11/4/21', 'UNKNOWN', '10-30-2021', '???']})
Convert the column to datetime, coercing errors so that unparseable values become NaT, then keep only the rows that are not NaT:
df[pd.to_datetime(df['START_DTTIME'], errors='coerce').notna()]
START_DTTIME
0 2021-11-01 17:10:00
1 11/3/2021
2 11/4/21
4 10-30-2021
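Since you mentioned you only care about the date, here is a minimal sketch of the same approach that also strips the time portion. The parsed, mask, clean, and START_DATE names are just for illustration, and note that on pandas 2.x you may need to pass format='mixed' to pd.to_datetime when a column contains several date formats:

import pandas as pd

df = pd.DataFrame({'START_DTTIME': ['2021-11-01 17:10:00', '11/3/2021', '11/4/21', 'UNKNOWN', '10-30-2021', '???']})

# Parse everything in one pass; unparseable entries become NaT
# (on pandas 2.x, mixed formats may require format='mixed')
parsed = pd.to_datetime(df['START_DTTIME'], errors='coerce')

# Keep only the rows that parsed, and reduce each timestamp to its date
mask = parsed.notna()
clean = df.loc[mask].copy()
clean['START_DATE'] = parsed.loc[mask].dt.date

If you would rather keep a proper datetime dtype, parsed.loc[mask].dt.normalize() zeroes out the time instead of converting to Python date objects.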