I have a file with a "Collection date" column. Most of the dates in this column are written as YYYY-MM-DD, for example "2020-03-23", but some entries only have the year (e.g. "2020").
For example:
0 2020
1 2020-03-23
2 2020-12-11
3 2020-04-10
4 2020-04-03
I want to find the entries that only have the year and convert them to NaT, so that I can then pull those entries out to a separate file.
I thought I would be able to do this with pandas' pd.to_datetime. I've tried the following:
df["Collection date"]=pd.to_datetime(df["Collection date"], format='%Y-%m-%d', errors='coerce', exact=True)
However this converts the "2020" entry to 2020-01-01, rather than NaT. I thought this would work as I've specified the Y-m-d format, and that it must be an exact match, but I'm obviously missing something here.
Can anyone suggest how I can get the "2020" entry replaced with NaT, rather than converted to a date? 2020-01-01 is not the correct date!
CodePudding user response:
The easiest solution would be to check the length of each value in the date column, treating the values as strings. If the length is less than or equal to 4 (len("2020") == 4), the value should be replaced with NaT; otherwise the entry is okay.
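A minimal sketch of that length check, assuming the column holds strings (the sample frame below is made up from the values in the question, and the names raw and full_dates are just for illustration):

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame(
    {"Collection date": ["2020", "2020-03-23", "2020-12-11", "2020-04-10", "2020-04-03"]}
)

# Treat every value as a string so that .str.len() can be used
raw = df["Collection date"].astype(str)

# Keep values longer than 4 characters; year-only values become NaN
full_dates = raw.where(raw.str.len() > 4)

# NaN is converted to NaT, the remaining strings are parsed as dates
df["Collection date"] = pd.to_datetime(full_dates, format="%Y-%m-%d", errors="coerce")
```

Because the year-only values are masked out before parsing, this should not depend on how to_datetime treats a bare "2020".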
CodePudding user response:
You can convert to pd.NaT by applying a lambda:
df["Collection date"] = df["Collection date"].apply(
    lambda x: pd.to_datetime(x if "-" in x else pd.NaT, errors="coerce")
)
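Once the year-only entries are NaT, the rows can be split with isna(). The following is only a sketch: the helper column "Collection date (raw)" and the file name in the final comment are hypothetical, and the sample frame is built from the values in the question.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"Collection date": ["2020", "2020-03-23", "2020-12-11"]})

# Keep the original text so the year is still available after conversion
# (hypothetical helper column, not part of the original data)
df["Collection date (raw)"] = df["Collection date"]

# Values without a "-" are treated as year-only and become NaT
df["Collection date"] = df["Collection date"].apply(
    lambda x: pd.to_datetime(x if "-" in x else pd.NaT, errors="coerce")
)

# Split the frame into year-only rows and rows with a full date
mask = df["Collection date"].isna()
year_only = df[mask]
full_dates = df[~mask]

# The year-only rows could then be written out, for example:
# year_only.to_csv("year_only.csv", index=False)  # hypothetical file name
```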