How to use pandas to extract incomplete dates

Time:08-30

I have a file with a "Collection date" column. Most of the dates in this column are written as YYYY-MM-DD, for example "2020-03-23", but some entries only have the year (e.g. "2020").

e.g.

0             2020
1       2020-03-23
2       2020-12-11
3       2020-04-10
4       2020-04-03

I want to find the entries that only have the year and convert them to NaT, so that I can then pull those entries out to a separate file.

I thought I would be able to do this with pd.to_datetime in pandas. I've tried the following:

df["Collection date"]=pd.to_datetime(df["Collection date"], format='%Y-%m-%d', errors='coerce', exact=True) 

However, this converts the "2020" entry to 2020-01-01 rather than NaT. I expected it to work because I specified the %Y-%m-%d format and asked for an exact match, but I'm obviously missing something here.

Can anyone suggest how I can get the "2020" entry replaced with NaT, rather than converted to a date? 2020-01-01 is not the correct date!

CodePudding user response:

The easiest solution would be to check the length of each value in the date column as a string. If the length is 4 or less (len("2020") == 4), the entry only has a year and should be replaced with NaT; otherwise the entry is okay.
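The length check could be sketched roughly as follows. The sample data mirrors the question, and the names raw, year_only and parsed are only illustrative. The sketch assumes the column may need casting to strings before its lengths can be measured.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame(
    {"Collection date": ["2020", "2020-03-23", "2020-12-11", "2020-04-10", "2020-04-03"]}
)

# Cast to str so that the length check is well defined for every value
raw = df["Collection date"].astype(str)

# True where the entry is at most 4 characters long, i.e. year only
year_only = raw.str.len() <= 4

# Replace year-only entries with NaN before parsing, so they come out as NaT
parsed = pd.to_datetime(raw.where(~year_only), format="%Y-%m-%d", errors="coerce")
```

Here parsed holds NaT for the year-only row and a timestamp for every full date, and year_only can be reused as a mask to select the incomplete rows.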

CodePudding user response:

You can convert to pd.NaT by applying a lambda:

# Entries without a "-" (year only) become NaT; str(x) guards against non-string values
df["Collection date"] = df["Collection date"].apply(
    lambda x: pd.to_datetime(x if "-" in str(x) else pd.NaT, errors="coerce")
)
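Since the goal in the question is to pull the year-only entries out separately, a vectorized alternative might look like the sketch below. The regular expression, the sample data and the names is_full_date, incomplete and complete are assumptions made for illustration, not part of either answer.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame(
    {"Collection date": ["2020", "2020-03-23", "2020-12-11", "2020-04-10", "2020-04-03"]}
)

# True only where the whole string looks like YYYY-MM-DD
raw = df["Collection date"].astype(str)
is_full_date = raw.str.fullmatch(r"\d{4}-\d{2}-\d{2}")

# Rows with incomplete dates, kept as originally written
incomplete = df[~is_full_date]

# Rows with full dates, parsed to datetime
complete = df[is_full_date].copy()
complete["Collection date"] = pd.to_datetime(
    complete["Collection date"], format="%Y-%m-%d", errors="coerce"
)
```

The incomplete rows could then be written out with something like incomplete.to_csv("year_only.csv", index=False), where the file name is hypothetical.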