Home > Net >  Converting pandas column to date, with many type of dates
Converting pandas column to date, with many type of dates

Time:12-16

how would you convert the below date column into a single formatted date column?

df = pd.DataFrame(data={'datecol': ["-",
                                    "44198",
                                    "2021/01/01", 
                                    "14.04.20",
                                    "2021-13-03"]})

print(df.dropna()) should return the result below:

      datecol
0  2021-01-02
1  2021-01-01
2  2020-04-14
3  2021-03-13

Thanks!

CodePudding user response:

  1. Convert all valid datetime formats using pd.to_datetime, specifying formats for unrecognised formats
  2. Convert all integer (Excel) dates.
  3. Combine both with fillna
parsed = pd.to_datetime(df["datecol"], errors="coerce").fillna(pd.to_datetime(df["datecol"],format="%Y-%d-%m",errors="coerce"))
ordinal = pd.to_numeric(df["datecol"], errors="coerce").apply(lambda x: pd.Timestamp("1899-12-30") pd.Timedelta(x, unit="D"))

df["datecol"] = parsed.fillna(ordinal)

>>> df
     datecol
0        NaT
1 2021-01-02
2 2021-01-01
3 2020-04-14
4 2021-03-13

CodePudding user response:

If a column contains multiple formats, you're going to need to parse the column multiple times with the different formats and use combine_first to combine the resulting information. Because we specify errors='coerce' the date format should only match one of the formats.

The other small complication is that some of your formats require you to just specify the format argument, but others would require the origin and unit parameters. We can take care of this passing a dict of kwargs to the pd.to_datetime function.

Note any numeric values will work with origin and unit so you can't use this method if your date column had values that represented different units with different offsets in the same column. You would need to provide other logic to indicate which units and offsets are pertinent to which rows in that case.

import pandas as pd
from functools import reduce 

kwl = [{'format': '%Y/%m/%d'},
       {'format': '%d.%m.%y'},
       {'format': '%Y-%d-%m'},
       {'format': '%Y/%m/%d'},
       {'unit': 'd', 'origin': '1899-12-30'}]]

l = []
for kwargs in kwl:
    if 'unit' in kwargs.keys():
        s = pd.to_numeric(df['datecol'], errors='coerce')
    else:
        s = df['datecol']
        
    l.append(pd.to_datetime(s, errors='coerce', **kwargs))

result = reduce(lambda l,r: l.combine_first(r), l)

print(result)
#0          NaT
#1   2021-01-02
#2   2021-01-01
#3   2020-04-14
#4   2021-03-13
Name: datecol, dtype: datetime64[ns]
  • Related