My column "released" contains strings such as "Nov 21, 2000", "Nov, 2000", and "2000". I want to create three columns: "year", "month", and "day". When the month or the day is missing, the value should be None.
I cannot use pd.to_datetime, because it would turn my string "Nov, 2000" into "Nov, 01, 2000" and my string "2000" into "Jan, 01, 2000". So currently I use this code:
df["year"] = df.released.str[-4:]
df["month_date"] = df.released.str[:-6]
df["month"] = df.month_date.str[0:3]
df["day"] = df.month_date.apply(lambda row: row[-2:] if len(row) > 5 else None)
del df["month_date"]
But I would like to know whether there is a cleverer way to achieve this.
Here is some of the data in "released":
0 Nov 21, 2000
1 Sep 08, 2000
2 Nov, 1999
3 Nov 26, 1998
4 Apr, 2002
...
2699 2001
2700 Mar 26, 1998
2701 Oct 07, 1999
2702 Nov 20, 1995
2703 Mar 15, 1996
CodePudding user response:
You can extract the day, month, and year from the dataframe using regex (the column is named "date" in this example; substitute "released" for your data):
df['year'] = df['date'].str.extract(r'(\d{4})')
df['day'] = df['date'].str.extract(r'\s(\d{2}),')
df['month'] = df['date'].str.extract(r'([a-zA-Z]+)')
Or you can do the same in one line using pandas extractall:
p = r'(\d{4})|\s(\d{2}),|([a-zA-Z]+)'
df[['year', 'day', 'month']] = df['date'].str.extractall(p).groupby(level=0).first()
This gives us
           date  year  day month
0  Nov 21, 2000  2000   21   Nov
1  Sep 08, 2000  2000   08   Sep
2     Nov, 1999  1999  NaN   Nov
3  Nov 26, 1998  1998   26   Nov
4     Apr, 2002  2002  NaN   Apr
5          2001  2001  NaN   NaN
6  Mar 26, 1998  1998   26   Mar
7  Oct 07, 1999  1999   07   Oct
8  Nov 20, 1995  1995   20   Nov
9  Mar 15, 1996  1996   15   Mar
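As a variation on the regex approach above, a single anchored pattern with named groups might also work, so that str.extract returns all three columns in one pass. This is only a sketch: it assumes every value has one of the three shapes from the question ("Nov 21, 2000", "Nov, 1999", "2001"), and the names pattern and parts are made up for illustration.

```python
import pandas as pd

# Small sample taken from the question's "released" column.
df = pd.DataFrame(
    {"released": ["Nov 21, 2000", "Sep 08, 2000", "Nov, 1999", "2001"]}
)

# Assumed layout: an optional "Mon" or "Mon DD" prefix followed by ", ",
# then a mandatory four-digit year. Named groups become column names.
pattern = (
    r"^(?:(?P<month>[A-Za-z]{3})(?:\s(?P<day>\d{1,2}))?,\s)?"
    r"(?P<year>\d{4})$"
)

# str.extract returns one column per group; groups that did not
# participate in the match come back as NaN.
parts = df["released"].str.extract(pattern)
df = df.join(parts[["year", "month", "day"]])

print(df)
```

Because the pattern is anchored with ^ and $, a string in any other shape yields NaN in all three columns rather than a partial match, which can make unexpected formats easier to spot. The extracted values remain strings, as in the output above.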