Extract seasons and years from a string column in pandas

Time:03-12

Is there another way to extract the year from a column and create two new columns from it, one for the season and one for the year?

I tried this method and it seems to work, but only for the year and only for some rows:

year = df['premiered'].str.findall(r'(\d{4})').str.get(0)
df1 = df.assign(year = year.values)

Output:

| premiered   | year |
|-------------|------|
| Spring 1998 | 1998 |
| Spring 2001 | 2001 |
| Fall 2016   | NaN  |
| Fall 2016   | NaN  |

CodePudding user response:

Use Series.str.split with the expand option:

expand: Expand the split strings into separate columns.

df[['season', 'year']] = df['premiered'].str.split(expand=True)

#      premiered  season  year
# 0  Spring 1998  Spring  1998
# 1  Spring 2001  Spring  2001
# 2    Fall 2016    Fall  2016
# 3    Fall 2016    Fall  2016
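
As a self-contained sketch of the split approach, the snippet below rebuilds the sample frame from the question and adds `n=1`. That argument is an addition, not part of the answer above: it caps the split at the first run of whitespace, so a value with extra words still yields only two columns.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    {"premiered": ["Spring 1998", "Spring 2001", "Fall 2016", "Fall 2016"]}
)

# Split on whitespace; n=1 limits the result to at most two pieces per row
parts = df["premiered"].str.split(n=1, expand=True)

# Assign the two resulting columns to the new column names
df[["season", "year"]] = parts
print(df)
```

Both new columns hold strings at this point, so the year still needs a numeric conversion if you want to sort or compare it as a number.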

Or use Series.str.extract with a regex:

  • (\w+) -- capture 1+ word characters
  • \s* -- 0+ whitespace characters
  • (\d+) -- capture 1+ digits

df[['season', 'year']] = df['premiered'].str.extract(r'(\w+)\s*(\d+)')

#      premiered  season  year
# 0  Spring 1998  Spring  1998
# 1  Spring 2001  Spring  2001
# 2    Fall 2016    Fall  2016
# 3    Fall 2016    Fall  2016
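
A variation on the regex approach, sketched here with named groups so that `str.extract` labels the output columns itself. The `\d{4}` quantifier and the `"no date"` row are assumptions added for illustration; they show that a row which does not match the pattern comes back as NaN instead of raising.

```python
import pandas as pd

# The last row is a made-up value that does not match the pattern
df = pd.DataFrame(
    {"premiered": ["Spring 1998", "Spring 2001", "Fall 2016", "no date"]}
)

# Named groups become the column names of the extracted frame
pattern = r"(?P<season>\w+)\s*(?P<year>\d{4})"
extracted = df["premiered"].str.extract(pattern)

# Attach the extracted columns to the original frame by index
df = df.join(extracted)
print(df)
```

Joining on the index keeps each extracted value aligned with the row it came from.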

Also it would be a good idea to convert the new year column to numeric:

df['year'] = df['year'].astype(int)
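
Note that `astype(int)` raises if any year is missing. A minimal sketch of a more forgiving conversion, assuming some rows may have failed to parse (the `None` entry below is invented for illustration): `pd.to_numeric` with `errors="coerce"` turns unparseable values into NaN, and the nullable `Int64` dtype keeps the rest as integers.

```python
import pandas as pd

# Year strings as produced by split/extract; None stands in for a failed match
years = pd.Series(["1998", "2001", None, "2016"])

# Coerce bad values to NaN, then use the nullable integer dtype
numeric = pd.to_numeric(years, errors="coerce").astype("Int64")
print(numeric)
```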

CodePudding user response:

You could use Python's str.split inside apply:

import pandas as pd

data = {'premiered': ['Spring 1998', 'Spring 2001', 'Fall 2016', 'Fall 2016']}
df = pd.DataFrame(data)
# Take the second space-separated token of each value as the year
df['year'] = df['premiered'].apply(lambda x: x.split(' ')[1])
df