My dataframe looks like:
School | Term | Students |
---|---|---|
A | summer 2020 | 324 |
B | spring 21 | 101 |
A | summer/spring | 201 |
F | wintersem | 44 |
C | fall trimester | 98 |
E | | 23 |
I need to add a new column, Termcode, that takes one of six values (summer, spring, fall, winter, multiple, none) based on the corresponding value in the Term column, like this:
School | Term | Students | Termcode |
---|---|---|---|
A | summer 2020 | 324 | summer |
B | spring 21 | 101 | spring |
A | summer/spring | 201 | multiple |
F | wintersem | 44 | winter |
C | fall trimester | 98 | fall |
E | | 23 | none |
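For anyone who wants to reproduce this, the sample frame can be built roughly as follows. This is a sketch: it assumes the empty Term cell for school E is a missing value (NaN) rather than an empty string, which is what the outputs in the answers suggest.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the missing Term for school E is
# assumed to be NaN (an assumption, the question only shows a blank cell).
df = pd.DataFrame({
    "School": ["A", "B", "A", "F", "C", "E"],
    "Term": ["summer 2020", "spring 21", "summer/spring",
             "wintersem", "fall trimester", np.nan],
    "Students": [324, 101, 201, 44, 98, 23],
})
print(df)
```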
CodePudding user response:
You can use a regex with str.extractall to find every term in each row, then fill in the values depending on the number of distinct matches:
terms = ['summer', 'spring', 'fall', 'winter']
regex = r'(' + '|'.join(terms) + r')'
# '(summer|spring|fall|winter)'
# extract values and set up grouper for next step
g = df['Term'].str.extractall(regex)[0].groupby(level=0)
# get the first match, replace with "multiple" if more than one
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')
# fill the missing data (i.e. no match) with "none"
df['Termcode'] = df['Termcode'].fillna('none')
output:
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 none
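The same steps can be wrapped into a small helper, which makes it easier to see what each intermediate object holds. This is only a sketch: the function name add_termcode and the tiny sample frame are made up for illustration, and the explicit reindex is an extra step added here to make the index alignment visible.

```python
import pandas as pd

def add_termcode(df, terms=("summer", "spring", "fall", "winter")):
    """Hypothetical helper: return a copy of df with a Termcode column."""
    regex = "(" + "|".join(terms) + ")"
    # extractall returns one row per match, indexed by (original row, match number);
    # rows with no match (including NaN) simply do not appear.
    matches = df["Term"].str.extractall(regex)[0]
    g = matches.groupby(level=0)  # level 0 is the original row label
    # first match per row, replaced by "multiple" when several distinct terms matched
    code = g.first().mask(g.nunique().gt(1), "multiple")
    out = df.copy()
    # rows absent from `code` had no match at all, so they become "none"
    out["Termcode"] = code.reindex(df.index).fillna("none")
    return out

sample = pd.DataFrame({"Term": ["summer 2020", "summer/spring", "wintersem", None, "autumn"]})
result = add_termcode(sample)
print(result)
```

Because the check uses nunique, a value such as "summer/summer" would still be coded as summer rather than multiple.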
CodePudding user response:
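Alternatively, use Series.str.findall to collect every matching term per row, then choose the code with np.where. Note that rows with no match are left as NaN rather than "none":
import numpy as np
l = ['summer', 'spring', 'fall', 'winter']
s = df['Term'].str.findall('|'.join(l))
# more than one match -> 'multiple', otherwise the first match (NaN if there is none)
df['Termcode'] = np.where(s.str.len() > 1, 'multiple', s.str[0])
Output: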
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 NaN
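To get "none" instead of NaN with the findall approach, one option is to fill the gaps afterwards. The sketch below uses a small made-up frame and Series.mask in place of np.where so that fillna can be chained. It is one possible variant, not the answer's own code.

```python
import numpy as np
import pandas as pd

terms = ["summer", "spring", "fall", "winter"]
# Small illustrative frame (assumed data, mirroring the question's cases)
df = pd.DataFrame({"Term": ["summer 2020", "summer/spring", "wintersem", np.nan]})

# List of all matches per row; NaN input stays NaN
found = df["Term"].str.findall("|".join(terms))
# First match per row; NaN when the list is empty or the input was NaN
first = found.str[0]
# More than one match -> "multiple"; anything still missing -> "none"
df["Termcode"] = first.mask(found.str.len() > 1, "multiple").fillna("none")
print(df)
```

Unlike the extractall version, this counts matches rather than distinct terms, so "summer/summer" would be coded as multiple.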