I've got this DataFrame, which is a description of sports and their leagues:
df = pd.DataFrame({
'text': ['The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL']
})
df.text.iloc[0]
===============
The following leagues were identified: sport: basketball league: NBA sport: soccer league: EPL sport: football league: NFL
I need to extract all the sport names (the words that come after sport:) and put them as a list in a new column, sports. I'm trying the following code:
pat = r'sport:\W+(?P<sports>(?:\w+))'
new = df.text.str.extract(pat, expand=True)
df.assign(**new)
===============
text sports
0 The following leagues were identified: sport: ... basketball
However, it's returning only the 1st occurrence of the sport as a standalone string, whereas I need all the sports as a list.
Desired output:
text sports
0 The following leagues were ... [basketball, soccer, football]
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
CodePudding user response:
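You can use Series.str.findall, which returns every match of the pattern in each row as a list (Series.str.extract only returns the first match):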
df['sports'] = df['text'].str.findall(r'sport:\W*(\w+)')
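To see what the pattern captures, here is a minimal sketch with the standard re module on the same sample string. With a single capture group, findall returns only the captured word, and \W* (zero or more non-word characters) tolerates a missing space after the colon. The league pattern is not part of the question and is only included to show that zero-width case (league:EPL).

```python
import re

text = ('The following leagues were identified: sport: basketball league: NBA '
        'sport: soccer league:EPL sport: football league: NFL')

# One capture group: findall returns the captured word, not the whole match.
sports = re.findall(r'sport:\W*(\w+)', text)

# Illustrative only: the same idea applied to the league names.
# \W* matches zero characters in "league:EPL", so EPL is still captured.
leagues = re.findall(r'league:\W*(\w+)', text)

print(sports)
print(leagues)
```

Series.str.findall applies the same matching to every row of the column.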
Pandas test:
import pandas as pd
df = pd.DataFrame({
'text': ['The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL']
})
Output:
>>> df['text'].str.findall(r'sport:\W*(\w+)')
0 [basketball, soccer, football]
Name: text, dtype: object
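If you would rather keep the named group from the question, one possible alternative (a sketch, not the only way) is Series.str.extractall, which returns one row per match, followed by grouping on the original row index to collect the matches into lists. Rows with no match would end up as NaN in the new column.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL']
})

# extractall returns one row per match, indexed by (original row, match number).
matches = df['text'].str.extractall(r'sport:\W*(?P<sports>\w+)')

# Collect the matches of each original row into a list; assignment aligns on the index.
df['sports'] = matches.groupby(level=0)['sports'].agg(list)

print(df['sports'].iloc[0])
```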