I have the following text:
text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
I need to extract all the sport names (which come after sport:) and styles (which come after style:) and create new columns named sports and style. I'm trying the following code to extract the main sentence (sometimes the text is huge):
m = re.split(r'(?<=\.)\s+(?=[A-Z]\w+)', text_main)
text = list(filter(lambda x: re.search(r'leagues were identified', x, flags=re.IGNORECASE), m))[0]
print(text)
The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d.
Then I'm extracting the sport and style names and putting them into a dataframe:
if 'sport:' in text:
    sport_list = re.findall(r'sport:\W*(\w+)', text)
    df = pd.DataFrame({'sports': sport_list})
    print(df)
sports
0 basketball
1 soccer
2 football
However, I'm having trouble extracting the styles, as all the styles have a period (.) after the first letter (c) and a few have a > sign. Also, not all the sports have style info.
Desired output:
sports style
0 basketball c.123>d
1 soccer NA
2 football c.124>d
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
CodePudding user response:
You can use
\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?
See the regex demo. Details:
\b - a word boundary
sport: - a fixed string
\s* - zero or more whitespaces
(\w+) - Group 1: one or more word chars
(?: - start of an optional non-capturing group:
(?:(?!\bsport:).)*? - any char other than line break chars, zero or more occurrences but as few as possible, that does not start a whole word sport: char sequence
\bstyle: - a whole word style and then :
\s* - zero or more whitespaces
(\S+) - Group 2: one or more non-whitespace chars
)? - end of the optional non-capturing group.
See the Python demo:
import re
import pandas as pd
text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
matches = re.findall(r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?', text_main)
df = pd.DataFrame(matches, columns=['sports', 'style'])
Output:
>>> df
sports style
0 basketball c.123>d
1 soccer
2 football c.124>d.
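The output above differs from the desired one in two small ways: the last style keeps the sentence-final period, and the missing style is an empty string rather than NA. One possible way to tidy that up is sketched below. It assumes a trailing period is never part of a style value, and extract_rows is a hypothetical helper name, not something from the original answer.

```python
import re

PATTERN = re.compile(r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?')

def extract_rows(text):
    """Return one dict per sport; 'style' is None when no style was found."""
    rows = []
    for m in PATTERN.finditer(text):
        style = m.group(2)
        if style is not None:
            # Assumption: a trailing period is sentence punctuation,
            # not part of the style itself.
            style = style.rstrip('.')
        # An empty string (nothing left after stripping) also becomes None.
        rows.append({'sports': m.group(1), 'style': style or None})
    return rows

text_main = ("The following leagues were identified: sport: basketball league: N.B.A. "
             "style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. "
             "style: c.124>d. The other leagues aren't that important.")

rows = extract_rows(text_main)
print(rows)
```

Passing rows to pd.DataFrame(rows) should then give the sports and style columns, with None standing in for the missing soccer style.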