I'm trying to split a UFC record column into multiple columns and am having trouble. The data looks like this
record
1 22–8–1
2 18–7–1
3 12–4
4 8–2 (1 NC)
5 23–9–1
6 23–12
7 19–4–1
8 18–5–1 (1 NC)
The first number is wins, the second losses. If there is a third it is the draws, and if there is a parenthesis and a number it is the "no contests". I want to split it up and have it look like this.
wins loses draws no_contests
1 22 8 1 NaN
2 18 7 1 NaN
3 12 4 NaN NaN
4 8 2 NaN 1
5 23 9 1 NaN
6 23 12 NaN NaN
7 19 4 1 NaN
8 18 5 1 1
I tried using .str.split("-")
which just made things more complicated for me. Then I tried making a for
loop with a bunch of if
statements to try and filter out some of the ore complicated records but failed miserably at that. Does anyone have any ideas as to what I could do? Thanks so much!
# So you can copy and paste the data in
import pandas as pd
data = {'record': ['22–8–1', '18–7–1', '12–4', '8–2 (1 NC)', '23–9–1', '23–12', '19–4–1', '18–5–1 (1 NC)']}
df = pd.DataFrame(data)
CodePudding user response:
This is a job for pandas.Series.str.extract()
:
# Fix em-dashes
df['record'] = df['record'].str.replace('–', '-')
new_df = df['record'].str.extract(r'^(?P<wins>\d )-(?P<loses>\d )(?:-(?P<draws>\d ))?\s*(?:\((?P<no_contests>\d ) NC\))?')
Output:
>>> new_df
wins loses draws no_contests
0 22 8 1 NaN
1 18 7 1 NaN
2 12 4 NaN NaN
3 8 2 NaN 1
4 23 9 1 NaN
5 23 12 NaN NaN
6 19 4 1 NaN
7 18 5 1 1