I have a dataframe with a column df['gravidityAndParity'] that contains string values like so:
0 g4p3
1 g2p0
2 g7p2
3 g2p0
4 g7p6
The number after 'g' is the gravidity and the number after 'p' is the parity. I am trying to split this column into two columns: df['gravidity'] and df['parity'].
So the output I am after is:
print(df['gravidity'])
0 4
1 2
2 7
3 2
4 7
print(df['parity'])
0 3
1 0
2 2
3 0
4 6
I defined a function using regex to do this, but the function is not working correctly.
Here is my code so far:
import regex as re
# Function to clean the names
def Split_gravidity_parity(gravidityAndParity):
    match_gravidity = re.search('g(\d+)', gravidityAndParity)
    if match_gravidity:
        df['gravidity']= match_gravidity.group(1)
    match_parity = re.search('p(\d+)', gravidityAndParity)
    if match_parity:
        df['parity']= match_parity.group(1)
Applying the function to the column:
df['gravidityAndParity'].apply(Split_gravidity_parity)
print(df['gravidity'])
0 4
1 4
2 4
3 4
4 4
print(df['parity'])
0 3
1 3
2 3
3 3
4 3
The function seems to be partially working, as it only seems to be applied to the first value in the column 'g4p3'.
Any help with how I can implement this regex function correctly to all values in the column and output the results in two new columns 'gravidity' and 'parity'?
CodePudding user response:
You can use Series.str.extract, which relies on the built-in re module, so no custom function is needed:
import pandas as pd
df = pd.DataFrame({'gravidityAndParity': ['g4p3', 'g2p0', 'g7p2', 'g2p0', 'g7p6']})
df[['gravidity', 'parity']] = df['gravidityAndParity'].str.extract(r'g(\d+)p(\d+)')
# => >>> df
#   gravidityAndParity gravidity parity
# 0               g4p3         4      3
# 1               g2p0         2      0
# 2               g7p2         7      2
# 3               g2p0         2      0
# 4               g7p6         7      6
The g(\d+)p(\d+) pattern matches g, captures one or more digits into Group 1 (the "gravidity" column), then matches p and captures one or more digits into Group 2 (the "parity" column).
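Two follow-up points may be worth a sketch. First, str.extract returns strings, so if numeric columns are wanted, a cast is needed. Second, the function in the question assigns a single scalar to the whole df column on every call and returns nothing, which is why every row ends up with the same value. The snippet below is only an illustration under the assumption that every row matches the pattern; the names parts, split_gravidity_parity and via_apply are made up for this example.

```python
import re

import pandas as pd

df = pd.DataFrame({'gravidityAndParity': ['g4p3', 'g2p0', 'g7p2', 'g2p0', 'g7p6']})

# Named groups become the column names of the extracted DataFrame.
parts = df['gravidityAndParity'].str.extract(r'g(?P<gravidity>\d+)p(?P<parity>\d+)')

# str.extract yields strings; cast to int (assumes no row failed to match).
df[['gravidity', 'parity']] = parts.astype(int)


def split_gravidity_parity(value):
    """Return the two numbers for ONE string instead of writing to df."""
    match = re.search(r'g(\d+)p(\d+)', value)
    if match is None:
        # No match: leave both fields empty for this row.
        return pd.Series({'gravidity': None, 'parity': None})
    return pd.Series({'gravidity': int(match.group(1)),
                      'parity': int(match.group(2))})


# When the applied function returns a Series, Series.apply builds a DataFrame
# with one row per input value.
via_apply = df['gravidityAndParity'].apply(split_gravidity_parity)

print(df)
print(via_apply)
```

The str.extract version is usually the simpler choice; the apply version is shown only to make clear how the original function could be reshaped to return per-row values.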