Use Python regex to replace dataframe column values with decimal part of string

I have a dataframe with a column df['gravidityAndParity'] that contains string values like so:

0      g4p3
1      g2p0
2      g7p2
3      g2p0
4      g7p6

The number after 'g' is the gravidity and the number after 'p' is the parity. I am trying to split this column into two columns: df['gravidity'] and df['parity'].

So the output I am after is:

print(df['gravidity'])

print(df['parity'])

I defined a function using regex to do this, but the function is not working correctly.

Here is my code so far:


import regex as re 
  
# Function to split gravidity and parity
def Split_gravidity_parity(gravidityAndParity):
    match_gravidity = re.search(r'g(\d+)', gravidityAndParity)
    if match_gravidity:
        df['gravidity']= match_gravidity.group(1)
        
    match_parity = re.search(r'p(\d+)', gravidityAndParity)
    if match_parity:
        df['parity']= match_parity.group(1)

Applying the function to the column:


df['gravidityAndParity'].apply(Split_gravidity_parity)

print(df['gravidity'])

print(df['parity'])

The function seems to be only partially working: it appears to be applied only to the first value in the column, 'g4p3'.

How can I apply this regex function correctly to all values in the column and output the results in two new columns, 'gravidity' and 'parity'?

CodePudding user response:

You can use Series.str.extract, which applies a regular expression (in the syntax of Python's built-in re module) to every value in the column:

import pandas as pd
df=pd.DataFrame({'gravidityAndParity':['g4p3','g2p0','g7p2','g2p0','g7p6']})
df[['gravity','parity']] = df['gravidityAndParity'].str.extract(r'g(\d+)p(\d+)')
# => >>> df
#       gravidityAndParity gravity parity
#     0               g4p3       4      3
#     1               g2p0       2      0
#     2               g7p2       7      2
#     3               g2p0       2      0
#     4               g7p6       7      6

The g(\d+)p(\d+) pattern captures one or more digits after g into Group 1 (the "gravity" column), then matches p and captures one or more digits into Group 2 (the "parity" column). Note that the extracted values are strings, not numbers.
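
As for why the function in the question misbehaves: an assignment such as df['gravidity'] = match_gravidity.group(1) inside the function writes a single value to the whole column on every call, and nothing is returned to apply(). Below is a minimal sketch of two ways around that, assuming every value has the form g<digits>p<digits>. The column names follow the question ('gravidity', 'parity'); the names pattern, extracted, applied and split_gravidity_parity are illustrative only, and the function is a hypothetical rewrite, not the asker's original.

```python
import re

import pandas as pd

df = pd.DataFrame({'gravidityAndParity': ['g4p3', 'g2p0', 'g7p2', 'g2p0', 'g7p6']})

# Named groups (?P<name>...) become the column names of the extracted frame.
pattern = r'g(?P<gravidity>\d+)p(?P<parity>\d+)'

# Vectorised version: str.extract returns strings, so convert afterwards.
# Assumption: every row matches; a non-matching row would yield NaN and
# astype(int) would then raise.
extracted = df['gravidityAndParity'].str.extract(pattern).astype(int)


def split_gravidity_parity(value):
    """Return both numbers for one cell instead of assigning to df."""
    match = re.fullmatch(pattern, value)
    # groupdict() maps each group name to its captured text.
    return pd.Series(match.groupdict()).astype(int)


# Row-wise version: apply() with a function that returns a Series
# builds a DataFrame with one column per Series label.
applied = df['gravidityAndParity'].apply(split_gravidity_parity)

df[['gravidity', 'parity']] = extracted
print(df)
```

Both versions should give the same two integer columns here. The str.extract form is usually preferable because it avoids calling a Python function once per row. If some rows might not match, pd.to_numeric(..., errors='coerce') is one option for the conversion step in place of astype(int).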