Use python regex to replace dataframe column values with decimal part of string

Time:10-23

I have a dataframe with a column df['gravidityAndParity'] that contains string values like so:

0      g4p3
1      g2p0
2      g7p2
3      g2p0
4      g7p6

The number after 'g' is the gravidity and the number after 'p' is the parity. I am trying to split this column into two columns: df['gravidity'] and df['parity'].

So the output I am after is:

print(df['gravidity']) 
0      4
1      2
2      7
3      2
4      7
print(df['parity'])
0      3
1      0
2      2
3      0
4      6

I defined a function using regex to do this, but the function is not working correctly.

Here is my code so far:


import regex as re

# Function to split gravidity and parity
def Split_gravidity_parity(gravidityAndParity):
    match_gravidity = re.search(r'g(\d+)', gravidityAndParity)
    if match_gravidity:
        df['gravidity'] = match_gravidity.group(1)

    match_parity = re.search(r'p(\d+)', gravidityAndParity)
    if match_parity:
        df['parity'] = match_parity.group(1)

Applying the function to the column:


df['gravidityAndParity'].apply(Split_gravidity_parity)

print(df['gravidity'])
0      4
1      4
2      4
3      4
4      4
print(df['parity'])
0      3
1      3
2      3
3      3
4      3

The function only partially works: every row ends up with the values extracted from the first entry in the column, 'g4p3'.

How can I apply this regex correctly to all values in the column and output the results in two new columns, 'gravidity' and 'parity'?

User response:

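You can use Series.str.extract, which relies on Python's built-in re module, applies the pattern to every row, and returns one column per capture group: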

import pandas as pd
df = pd.DataFrame({'gravidityAndParity': ['g4p3', 'g2p0', 'g7p2', 'g2p0', 'g7p6']})
df[['gravidity', 'parity']] = df['gravidityAndParity'].str.extract(r'g(\d+)p(\d+)')
# => >>> df
#       gravidityAndParity gravidity parity
#     0               g4p3         4      3
#     1               g2p0         2      0
#     2               g7p2         7      2
#     3               g2p0         2      0
#     4               g7p6         7      6

The g(\d+)p(\d+) pattern captures one or more digits after g into Group 1 (the gravidity value), then matches p and captures one or more digits into Group 2 (the parity value).
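
As for why the original function misbehaves: inside the function, an assignment such as df['gravidity'] = match_gravidity.group(1) writes a single scalar to the whole column on every call, so all rows end up with the same value. The sketch below is one possible variation on the extract approach, not the only way to do it. It uses named groups so the extracted columns are labelled by the pattern itself, and converts the results to integers. It assumes every row matches the g<digits>p<digits> shape; rows that do not match would come back as NaN and would need handling before the integer conversion.

```python
import pandas as pd

df = pd.DataFrame({'gravidityAndParity': ['g4p3', 'g2p0', 'g7p2', 'g2p0', 'g7p6']})

# Named groups become the column names of the frame returned by str.extract.
parts = df['gravidityAndParity'].str.extract(r'g(?P<gravidity>\d+)p(?P<parity>\d+)')

# str.extract returns strings; convert to int for numeric work.
# Assumption: every row matched, so there are no NaN values to deal with.
df[['gravidity', 'parity']] = parts.astype(int)

print(df)
```

With integer columns, comparisons and arithmetic such as df['gravidity'] - df['parity'] work directly, which they would not on the extracted strings.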
