How to extract substring between parentheses that start with a digit, and has multiple sets of paren

Time:06-16

My goal is to extract the substring between a set of parentheses, but only if it starts with a digit. Several of the strings will have multiple sets of parentheses but only one will contain a string that starts with a digit.

Currently, it extracts everything between the first opening parenthesis and the last closing one, rather than treating them as two separate sets.

As for matching only the parentheses whose contents start with a digit, I am lost as to how to even approach this.

Any help is appreciated.

import pandas as pd

cols = ['a', 'b']
data = [
    ['xyz - (4 inch), (four inch)', 'abc'],
    ['def', 'ghi'],
    ['xyz - ( 5.5 inch), (five inch)', 'abc'],
]
df = pd.DataFrame(data=data, columns=cols)
df['c'] = df['a'].str.extract(r"\((.*)\)")

Desired output:

                                a    b       c
0     xyz - (4 inch), (four inch)  abc  4 inch
1                             def  ghi     NaN
2  xyz - ( 5.5 inch), (five inch)  abc     NaN

current output:

                                a    b                       c
0     xyz - (4 inch), (four inch)  abc     4 inch), (four inch
1                             def  ghi                     NaN
2  xyz - ( 5.5 inch), (five inch)  abc   5.5 inch), (five inch

CodePudding user response:

The following pattern should do the job: \((\d[^.)]+)\)

What it does:

  • Matches the literal character '('.
  • Starts the capture group with a single digit, then captures one or more characters that are neither ')' nor '.'.
  • Ends the capture group.
  • Matches the literal character ')'.
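
To see why the original pattern over-matches and this one does not, here is a minimal sketch using only the standard re module. The helper name first_group is made up for illustration, and the answer's pattern is assumed to carry a + quantifier after the character class.

```python
import re

# Pattern from the question: '.*' is greedy, so it runs to the LAST ')'.
GREEDY = re.compile(r"\((.*)\)")
# Pattern from this answer (assumed '+' quantifier): a digit right after '(',
# then characters that are neither ')' nor '.', so it stops at the FIRST ')'.
DIGIT_FIRST = re.compile(r"\((\d[^.)]+)\)")


def first_group(pattern, text):
    """Return the first capture group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


samples = [
    "xyz - (4 inch), (four inch)",
    "def",
    "xyz - ( 5.5 inch), (five inch)",
]

for text in samples:
    print(repr(text), "|", first_group(GREEDY, text), "|", first_group(DIGIT_FIRST, text))
```

For the first sample the greedy pattern returns '4 inch), (four inch', while the digit-first pattern returns '4 inch'. The third sample yields None with the digit-first pattern because a space, not a digit, follows the opening parenthesis.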

You can see a detailed explanation on regex101

Final code:

import pandas as pd

cols = ['a', 'b']
data = [
    ['xyz - (4 inch), (four inch)', 'abc'],
    ['def', 'ghi'],
    ['xyz - ( 5.5 inch), (five inch)', 'abc'],
]
df = pd.DataFrame(data=data, columns=cols)
df['c'] = df['a'].str.extract(r"\((\d[^.)]+)\)")

print(df)

Output generated:

                                a    b       c
0     xyz - (4 inch), (four inch)  abc  4 inch
1                             def  ghi     NaN
2  xyz - ( 5.5 inch), (five inch)  abc     NaN
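
One caveat: because the character class excludes '.', a value such as '(5.5 inch)' with no leading space would not match either. If decimals should be captured, a possible variant is sketched below with the standard re module. The pattern \((\d[^)]*)\) is a suggestion of mine, not something from the question, and the names DECIMAL_OK and extract_digit_group are hypothetical.

```python
import re

# Hypothetical variant: a digit right after '(', then anything up to the
# first ')'. Dots are allowed, and a lone digit such as '(4)' also matches.
DECIMAL_OK = re.compile(r"\((\d[^)]*)\)")


def extract_digit_group(text):
    """Return the first parenthesized group starting with a digit, else None."""
    match = DECIMAL_OK.search(text)
    return match.group(1) if match else None


print(extract_digit_group("xyz - (4 inch), (four inch)"))     # 4 inch
print(extract_digit_group("xyz - (5.5 inch), (five inch)"))   # 5.5 inch
print(extract_digit_group("xyz - ( 5.5 inch), (five inch)"))  # None
```

The same pattern string should work with df['a'].str.extract, and it still produces NaN for the row whose parentheses begin with a space, as in the desired output.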