My goal is to extract the substring between a set of parentheses, but only if it starts with a digit. Several of the strings will have multiple sets of parentheses but only one will contain a string that starts with a digit.
Currently, it extracts everything between the first opening parenthesis and the last closing one, rather than treating them as two separate sets.
As far as only using the parentheses with a substring that starts with a digit, I am lost as to how to even approach this.
Any help is appreciated.
import pandas as pd
cols = ['a', 'b']
data = [
['xyz - (4 inch), (four inch)', 'abc'],
['def', 'ghi'],
['xyz - ( 5.5 inch), (five inch)', 'abc'],
]
df = pd.DataFrame(data=data, columns=cols)
df['c'] = df['a'].str.extract("\((.*)\)")
Desired output:
                                a    b       c
0     xyz - (4 inch), (four inch)  abc  4 inch
1                             def  ghi     NaN
2  xyz - ( 5.5 inch), (five inch)  abc     NaN
Current output:
                                a    b                       c
0     xyz - (4 inch), (four inch)  abc     4 inch), (four inch
1                             def  ghi                     NaN
2  xyz - ( 5.5 inch), (five inch)  abc   5.5 inch), (five inch
Answer:
The following pattern should do the job: \((\d[^.)]+)\)
What it does:
- Matches the character '('
- Starts capturing: one digit, followed by one or more characters that are neither ')' nor '.'
- Ends capturing.
- Matches the character ')'
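To check those steps outside pandas, here is a minimal sketch using the standard `re` module on the sample strings from the question. It assumes the character class is followed by a `+` quantifier (one or more characters); the names `PATTERN` and `first_digit_group` are made up for illustration.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged.
# Assumption: the character class [^.)] is repeated with "+".
PATTERN = re.compile(r"\((\d[^.)]+)\)")

def first_digit_group(text):
    """Return the first parenthesised substring that starts with a digit, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

samples = [
    "xyz - (4 inch), (four inch)",
    "def",
    "xyz - ( 5.5 inch), (five inch)",
]
results = [first_digit_group(s) for s in samples]
print(results)  # ['4 inch', None, None]
```

Note that because '.' is excluded from the character class, a string such as "(5.5 inch)" without the leading space would not match either. If decimals should be accepted, dropping the '.' from the class is one option.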
You can see a detailed explanation on regex101
Final code:
import pandas as pd
cols = ['a', 'b']
data = [
['xyz - (4 inch), (four inch)', 'abc'],
['def', 'ghi'],
['xyz - ( 5.5 inch), (five inch)', 'abc'],
]
df = pd.DataFrame(data=data, columns=cols)
df['c'] = df['a'].str.extract(r"\((\d[^.)]+)\)")
print(df)
Output generated:
                                a    b       c
0     xyz - (4 inch), (four inch)  abc  4 inch
1                             def  ghi     NaN
2  xyz - ( 5.5 inch), (five inch)  abc     NaN
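As for why the pattern in the question captured everything from the first '(' to the last ')': `.*` is greedy, so it consumes as much as it can and only backs off to the final closing parenthesis. The sketch below compares it with a negated character class and a lazy quantifier on one of the sample strings, using plain `re`. The variable names are only illustrative.

```python
import re

text = "xyz - (4 inch), (four inch)"

# Greedy: ".*" runs to the end of the string, then backtracks to the LAST ")".
greedy = re.search(r"\((.*)\)", text).group(1)

# Negated class: "[^)]*" cannot cross a ")", so it stops at the first one.
bounded = re.search(r"\(([^)]*)\)", text).group(1)

# Lazy: ".*?" expands only as far as needed to reach the first ")".
lazy = re.search(r"\((.*?)\)", text).group(1)

# findall with the bounded pattern sees the two groups separately.
all_groups = re.findall(r"\(([^)]*)\)", text)

print(greedy)      # 4 inch), (four inch
print(bounded)     # 4 inch
print(lazy)        # 4 inch
print(all_groups)  # ['4 inch', 'four inch']
```

Either the bounded or the lazy form keeps the two sets of parentheses apart; requiring `\d` right after the '(' then selects the one that starts with a digit.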