I am extracting a substring from an Excel cell and the entire string says this:
The bolts are 5" long each and 3" apart
I want to extract the length of the bolt which is 5"
. And I use the following code to get that
df['Bolt_Length'] = df['Description'].str.extract(r'(\s[0-9]")',expand=False)
But if the string says the following:
The bolts are 10" long each and 3" apart
and I try to use to the following code:
df['Bolt_Length'] = df['Description'].str.extract(r'(\s(\d{1,2})")',expand=False)
I get the following error message:
ValueError: Columns must be same length as key
I think Python doesn't know which number to acquire. The 10"
or 3"
How can I fix this? How do I tell Python to only go for the first "
?
On another note what if I want to get both the bolt length and distance from another bolt? How do I extract the two at the same time?
CodePudding user response:
Your problem is that you have two capture groups in your second regular expression (\s(\d{1,2})")
, not one. So basically, you're telling Python to get the number with the "
, and the same number without the "
:
>>> df['Description'].str.extract(r'(\s(\d{1,2})")', expand=False)
0 1
0 5" 5
1 10" 10
You can add ?:
right after the opening parenthesis of a group to make it so that it does not capture anything, though it still functions as a group. The following makes it so that the inner group, which excludes the "
, does not capture:
# notice vv
>>> df['Description'].str.extract(r'(\s(?:\d{1,2})")', expand=False)
0 5"
1 10"
Name: Description, dtype: object
CodePudding user response:
The error occurs because your regex contains two capturing groups, that extract two column values, BUT you assign those to a single column, df['Bolt_Length']
.
You need to use as many capturing groups in the regex pattern as there are columns you assign the values to:
df['Bolt_Length'] = df['Description'].str.extract(r'\s(\d{1,2})"',expand=False)
The \s(\d{1,2})"
regex only contains one pair of unescaped parentheses that form a capturing group, so this works fine since this single value is assigned to a single Bolt_Length
column.