Extract Numeric info from Pandas column using regex-CodePudding

I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:

Text
Dimensions: `23"/60`
Dimensions: `23" / 60`
Dimensions: `48"`
Dimensions: `22.5X8.25`
Dimensions: `80IN`
Dimensions: `567 S`
Dimensions: `22.5X8.25`
Dimensions: `26INNP`
Dimensions: `24" x 55"` with pipe `16 x 7`

I am using regex and is as follows: regex = r"(\d([^\s]*)\s.\s\d*[^\s])|(\d([^\s])*)"

I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with '_'

df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))

Can someone tell me what I am doing wrong?

Edit:

Specifying rules to the pattern to solidify the regex:

Sentences containing words with numbers should be extracted (for sentence short A20-6014 complete, A20-6014 should be extracted)
Sentences beginning with words with numbers should be extracted (for sentence A20-6014 short complete, A20-6014 should be extracted)
Sentences ending with words with numbers should be extracted (for sentence short complete A20-6014, A20-6014 should be extracted)
Two letter words in between words containing numbers should be considered (ie. for sentence ELBOW- 2.5 IN 90 SIL, 2.5 IN 90 should be extracted)
Greater than 2 letter words in between words containing numbers should extracted separately and joined with a _. This also applies for alphanumeric words occurring at two different places in a sentence (ie. for sentence ELBOW- 2.5 IND 90 SIL, 2.5_90 should be extracted. )
Another Test Case: for a sentence 90 CONN 3/4 ST-#8 FL, 90_3/4_ST-#8 should be extracted

CodePudding user response：

You can use

regex = r'\d (?:\.\d )?"?(?:\s*[Xx/.]\s*\d (?:\.\d )?"?)?(?:\s*[A-Z] \b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')

See the regex demo.

Details:

\d (?:\.\d )? - an int or float value
"? - an optional " char
(?:\s*[Xx/.]\s*\d (?:\.\d )?"?)? - an optional sequence of
- \s*[Xx/.]\s* - X, x, / or . enclosed with zero or more whitespaces
- \d (?:\.\d )? - an int or float value
- "? - an optional " char
(?:\s*[A-Z] \b)? - an optional sequence of
- \s* - zero or more whitespaces
- [A-Z] - one or more ASCII uppercase letters
- \b - a word boundary.

Here is a Pandas test:

import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'\d (?:\.\d )?"?(?:\s*[Xx/.]\s*\d (?:\.\d )?"?)?(?:\s*[A-Z] \b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')

Output:

>>> df
                                     Text      Numeric_Info
0                            No data here                  
1                      Dimensions: 23"/60            23"/60
2                    Dimensions: 23" / 60          23" / 60
3                         Dimensions: 48"               48"
4                   Dimensions: 22.5X8.25         22.5X8.25
5                        Dimensions: 80IN              80IN
6                       Dimensions: 567 S             567 S
7                   Dimensions: 22.5X8.25         22.5X8.25
8                      Dimensions: 26INNP            26INNP
9  Dimensions: 24" x 55" with pipe 16 x 7  24" x 55"_16 x 7