Home > Back-end >  Extract Numeric info from Pandas column using regex
Extract Numeric info from Pandas column using regex

Time:10-08

I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:

Text
Dimensions: 23"/60
Dimensions: 23" / 60
Dimensions: 48"
Dimensions: 22.5X8.25
Dimensions: 80IN
Dimensions: 567 S
Dimensions: 22.5X8.25
Dimensions: 26INNP
Dimensions: 24" x 55" with pipe 16 x 7

I am using regex and is as follows: regex = r"(\d([^\s]*)\s.\s\d*[^\s])|(\d([^\s])*)"

I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with '_'

df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))

Can someone tell me what I am doing wrong?

Edit:

Specifying rules to the pattern to solidify the regex:

  • Sentences containing words with numbers should be extracted (for sentence short A20-6014 complete, A20-6014 should be extracted)
  • Sentences beginning with words with numbers should be extracted (for sentence A20-6014 short complete, A20-6014 should be extracted)
  • Sentences ending with words with numbers should be extracted (for sentence short complete A20-6014, A20-6014 should be extracted)
  • Two letter words in between words containing numbers should be considered (ie. for sentence ELBOW- 2.5 IN 90 SIL, 2.5 IN 90 should be extracted)
  • Greater than 2 letter words in between words containing numbers should extracted separately and joined with a _. This also applies for alphanumeric words occurring at two different places in a sentence (ie. for sentence ELBOW- 2.5 IND 90 SIL, 2.5_90 should be extracted. )
  • Another Test Case: for a sentence 90 CONN 3/4 ST-#8 FL, 90_3/4_ST-#8 should be extracted

CodePudding user response:

You can use

regex = r'\d (?:\.\d )?"?(?:\s*[Xx/.]\s*\d (?:\.\d )?"?)?(?:\s*[A-Z] \b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')

See the regex demo.

Details:

  • \d (?:\.\d )? - an int or float value
  • "? - an optional " char
  • (?:\s*[Xx/.]\s*\d (?:\.\d )?"?)? - an optional sequence of
    • \s*[Xx/.]\s* - X, x, / or . enclosed with zero or more whitespaces
    • \d (?:\.\d )? - an int or float value
    • "? - an optional " char
  • (?:\s*[A-Z] \b)? - an optional sequence of
    • \s* - zero or more whitespaces
    • [A-Z] - one or more ASCII uppercase letters
    • \b - a word boundary.

Here is a Pandas test:

import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'\d (?:\.\d )?"?(?:\s*[Xx/.]\s*\d (?:\.\d )?"?)?(?:\s*[A-Z] \b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')

Output:

>>> df
                                     Text      Numeric_Info
0                            No data here                  
1                      Dimensions: 23"/60            23"/60
2                    Dimensions: 23" / 60          23" / 60
3                         Dimensions: 48"               48"
4                   Dimensions: 22.5X8.25         22.5X8.25
5                        Dimensions: 80IN              80IN
6                       Dimensions: 567 S             567 S
7                   Dimensions: 22.5X8.25         22.5X8.25
8                      Dimensions: 26INNP            26INNP
9  Dimensions: 24" x 55" with pipe 16 x 7  24" x 55"_16 x 7
  • Related