I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:
Text |
---|
Dimensions: 23"/60 |
Dimensions: 23" / 60 |
Dimensions: 48" |
Dimensions: 22.5X8.25 |
Dimensions: 80IN |
Dimensions: 567 S |
Dimensions: 22.5X8.25 |
Dimensions: 26INNP |
Dimensions: 24" x 55" with pipe 16 x 7 |
I am using regex and is as follows: regex = r"(\d([^\s]*)\s.\s\d*[^\s])|(\d([^\s])*)"
I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with '_'
df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))
Can someone tell me what I am doing wrong?
Edit:
Specifying rules to the pattern to solidify the regex:
- Sentences containing words with numbers should be extracted (for sentence
short A20-6014 complete
,A20-6014
should be extracted) - Sentences beginning with words with numbers should be extracted (for sentence
A20-6014 short complete
,A20-6014
should be extracted) - Sentences ending with words with numbers should be extracted (for sentence
short complete A20-6014
,A20-6014
should be extracted) - Two letter words in between words containing numbers should be considered (ie. for sentence
ELBOW- 2.5 IN 90 SIL
,2.5 IN 90
should be extracted) - Greater than 2 letter words in between words containing numbers should extracted separately and joined with a
_
. This also applies for alphanumeric words occurring at two different places in a sentence (ie. for sentenceELBOW- 2.5 IND 90 SIL
,2.5_90
should be extracted. ) - Another Test Case: for a sentence
90 CONN 3/4 ST-#8 FL
,90_3/4_ST-#8
should be extracted
CodePudding user response:
You can use
regex = r'\d (?:\.\d )?"?(?:\s*[Xx/.]\s*\d (?:\.\d )?"?)?(?:\s*[A-Z] \b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')
See the regex demo.
Details:
\d (?:\.\d )?
- an int or float value"?
- an optional"
char(?:\s*[Xx/.]\s*\d (?:\.\d )?"?)?
- an optional sequence of\s*[Xx/.]\s*
-X
,x
,/
or.
enclosed with zero or more whitespaces\d (?:\.\d )?
- an int or float value"?
- an optional"
char
(?:\s*[A-Z] \b)?
- an optional sequence of\s*
- zero or more whitespaces[A-Z]
- one or more ASCII uppercase letters\b
- a word boundary.
Here is a Pandas test:
import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'\d (?:\.\d )?"?(?:\s*[Xx/.]\s*\d (?:\.\d )?"?)?(?:\s*[A-Z] \b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')
Output:
>>> df
Text Numeric_Info
0 No data here
1 Dimensions: 23"/60 23"/60
2 Dimensions: 23" / 60 23" / 60
3 Dimensions: 48" 48"
4 Dimensions: 22.5X8.25 22.5X8.25
5 Dimensions: 80IN 80IN
6 Dimensions: 567 S 567 S
7 Dimensions: 22.5X8.25 22.5X8.25
8 Dimensions: 26INNP 26INNP
9 Dimensions: 24" x 55" with pipe 16 x 7 24" x 55"_16 x 7