Home > OS >  Extract words based on regex
Extract words based on regex

Time:11-12

I am trying to extract whatever appears after the word SUITE, suite, ste, Ste. or the character # in a pandas dataframe column. The examples are like below

4230 Harding Pike Ste 435
4230 Harding Pike Suite 435
4230 Harding Pike SUITE A
4230 Harding Pike Ste. 101
4230 Harding Pike SUITE B-200
4230 Harding Pike #900
4230 Harding Pike STE 503
4230 Harding Pike SUITE 300
4230 Harding Pike Ste 700
4230 Harding Pike SUITE #2

Result:

SuiteNos

435
435  
A
101
B-200
900
503
300
700
2

I tried with the below regexes but it does not work as expected -

 df_merged['address'].str.extract(r'\**\**suite.*\**\**')
 df_merged['address'] = re.findall('@([suite] )', df_merged['address'])

CodePudding user response:

Using a case insensitive match, you could use a capture group with an alternation:

(?i)(?:#|Suite|ste\.?)\s*([^\s#].*)

The pattern matches:

  • (?: Non capture group
    • Suite Match literally
    • | Or
    • ste\.? Match ste with an optional dot
    • | Or
    • # Match literally
  • ) close non capture group
  • \s* Match optional whitspace chars
  • ([^\s#].*) Capture group 1, start with matching a non whitespace char without # and the rest of the line

Regex demo

import pandas as pd

strings = [
    "4230 Harding Pike Ste 435",
    "4230 Harding Pike Suite 435",
    "4230 Harding Pike SUITE A",
    "4230 Harding Pike Ste. 101",
    "4230 Harding Pike SUITE B-200",
    "4230 Harding Pike #900",
    "4230 Harding Pike STE 503",
    "4230 Harding Pike SUITE 300",
    "4230 Harding Pike Ste 700",
    "4230 Harding Pike SUITE #2"
]

pattern = r"(?i)(?:#|Suite|ste\.?)\s*([^\s#].*)"
df_merged = pd.DataFrame(strings, columns = ['address'])
df_merged['SuiteNos'] = df_merged['address'].str.extract(pattern)

print(df_merged["SuiteNos"])

Output

0      435
1      435
2        A
3      101
4    B-200
5      900
6      503
7      300
8      700
9        2

CodePudding user response:

With your shown samples, please try following regex.

(?:(?:[Ss](?:[uU][iI])?[tT][eE]\.?\s )#?|#)(\S )

In Pandas run following code:

df['value'].str.extract(r'(?:(?:[Ss](?:[uU][iI])?[tT][eE]\.?\s )#?|#)(\S )')

Output from above code in Pandas with samples provided by OP will be as follows:

0      435
1      435
2        A
3      101
4    B-200
5      NaN
6      503
7      300
8      700
9        2
Name: value, dtype: object

Online demo for above regex

Explanation: Adding detailed explanation for above.

(?:                   ##Starting a non-capturing group here.
  (?:                 ##Starting another non-capturing group here.
     [Ss]             ##Matching small OR capital S here.
     (?:[uU][iI])?    ##In a non-capturing group matching u/U i/I and keep this optional.
     [tT][eE]\.?\s    ##Matching t or T followed by e or E here followed by optional dot and 1 or more spaces here.
  )                   ##Closing 2nd non-capturing group here.
  #?|#                ##Matching # keeping it optional OR matching it no optional.
)                     ##Closing 1st non-capturing group here.
(\S )                 ##Creating 1st capturing group which contains all non-spaces in it.

Thanks to @The fourth bird for helping in tweak to make only 1 capturing group to catch all matched things.

  • Related