I am trying to extract whatever appears after the word SUITE, suite, ste, Ste. or the character # in a pandas dataframe column. The examples are like below
4230 Harding Pike Ste 435
4230 Harding Pike Suite 435
4230 Harding Pike SUITE A
4230 Harding Pike Ste. 101
4230 Harding Pike SUITE B-200
4230 Harding Pike #900
4230 Harding Pike STE 503
4230 Harding Pike SUITE 300
4230 Harding Pike Ste 700
4230 Harding Pike SUITE #2
Result:
SuiteNos
435
435
A
101
B-200
900
503
300
700
2
I tried with the below regexes but it does not work as expected -
df_merged['address'].str.extract(r'\**\**suite.*\**\**')
df_merged['address'] = re.findall('@([suite] )', df_merged['address'])
CodePudding user response:
Using a case insensitive match, you could use a capture group with an alternation:
(?i)(?:#|Suite|ste\.?)\s*([^\s#].*)
The pattern matches:
(?:
Non capture groupSuite
Match literally|
Orste\.?
Match ste with an optional dot|
Or#
Match literally
)
close non capture group\s*
Match optional whitspace chars([^\s#].*)
Capture group 1, start with matching a non whitespace char without # and the rest of the line
import pandas as pd
strings = [
"4230 Harding Pike Ste 435",
"4230 Harding Pike Suite 435",
"4230 Harding Pike SUITE A",
"4230 Harding Pike Ste. 101",
"4230 Harding Pike SUITE B-200",
"4230 Harding Pike #900",
"4230 Harding Pike STE 503",
"4230 Harding Pike SUITE 300",
"4230 Harding Pike Ste 700",
"4230 Harding Pike SUITE #2"
]
pattern = r"(?i)(?:#|Suite|ste\.?)\s*([^\s#].*)"
df_merged = pd.DataFrame(strings, columns = ['address'])
df_merged['SuiteNos'] = df_merged['address'].str.extract(pattern)
print(df_merged["SuiteNos"])
Output
0 435
1 435
2 A
3 101
4 B-200
5 900
6 503
7 300
8 700
9 2
CodePudding user response:
With your shown samples, please try following regex.
(?:(?:[Ss](?:[uU][iI])?[tT][eE]\.?\s )#?|#)(\S )
In Pandas run following code:
df['value'].str.extract(r'(?:(?:[Ss](?:[uU][iI])?[tT][eE]\.?\s )#?|#)(\S )')
Output from above code in Pandas with samples provided by OP will be as follows:
0 435
1 435
2 A
3 101
4 B-200
5 NaN
6 503
7 300
8 700
9 2
Name: value, dtype: object
Explanation: Adding detailed explanation for above.
(?: ##Starting a non-capturing group here.
(?: ##Starting another non-capturing group here.
[Ss] ##Matching small OR capital S here.
(?:[uU][iI])? ##In a non-capturing group matching u/U i/I and keep this optional.
[tT][eE]\.?\s ##Matching t or T followed by e or E here followed by optional dot and 1 or more spaces here.
) ##Closing 2nd non-capturing group here.
#?|# ##Matching # keeping it optional OR matching it no optional.
) ##Closing 1st non-capturing group here.
(\S ) ##Creating 1st capturing group which contains all non-spaces in it.
Thanks to @The fourth bird for helping in tweak to make only 1 capturing group to catch all matched things.