I really struggle with regex, and I'm hoping for some help.
I have columns that look like this
import pandas as pd
data = {'Location': ['Building A, 100 First St City, State', 'Fire Station # 100, 2 Apple Row, City, State Zip', 'Church , 134 Baker Rd City, State']}
df = pd.DataFrame(data)
Location
0 Building A, 100 First St City, State
1 Fire Station # 100, 2 Apple Row, City, State Zip
2 Church , 134 Baker Rd City, State
I would like to get it to the code chunk below by splitting anytime there is a comma followed by space and then a number. However, I'm running into an issue where I'm removing the number.
Location Name Address
0 Building A 100 First St City, State
1 Fire Station # 100 2 Apple Row, City, State, Zip
2 Church 134 Baker Rd City, State
This is the code I've been using
df['Location Name']= df['Location'].str.split('.,\s\d', expand=True)[0]
df['Address']= df['Location'].str.split('.,\s\d', expand=True)[1]
CodePudding user response:
You can use Series.str.extract
:
df[['Location Name','Address']] = df['Location'].str.extract(r'^(.*?),\s(\d.*)', expand=True)
The ^(.*?),\s(\d.*)
regex matches
^
- start of string(.*?)
- Group 1 ('Location Name'): any zero or more chars other than line break chars as few as possible,\s
- comma and whitespace(\d.*)
- Group 1 ('Address'): digit and the rest of the line.
See the regex demo.
CodePudding user response:
Another simple solution to your problem is to use a positive lookahead. You want to check if there is a number ahead of your pattern, while not including the number in the match. Here's an example of a regex that solves your problem:
\s?,\s(?=\d)
Here, we optionally remove a trailing whitespace, then match a comma followed by whitespace.
The (?= )
is a positive lookahead, in this case we check for a following digit. If that's matched, the split will remove the comma and whitespace only.