Split Column on regex-CodePudding

I really struggle with regex, and I'm hoping for some help.

I have columns that look like this

import pandas as pd

data = {'Location': ['Building A, 100 First St City, State', 'Fire Station # 100, 2 Apple Row, City, State Zip', 'Church , 134 Baker Rd City, State']}

df = pd.DataFrame(data)

                                          Location
0              Building A, 100 First St City, State
1  Fire Station # 100, 2 Apple Row, City, State Zip
2                 Church , 134 Baker Rd City, State

I would like to get it to the code chunk below by splitting anytime there is a comma followed by space and then a number. However, I'm running into an issue where I'm removing the number.

        Location Name                        Address
0          Building A       100 First St City, State
1  Fire Station # 100  2 Apple Row, City, State, Zip
2              Church       134 Baker Rd City, State

This is the code I've been using

df['Location Name']= df['Location'].str.split('.,\s\d', expand=True)[0]
df['Address']= df['Location'].str.split('.,\s\d', expand=True)[1]

CodePudding user response：

You can use Series.str.extract:

df[['Location Name','Address']] = df['Location'].str.extract(r'^(.*?),\s(\d.*)', expand=True)

The ^(.*?),\s(\d.*) regex matches

^ - start of string
(.*?) - Group 1 ('Location Name'): any zero or more chars other than line break chars as few as possible
,\s - comma and whitespace
(\d.*) - Group 1 ('Address'): digit and the rest of the line.

See the regex demo.

CodePudding user response：

Another simple solution to your problem is to use a positive lookahead. You want to check if there is a number ahead of your pattern, while not including the number in the match. Here's an example of a regex that solves your problem:

\s?,\s(?=\d)

Here, we optionally remove a trailing whitespace, then match a comma followed by whitespace.

The (?= ) is a positive lookahead, in this case we check for a following digit. If that's matched, the split will remove the comma and whitespace only.