I want to extract store name and location from the to_clean
column. I would like to use some regex pattern that accounts for all the different patterns and stores the store name in the store
column and the location name in the location
column.
Something like this:
to_clean = ['ACTIVE BODY #1 - WEST VAN', 'BUY-LOW (80018) - KINGSGATE', 'CHOICES - CAMBIE #906',
'CAUSEWAY MASSET - CAMBIE', 'COMMUNITY NATURAL S2 - CHINOOK', 'MEINHARDT - GRANVILLE (80068)',
'WFM - CAMBIE - 10248']
store = ['ACTIVE BODY', 'BUY-LOW', 'CHOICES', 'CAUSEWAY MASSET', 'COMMUNITY NATURAL', 'MEINHARDT', 'WFM']
location = ['WEST VAN', 'KINGSGATE', 'CAMBIE','CAMBIE', 'CHINOOK', 'GRANVILLE', 'CAMBIE']
data = pd.DataFrame(list(zip(to_clean, store, location)), columns=['to_clean', 'store', 'location'])
data
Any help is greatly appreciated.
Thank you
CodePudding user response:
Try (I stored the new values under _new
columns):
data["store_new"] = (
data["to_clean"]
.str.extract(r"^((?:[A-Z]|(?<=[A-Z])-|\s) )\b")[0]
.str.strip()
)
data["location_new"] = (
data["to_clean"]
.str.extract(r"((?:[A-Z]|(?<=[A-Z])-|\s) )[^A-Z]*$")[0]
.str.strip()
)
print(data)
Prints:
to_clean store location store_new location_new
0 ACTIVE BODY #1 - WEST VAN ACTIVE BODY WEST VAN ACTIVE BODY WEST VAN
1 BUY-LOW (80018) - KINGSGATE BUY-LOW KINGSGATE BUY-LOW KINGSGATE
2 CHOICES - CAMBIE #906 CHOICES CAMBIE CHOICES CAMBIE
3 CAUSEWAY MASSET - CAMBIE CAUSEWAY MASSET CAMBIE CAUSEWAY MASSET CAMBIE
4 COMMUNITY NATURAL S2 - CHINOOK COMMUNITY NATURAL CHINOOK COMMUNITY NATURAL CHINOOK
5 MEINHARDT - GRANVILLE (80068) MEINHARDT GRANVILLE MEINHARDT GRANVILLE
6 WFM - CAMBIE - 10248 WFM CAMBIE WFM CAMBIE
CodePudding user response:
here is one way to do it
Assumptions:
- The store and location are separate by hyphen surrounded by spaces on both sides
# split on ' - '
df[['store-1', 'location-1']] = df['to_clean'].str.split(r' - ',1, expand=True)
# remove (),numbers
df[['store-1', 'location-1']]=df[['store-1', 'location-1']].replace(r'#\d |\(\d \)|\-\s\d ', '', regex=True)
df
to_clean store location store-1 location-1
0 ACTIVE BODY #1 - WEST VAN ACTIVE BODY WEST VAN ACTIVE BODY WEST VAN
1 BUY-LOW (80018) - KINGSGATE BUY-LOW KINGSGATE BUY-LOW KINGSGATE
2 CHOICES - CAMBIE #906 CHOICES CAMBIE CHOICES CAMBIE
3 CAUSEWAY MASSET - CAMBIE CAUSEWAY MASSET CAMBIE CAUSEWAY MASSET CAMBIE
4 COMMUNITY NATURAL S2 - CHINOOK COMMUNITY NATURAL CHINOOK COMMUNITY NATURAL S2 CHINOOK
5 MEINHARDT - GRANVILLE (80068) MEINHARDT GRANVILLE MEINHARDT GRANVILLE
6 WFM - CAMBIE - 10248 WFM CAMBIE WFM CAMBIE