Home > Net >  Extract store and location using regex and pandas
Extract store and location using regex and pandas

Time:11-03

I want to extract store name and location from the to_clean column. I would like to use some regex pattern that accounts for all the different patterns and stores the store name in the store column and the location name in the location column.

Something like this:

to_clean = ['ACTIVE BODY #1 - WEST VAN', 'BUY-LOW (80018) - KINGSGATE', 'CHOICES - CAMBIE #906',
     'CAUSEWAY MASSET - CAMBIE', 'COMMUNITY NATURAL S2 - CHINOOK', 'MEINHARDT - GRANVILLE (80068)',
     'WFM - CAMBIE - 10248']

store = ['ACTIVE BODY', 'BUY-LOW', 'CHOICES', 'CAUSEWAY MASSET', 'COMMUNITY NATURAL', 'MEINHARDT', 'WFM']

location = ['WEST VAN', 'KINGSGATE', 'CAMBIE','CAMBIE', 'CHINOOK', 'GRANVILLE', 'CAMBIE']

data = pd.DataFrame(list(zip(to_clean, store, location)), columns=['to_clean', 'store', 'location'])
data

enter image description here

Any help is greatly appreciated.

Thank you

CodePudding user response:

Try (I stored the new values under _new columns):

data["store_new"] = (
    data["to_clean"]
    .str.extract(r"^((?:[A-Z]|(?<=[A-Z])-|\s) )\b")[0]
    .str.strip()
)
data["location_new"] = (
    data["to_clean"]
    .str.extract(r"((?:[A-Z]|(?<=[A-Z])-|\s) )[^A-Z]*$")[0]
    .str.strip()
)

print(data)

Prints:

                         to_clean              store   location          store_new location_new
0       ACTIVE BODY #1 - WEST VAN        ACTIVE BODY   WEST VAN        ACTIVE BODY     WEST VAN
1     BUY-LOW (80018) - KINGSGATE            BUY-LOW  KINGSGATE            BUY-LOW    KINGSGATE
2           CHOICES - CAMBIE #906            CHOICES     CAMBIE            CHOICES       CAMBIE
3        CAUSEWAY MASSET - CAMBIE    CAUSEWAY MASSET     CAMBIE    CAUSEWAY MASSET       CAMBIE
4  COMMUNITY NATURAL S2 - CHINOOK  COMMUNITY NATURAL    CHINOOK  COMMUNITY NATURAL      CHINOOK
5   MEINHARDT - GRANVILLE (80068)          MEINHARDT  GRANVILLE          MEINHARDT    GRANVILLE
6            WFM - CAMBIE - 10248                WFM     CAMBIE                WFM       CAMBIE

CodePudding user response:

here is one way to do it

Assumptions:

  1. The store and location are separate by hyphen surrounded by spaces on both sides
# split  on ' - '
df[['store-1', 'location-1']] = df['to_clean'].str.split(r' - ',1, expand=True) 

# remove (),numbers
df[['store-1', 'location-1']]=df[['store-1', 'location-1']].replace(r'#\d |\(\d \)|\-\s\d ', '', regex=True) 
df




    to_clean                                    store   location        store-1           location-1
0   ACTIVE BODY #1 - WEST VAN               ACTIVE BODY     WEST VAN    ACTIVE BODY       WEST VAN
1   BUY-LOW (80018) - KINGSGATE                 BUY-LOW     KINGSGATE   BUY-LOW           KINGSGATE
2   CHOICES - CAMBIE #906                       CHOICES     CAMBIE      CHOICES           CAMBIE
3   CAUSEWAY MASSET - CAMBIE            CAUSEWAY MASSET     CAMBIE      CAUSEWAY MASSET   CAMBIE
4   COMMUNITY NATURAL S2 - CHINOOK    COMMUNITY NATURAL     CHINOOK     COMMUNITY NATURAL S2 CHINOOK
5   MEINHARDT - GRANVILLE (80068)             MEINHARDT     GRANVILLE   MEINHARDT         GRANVILLE
6   WFM - CAMBIE - 10248                            WFM     CAMBIE      WFM               CAMBIE
  • Related