Regular expression in Pandas: Get substring between a space and a colon

Time:01-07

I have a Pandas dataframe with the column store. It contains a list of stores that look like this:

H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX
H-E-B PLUS 39:2509 N MAIN ST,BELTON,TX

I want the store numbers, which are 721 and 39 in the examples above.

Here is my process for getting it:

  1. Find the position of the colon.
  2. Slice backwards until reaching a space.

How do I do this in Python/Pandas? I'm guessing that I need to use regex, but I have no idea how to start.

CodePudding user response:

You can use str.extract with the regex (\d+): to capture the digits immediately before a colon:

df['number'] = df['store'].str.extract(r'(\d+):', expand=False).astype(int)

Output:

                                             store  number
0  H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX     721
1           H-E-B PLUS 39:2509 N MAIN ST,BELTON,TX      39
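
One caveat: .astype(int) raises if any row fails to match, because str.extract returns NaN for that row. A minimal sketch of a more tolerant variant follows, assuming such rows can exist in your data. The third row is hypothetical and was added only to show the behaviour; it is not from the question.

```python
import pandas as pd

df = pd.DataFrame({"store": [
    "H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX",
    "H-E-B PLUS 39:2509 N MAIN ST,BELTON,TX",
    "UNKNOWN STORE",  # hypothetical row with no colon, for illustration only
]})

# (\d+): captures one or more digits that sit directly before a colon
extracted = df["store"].str.extract(r"(\d+):", expand=False)

# Convert to numbers, then to the nullable Int64 dtype so that
# rows without a match become <NA> instead of raising an error
df["number"] = pd.to_numeric(extracted).astype("Int64")
```

If every row is guaranteed to contain a store number, the plain .astype(int) version is simpler and gives an ordinary int column.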


CodePudding user response:

You can apply a function to the store column:

def retrieve_store_number(text):
    return text.split(":")[0].split()[-1]

df["store_number"] = df.store.apply(retrieve_store_number)
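
The same split logic can probably also be written with the vectorised .str accessor instead of apply, which mirrors the two steps in the question (cut at the colon, then take the last token before it). This is a sketch on the two sample rows, not a benchmarked replacement:

```python
import pandas as pd

df = pd.DataFrame({"store": [
    "H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX",
    "H-E-B PLUS 39:2509 N MAIN ST,BELTON,TX",
]})

df["store_number"] = (
    df["store"]
    .str.split(":", n=1).str[0]  # text before the first colon
    .str.split().str[-1]         # last whitespace-separated token
)
```

Note that both this version and the apply version return strings; add .astype(int) if you need integers.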

CodePudding user response:

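To use a regex in Pandas, you can use the str.extract method. The lookahead (?=:) requires a colon right after the digits without making it part of the match:

df['store_number'] = df['store'].str.extract(r'(\d+(?=:))')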
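
To see what the lookahead changes, here is a small comparison with the standard re module on one of the sample strings. It is only an illustration of the pattern, separate from the Pandas call above:

```python
import re

s = "H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX"

# Without a lookahead, the colon is part of the match
with_colon = re.search(r"\d+:", s).group()

# With a lookahead, the colon must follow but is not consumed
with_lookahead = re.search(r"\d+(?=:)", s).group()

# A capture group reaches the same digits while still matching the colon
captured = re.search(r"(\d+):", s).group(1)

print(with_colon, with_lookahead, captured)  # 721: 721 721
```

Since str.extract only returns the capture group, (\d+): and (\d+(?=:)) give the same result there.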

CodePudding user response:

You can do something along these lines:

import pandas as pd

def extract_number(string: str) -> int:
    # Take the first space-separated token that contains a colon
    # and convert the part before the colon to an int
    return [int(i.split(":")[0]) for i in string.split(" ") if ":" in i][0]

df = pd.DataFrame([["H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX"],
                   ["H-E-B PLUS 39:2509 N MAIN ST,BELTON,TX"]], columns=["store"])
df["number"] = df["store"].apply(extract_number)