Regular expression working in Pythex.com but not in pandas-CodePudding

I'm having trouble applying a regex function to a column in a python dataframe. It works fine in Pythex online editor.

Here is the head of my dataframe -

ID	Text
1	UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605
2	IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262
3	COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875

I've tried the following code to create another column which gives us just the address. but the new column is showing up as empty.

df['Address']=df['Text'].str.findall('[0-9]{2,6}(?:\s \S ){3,8}\s{1,}\b(?:FLORIDA|NORTH CAROLINA|NEW YORK)\b')

The desired output should look like -

ID	Text	Address
1	UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605	25088 KITTAN DRIVE NORTH CAROLINA
2	IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262	26 W STREET 7TH HIGHWAY ORLANDO FLORIDA
3	COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875	757979 EAST TYRON BLVD NEW YORK NEW YORK

Thanks in advance.

CodePudding user response：

If your text data are examples of this pattern, you can try the following code:

df['Address']=df['Text'].str.findall(r'[0-9]{2,6}(.*?)(?:\d $)')

CodePudding user response：

You could use a pattern to extract the values that you want from column Text:

\b([0-9]{2,6}\b.*?(?:FLORIDA|NORTH CAROLINA|NEW YORK)) \d

The pattern matches:

\b A word boundary to prevent a partial word match
( Capture group 1
- [0-9]{2,6}\b Match 2-6 digits followed by a word boundary
- .*?(?:FLORIDA|NORTH CAROLINA|NEW YORK) Match as least as possible chars until you can match one of the alternatives
) \d Close group 1, and match a space and a digit

See a regex demo.

For example

import pandas as pd

items = [
    [1, "UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605"],
    [2, "IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262"],
    [3, "COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875"]
]

df = pd.DataFrame(items, columns=["ID", "Text"])
df["Address"] = df["Text"].str.extract(
    r'\b([0-9]{2,6}\b.*?(?:FLORIDA|NORTH CAROLINA|NEW YORK)) \d'
)
print(df)

Output

   ID                                               Text                                   Address
0   1  UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE ...         25088 KITTAN DRIVE NORTH CAROLINA
1   2  IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORI...   26 W STREET 7TH HIGHWAY ORLANDO FLORIDA
2   3  COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK ...  757979 EAST TYRON BLVD NEW YORK NEW YORK