I'm having trouble applying a regex function to a column in a python dataframe. It works fine in Pythex online editor.
Here is the head of my dataframe -
ID | Text |
---|---|
1 | UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605 |
2 | IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262 |
3 | COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875 |
I've tried the following code to create another column which gives us just the address. but the new column is showing up as empty.
df['Address']=df['Text'].str.findall('[0-9]{2,6}(?:\s \S ){3,8}\s{1,}\b(?:FLORIDA|NORTH CAROLINA|NEW YORK)\b')
The desired output should look like -
ID | Text | Address |
---|---|---|
1 | UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605 | 25088 KITTAN DRIVE NORTH CAROLINA |
2 | IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262 | 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA |
3 | COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875 | 757979 EAST TYRON BLVD NEW YORK NEW YORK |
Thanks in advance.
CodePudding user response:
If your text data are examples of this pattern, you can try the following code:
df['Address']=df['Text'].str.findall(r'[0-9]{2,6}(.*?)(?:\d $)')
CodePudding user response:
You could use a pattern to extract the values that you want from column Text:
\b([0-9]{2,6}\b.*?(?:FLORIDA|NORTH CAROLINA|NEW YORK)) \d
The pattern matches:
\b
A word boundary to prevent a partial word match(
Capture group 1[0-9]{2,6}\b
Match 2-6 digits followed by a word boundary.*?(?:FLORIDA|NORTH CAROLINA|NEW YORK)
Match as least as possible chars until you can match one of the alternatives
) \d
Close group 1, and match a space and a digit
See a regex demo.
For example
import pandas as pd
items = [
[1, "UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605"],
[2, "IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262"],
[3, "COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875"]
]
df = pd.DataFrame(items, columns=["ID", "Text"])
df["Address"] = df["Text"].str.extract(
r'\b([0-9]{2,6}\b.*?(?:FLORIDA|NORTH CAROLINA|NEW YORK)) \d'
)
print(df)
Output
ID Text Address
0 1 UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE ... 25088 KITTAN DRIVE NORTH CAROLINA
1 2 IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORI... 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA
2 3 COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK ... 757979 EAST TYRON BLVD NEW YORK NEW YORK