Home > Software design >  Using str.extract with regex on pandas df column
Using str.extract with regex on pandas df column

Time:06-07

I have some address info. stored in a pandas df column like below:

df['Addr']
LT 75 CEDAR WOOD 3RD PL
LTS 22,25 & 26 MULLINS CORNER
LTS 7 & 8
PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION

I want to extract lot information and create a new column so for the example above, my expected results are as below:

df['Lot']
75
22,25 & 26
7 & 8
22-23

This is my code:

df['Lot'] = df['Addr'].str.extract(r'\b(?:LOT|LT|LTS?) (\w (?:-\d )*)')

Results I'm getting is:

75
22
7
22-23

How can I modify my regex for expected results if at all possible? Please advise.

CodePudding user response:

You could use

\b(?:LOT|LTS?) (\d (?:(?:[-,]| & )\d )*)

Explanation

  • \b A word boundary
  • (?:LOT|LTS?) Match LOT or LT or LTS
  • ( Capture group 1
    • \d Match 1 digits
    • (?:(?:[-,]| & )\d )* Optionally repeat either - or , or & followed by 1 digits
  • ) Close group 1

Regex demo

data = [
    "LT 75 CEDAR WOOD 3RD PL",
    "LTS 22,25 & 26 MULLINS CORNER",
    "LTS 7 & 8",
    "PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION"
]
df = pd.DataFrame(data, columns = ['Addr'])

df['Lot'] = df['Addr'].str.extract(r'\b(?:LO?T|LTS?) (\d (?:(?:[-,]| & )\d )*)')

print(df)

Output

                                        Addr         Lot
0                    LT 75 CEDAR WOOD 3RD PL          75
1              LTS 22,25 & 26 MULLINS CORNER  22,25 & 26
2                                  LTS 7 & 8       7 & 8
3  PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION       22-23

If the - , and & can all be surrounded by optional whitespace chars, you might shorten the pattern to:

 \b(?:LOT|LTS?) (\d (?:\s*[-,&]\s*\d )*)\b

Regex demo

  • Related