I have some address info. stored in a pandas df column like below:
df['Addr']
LT 75 CEDAR WOOD 3RD PL
LTS 22,25 & 26 MULLINS CORNER
LTS 7 & 8
PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION
I want to extract lot information and create a new column so for the example above, my expected results are as below:
df['Lot']
75
22,25 & 26
7 & 8
22-23
This is my code:
df['Lot'] = df['Addr'].str.extract(r'\b(?:LOT|LT|LTS?) (\w (?:-\d )*)')
Results I'm getting is:
75
22
7
22-23
How can I modify my regex for expected results if at all possible? Please advise.
CodePudding user response:
You could use
\b(?:LOT|LTS?) (\d (?:(?:[-,]| & )\d )*)
Explanation
\b
A word boundary(?:LOT|LTS?)
MatchLOT
orLT
orLTS
(
Capture group 1\d
Match 1 digits(?:(?:[-,]| & )\d )*
Optionally repeat either-
or,
or&
followed by 1 digits
)
Close group 1
data = [
"LT 75 CEDAR WOOD 3RD PL",
"LTS 22,25 & 26 MULLINS CORNER",
"LTS 7 & 8",
"PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION"
]
df = pd.DataFrame(data, columns = ['Addr'])
df['Lot'] = df['Addr'].str.extract(r'\b(?:LO?T|LTS?) (\d (?:(?:[-,]| & )\d )*)')
print(df)
Output
Addr Lot
0 LT 75 CEDAR WOOD 3RD PL 75
1 LTS 22,25 & 26 MULLINS CORNER 22,25 & 26
2 LTS 7 & 8 7 & 8
3 PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION 22-23
If the -
,
and &
can all be surrounded by optional whitespace chars, you might shorten the pattern to:
\b(?:LOT|LTS?) (\d (?:\s*[-,&]\s*\d )*)\b