I have a pandas column like below:
df['description']
0. PRAIRIE HIGHLANDS SIXTH PLAT Lt: 156 PIN# DP73770000 0156 312 ABC
1. PRAIRIE VILLAGE PIN# OP55000034 0020A Rmrk: PT OF
2. Sub: HOLLY GREEN Lt: 14 Bl: 1 PIN# DP34500001 0D14
3. FAIRWAY PIN# GP20000006 0029 Rmrk: W
I want to extract out PIN#
such that expected results:
DP73770000 0156 312
OP55000034 0020A
DP34500001 0D14
GP20000006 0029
What I have tried:
df['PIN'] = df['description'].str.extract(r'\b(?:PIN# ?) (\w (?:(?:\s*[ ]) \s*\d )*)\b')
Results I got:
DP73770000 0156 312
OP55000034
DP34500001
GP20000006 0029
The ones that have letters in between the numbers after space are not getting picked up. I have tried a couple other regex as well to no success. How can I resolve this? Please help.
CodePudding user response:
I would use str.extract
as follows:
df["PIN"] = df["description"].str.extract(r'PIN#((?: [A-Z0-9]*[0-9][A-Z0-9]*)*)')
Here is a link to a running regex demo showing that the logic is working.