I have below data set, I need to extract 9 numeric ID from the "Notes" column. below some of the code I tried, but sometimes I don't get the correct output. sometimes there are few spaces between numbers, or a symbol, or sometimes there are numeric values that are not part of the ID etc.. any idea how to do this more efficient?
DF['Output'] = DF['Notes'].str.replace(' ' ' ' ' ' '-', '')
DF['Output'] = DF['Notes'].str.replace(' ' ' ' '-', '')
DF['Output'] = DF['Notes'].str.replace(' ' '-', '')
DF['Output'] = DF['Notes'].str.replace('-', '')
DF['Output'] = DF['Notes'].str.replace('\D', ' ')
DF['Output'] = DF['Notes'].str.findall(r'(\d{9,})').apply(', '.join)
Notes | Expected Output |
---|---|
ab. 325% xyz | 0 |
GHY12345678 9 | 123456789 |
FTY 234567 891 | 234567891 |
BNM 567 891 524; 123 Ltd | 567891524 |
2.5%mnkl, 3234 56 78 9; TGH 1235 z | 323456789 |
RTF 956 327-12 8 TYP | 956327128 |
X Y Z 1.59% 2345 567 81; one 35 in | 234556781 |
VTO 126%, 12345 67 | 0 |
2.6% 1234 ABC 3456 1 2 4 91 | 345612491 |
CodePudding user response:
# replace known character in b/w the numbers with null
# extract the 9 digits
df['output']=(df['Notes'].str.replace(r'[\s|\-]','',regex=True)
.str.extract(r'(\d{9})').fillna(0))
df
Notes Expected Output output
0 ab. 325% xyz 0 0
1 GHY12345678 9 123456789 123456789
2 FTY 234567 891 234567891 234567891
3 BNM 567 891 524; 123 Ltd 567891524 567891524
4 2.5%mnkl, 3234 56 78 9; TGH 1235 z 323456789 323456789
5 RTF 956 327-12 8 TYP 956327128 956327128
6 X Y Z 1.59% 2345 567 81; one 35 in 234556781 234556781
7 VTO 126%, 12345 67 0 0
8 2.6% 1234 ABC 3456 1 2 4 91 345612491 345612491
CodePudding user response:
Using str.replace
to first strip off spaces and dashes, followed by str.extract
to find 9 digit numbers, we can try:
DF["Output"] = DF["Notes"].str.replace('[ -] ', '', regex=True)
.str.extract(r'(?<!\d)(\d{9})(?!\d)')
For an explanation of the regex pattern, we place non digit boundary markers around \d{9}
to ensure that we only match 9 digit numbers. Here is how the regex works:
(?<!\d)
ensure that what precedes is a non digit OR the start of the column(\d{9})
match and capture exactly 9 digits(?!\d)
ensure that what follows is a non digit OR the end of the column