Home > Mobile >  Extract ID numeric value from a column
Extract ID numeric value from a column

Time:10-30

I have below data set, I need to extract 9 numeric ID from the "Notes" column. below some of the code I tried, but sometimes I don't get the correct output. sometimes there are few spaces between numbers, or a symbol, or sometimes there are numeric values that are not part of the ID etc.. any idea how to do this more efficient?

DF['Output'] = DF['Notes'].str.replace(' ' ' ' ' ' '-', '')
DF['Output'] = DF['Notes'].str.replace(' ' ' ' '-', '')
DF['Output'] = DF['Notes'].str.replace(' ' '-', '')
DF['Output'] = DF['Notes'].str.replace('-', '')

DF['Output'] = DF['Notes'].str.replace('\D', ' ')

DF['Output'] = DF['Notes'].str.findall(r'(\d{9,})').apply(', '.join)
Notes Expected Output
ab. 325% xyz 0
GHY12345678 9 123456789
FTY 234567 891 234567891
BNM 567 891 524; 123 Ltd 567891524
2.5%mnkl, 3234 56 78 9; TGH 1235 z 323456789
RTF 956 327-12 8 TYP 956327128
X Y Z 1.59% 2345 567 81; one 35 in 234556781
VTO 126%, 12345 67 0
2.6% 1234 ABC 3456 1 2 4 91 345612491

CodePudding user response:

# replace known character in b/w the numbers with null
# extract the 9 digits
df['output']=(df['Notes'].str.replace(r'[\s|\-]','',regex=True)
              .str.extract(r'(\d{9})').fillna(0))
df
    Notes                           Expected Output     output
0   ab. 325% xyz                                 0          0
1   GHY12345678 9                        123456789  123456789
2   FTY 234567 891                       234567891  234567891
3   BNM 567 891 524; 123 Ltd             567891524  567891524
4   2.5%mnkl, 3234 56 78 9; TGH 1235 z   323456789  323456789
5   RTF 956 327-12 8 TYP                 956327128  956327128
6   X Y Z 1.59% 2345 567 81; one 35 in   234556781  234556781
7   VTO 126%, 12345 67                           0  0
8   2.6% 1234 ABC 3456 1 2 4 91          345612491  345612491

CodePudding user response:

Using str.replace to first strip off spaces and dashes, followed by str.extract to find 9 digit numbers, we can try:

DF["Output"] = DF["Notes"].str.replace('[ -] ', '', regex=True)
                          .str.extract(r'(?<!\d)(\d{9})(?!\d)')

For an explanation of the regex pattern, we place non digit boundary markers around \d{9} to ensure that we only match 9 digit numbers. Here is how the regex works:

  • (?<!\d) ensure that what precedes is a non digit OR the start of the column
  • (\d{9}) match and capture exactly 9 digits
  • (?!\d) ensure that what follows is a non digit OR the end of the column
  • Related