Extract ID numeric value from a column-CodePudding

I have below data set, I need to extract 9 numeric ID from the "Notes" column. below some of the code I tried, but sometimes I don't get the correct output. sometimes there are few spaces between numbers, or a symbol, or sometimes there are numeric values that are not part of the ID etc.. any idea how to do this more efficient?

DF['Output'] = DF['Notes'].str.replace(' ' ' ' ' ' '-', '')
DF['Output'] = DF['Notes'].str.replace(' ' ' ' '-', '')
DF['Output'] = DF['Notes'].str.replace(' ' '-', '')
DF['Output'] = DF['Notes'].str.replace('-', '')

DF['Output'] = DF['Notes'].str.replace('\D', ' ')

DF['Output'] = DF['Notes'].str.findall(r'(\d{9,})').apply(', '.join)

Notes	Expected Output
ab. 325% xyz	0
GHY12345678 9	123456789
FTY 234567 891	234567891
BNM 567 891 524; 123 Ltd	567891524
2.5%mnkl, 3234 56 78 9; TGH 1235 z	323456789
RTF 956 327-12 8 TYP	956327128
X Y Z 1.59% 2345 567 81; one 35 in	234556781
VTO 126%, 12345 67	0
2.6% 1234 ABC 3456 1 2 4 91	345612491

CodePudding user response：

# replace known character in b/w the numbers with null
# extract the 9 digits
df['output']=(df['Notes'].str.replace(r'[\s|\-]','',regex=True)
              .str.extract(r'(\d{9})').fillna(0))
df

    Notes                           Expected Output     output
0   ab. 325% xyz                                 0          0
1   GHY12345678 9                        123456789  123456789
2   FTY 234567 891                       234567891  234567891
3   BNM 567 891 524; 123 Ltd             567891524  567891524
4   2.5%mnkl, 3234 56 78 9; TGH 1235 z   323456789  323456789
5   RTF 956 327-12 8 TYP                 956327128  956327128
6   X Y Z 1.59% 2345 567 81; one 35 in   234556781  234556781
7   VTO 126%, 12345 67                           0  0
8   2.6% 1234 ABC 3456 1 2 4 91          345612491  345612491

CodePudding user response：

Using str.replace to first strip off spaces and dashes, followed by str.extract to find 9 digit numbers, we can try:

DF["Output"] = DF["Notes"].str.replace('[ -] ', '', regex=True)
                          .str.extract(r'(?<!\d)(\d{9})(?!\d)')

For an explanation of the regex pattern, we place non digit boundary markers around \d{9} to ensure that we only match 9 digit numbers. Here is how the regex works:

(?<!\d) ensure that what precedes is a non digit OR the start of the column
(\d{9}) match and capture exactly 9 digits
(?!\d) ensure that what follows is a non digit OR the end of the column