Regex to extract irregular delimiters

I have a column of data containing ID numbers that are between 4 and 10 digits long. However, these ID numbers are entered manually and have no consistent delimiters. In some cases, ID numbers are separated by a comment. With the caveat that the real data is unpredictable, here is an example of the values as a Python list.

[ '13796352',  
'2113146, 2113148, 2113147',  
'asdf ee A070_321 on 4.3.99 - MC',  
'blah blah3', 
'1914844\xa0, 3310339, 1943270, 2190351, 1215262',  
'789702/ 89057',  
'1 of 5 blah blah', 
'688327/ 6712563/> 5425153',  
'1820196/1964143/ 249805/ 300510',
'731862\n\nAccepted: 176666\nRejected: 8787' ]

Here is the regex that is not working:

r'^[0-9]{4,10}([\s\S]*)[[0-9]{4,10}]*'

The desired output (looping through the list) is:

[''],
[', ',', '],
[''], 
[''],
['\xa0, ',', ',', ',', '], 
['/ '],  
[''], 
['/ ','/> '],  
['/','/ ','/ '],
['\n\nAccepted: ','\nRejected: ']

I am not getting this with the regex above. What am I doing wrong?

CodePudding user response：

This is just a quick sketch, but it looks pretty close to what you want. Basically, match runs of 4 or more digits, split at the matches, and exclude empty strings and entries without any matches.

>>> import re
>>> data = [...] # your sample
>>> num_re = re.compile(r'\d{4,}')
>>> [[x for x in num_re.split(d) if x] if num_re.search(d) else [] for d in data]
[[],
 [', ', ', '],
 [],
 [],
 ['\xa0, ', ', ', ', ', ', '],
 ['/ '],
 [],
 ['/ ', '/> '],
 ['/', '/ ', '/ '],
 ['\n\nAccepted: ', '\nRejected: ']]
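
For use outside the REPL, the same split-and-filter idea can be wrapped in a small function. This is only a sketch: the names `NUM_RE` and `delimiters` are made up for illustration, and it keeps the `\d{4,}` pattern from above, so it does not enforce the 10-digit upper bound.

```python
import re

# Same pattern as above: any run of 4 or more digits counts as an id.
NUM_RE = re.compile(r'\d{4,}')

def delimiters(value):
    """Return the non-empty text between id-like digit runs, or [] if there are none."""
    if not NUM_RE.search(value):
        return []
    # re.split yields empty strings at the edges when the value starts or
    # ends with a match, so drop those.
    return [part for part in NUM_RE.split(value) if part]

print(delimiters('688327/ 6712563/> 5425153'))  # ['/ ', '/> ']
```

Note that any text before the first match or after the last one is also returned; the sample data happens not to exercise that case.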

CodePudding user response：

If you want to extract the IDs themselves, you could use, for example:

import re

data = [
  '13796352',  
  '2113146, 2113148, 2113147',  
  'asdf ee A070_321 on 4.3.99 - MC',  
  'blah blah3', 
  '1914844\xa0, 3310339, 1943270, 2190351, 1215262',  
  '789702/ 89057',  
  '1 of 5 blah blah', 
  '688327/ 6712563/> 5425153',  
  '1820196/1964143/ 249805/ 300510',
  '731862\n\nAccepted: 176666\nRejected: 8787'
]

for el in data:
  print(re.findall(r'(?<!\d)\d{4,10}(?!\d)', el))

Resulting in:

['13796352']
['2113146', '2113148', '2113147']
[]
[]
['1914844', '3310339', '1943270', '2190351', '1215262']
['789702', '89057']
[]
['688327', '6712563', '5425153']
['1820196', '1964143', '249805', '300510']
['731862', '176666', '8787']

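The pattern (?<!\d)\d{4,10}(?!\d) matches a sequence of 4 to 10 digits that is neither preceded nor followed by another digit, so a run of more than 10 digits is skipped entirely rather than partially matched.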
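
To see what the lookarounds change, here is a small comparison against the same quantifier without them. The `sample` string is invented for this illustration (a 12-digit run followed by a valid id) and is not taken from the question's data.

```python
import re

plain = re.compile(r'\d{4,10}')                  # no boundary checks
bounded = re.compile(r'(?<!\d)\d{4,10}(?!\d)')   # pattern from the answer above

# Hypothetical value: a 12-digit run, then a 7-digit id.
sample = '123456789012 / 2113146'

# Without lookarounds, the first 10 digits of the long run are matched.
print(plain.findall(sample))    # ['1234567890', '2113146']

# With lookarounds, the over-long run is rejected as a whole.
print(bounded.findall(sample))  # ['2113146']
```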