I have a column of data containing ID numbers that are between 4 and 10 digits long. The numbers are entered manually, so there are no consistent delimiters; in some cases, ID numbers are separated by a comment. With the caveat that the real data is unpredictable, here is an example of the values as a Python list:
[ '13796352',
'2113146, 2113148, 2113147',
'asdf ee A070_321 on 4.3.99 - MC',
'blah blah3',
'1914844\xa0, 3310339, 1943270, 2190351, 1215262',
'789702/ 89057',
'1 of 5 blah blah',
'688327/ 6712563/> 5425153',
'1820196/1964143/ 249805/ 300510',
'731862\n\nAccepted: 176666\nRejected: 8787' ]
Here is the regex that is not working:
r'^[0-9]{4,10}([\s\S]*)[[0-9]{4,10}]*'
The desired output (looping through the list) is:
[''],
[', ',', '],
[''],
[''],
['\xa0, ',', ',', ',', '],
['/ '],
[''],
['/ ', '/> '],
['/', '/ ', '/ '],
['\n\nAccepted: ','\nRejected: ']
I am not getting this with the regex above. What am I doing wrong?
CodePudding user response:
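This is just a quick sketch, but it comes pretty close to what you want. Your pattern fails for three reasons: `^` anchors it to the start of the string, `[\s\S]*` greedily consumes everything after the first id, and `[[0-9]{4,10}]*` is a character class followed by quantifiers, not a group. Instead, match runs of 4 or more digits, split each string at the matches, and then:
- drop empty strings
- return an empty list for entries without any matches.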
>>> import re
>>> data = [...]  # your sample
>>> num_re = re.compile(r'\d{4,}')
>>> [[x for x in num_re.split(d) if x] if num_re.search(d) else [] for d in data]
[[],
[', ', ', '],
[],
[],
['\xa0, ', ', ', ', ', ', '],
['/ '],
[],
['/ ', '/> '],
['/', '/ ', '/ '],
['\n\nAccepted: ', '\nRejected: ']]
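Note that `\d{4,}` has no upper bound, so an 11-digit run would also count as an id. If the 4 to 10 digit limit matters, the same split idea can be wrapped in a small helper. This is only a sketch: the names `ID_RE` and `delimiters` are made up for illustration, and it assumes an id is a digit run with no digit directly before or after it.

```python
import re

# Assumption: an id is a run of 4 to 10 digits with no digit on either side.
ID_RE = re.compile(r'(?<!\d)\d{4,10}(?!\d)')

def delimiters(value):
    """Return the non-empty text between ids, or [] if value contains no id."""
    if not ID_RE.search(value):
        return []
    # split() yields empty strings at the edges when the value starts or
    # ends with an id; filter those out.
    return [part for part in ID_RE.split(value) if part]

print(delimiters('2113146, 2113148, 2113147'))        # [', ', ', ']
print(delimiters('1820196/1964143/ 249805/ 300510'))  # ['/', '/ ', '/ ']
print(delimiters('blah blah3'))                       # []
```

Be aware that any text before the first id or after the last one is also returned, because `split` does not distinguish it from text between ids.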
CodePudding user response:
If you want to extract the ids, you could use for example:
import re
data = [
'13796352',
'2113146, 2113148, 2113147',
'asdf ee A070_321 on 4.3.99 - MC',
'blah blah3',
'1914844\xa0, 3310339, 1943270, 2190351, 1215262',
'789702/ 89057',
'1 of 5 blah blah',
'688327/ 6712563/> 5425153',
'1820196/1964143/ 249805/ 300510',
'731862\n\nAccepted: 176666\nRejected: 8787'
]
for el in data:
print(re.findall(r'(?<!\d)\d{4,10}(?!\d)', el))
Resulting in:
['13796352']
['2113146', '2113148', '2113147']
[]
[]
['1914844', '3310339', '1943270', '2190351', '1215262']
['789702', '89057']
[]
['688327', '6712563', '5425153']
['1820196', '1964143', '249805', '300510']
['731862', '176666', '8787']
(?<!\d)\d{4,10}(?!\d)
means: match a sequence of 4 to 10 digits that is neither preceded nor followed by a digit, so a run of 11 or more digits is skipped entirely instead of being matched in part.
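To see what the lookarounds buy you, here is a small comparison against the same pattern without them. The sample string is invented for the demonstration and is not from the question's data.

```python
import re

bounded = re.compile(r'(?<!\d)\d{4,10}(?!\d)')  # with lookarounds
unbounded = re.compile(r'\d{4,10}')             # without lookarounds

# Hypothetical input: a 12-digit run that should not count as an id.
sample = 'A070_321 and 123456789012 and 89057'

print(unbounded.findall(sample))  # ['1234567890', '89057']
print(bounded.findall(sample))    # ['89057']
```

Without the lookarounds, the first 10 digits of the 12-digit run are reported as an id; with them, the whole run is rejected.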