I am trying to extract sequences of words that contain at least one of the following:
- An uppercase character
- A digit
- ':' or '-'
For example, given the following phrase:
- aBC has been contacting Maria and James where their DDD Code for system DB-54:ABB is 12343-4.
I would like to extract the following items:
- aBC
- Maria
- James
- DDD Code
- DB-54:ABB
- 12343-4
So far, I have the following code:
import re
re.findall(r'((\S*[A-Z|0-9|\:|\-]\w*)([\, |\.])?)', 'aBC has been contacting Maria and ere our DDD Code for system DB-54:ABB is 12343-4.')
Which returns:
[('aBC ', 'aBC', ' '),
('Maria ', 'Maria', ' '),
('DDD ', 'DDD', ' '),
('Code ', 'Code', ' '),
('DB-54:ABB ', 'DB-54:ABB', ' '),
('12343-4.', '12343-4', '.')]
This returns all of the desired items, except that it splits 'DDD' and 'Code'. My goal is to group together consecutive words that contain any of the items mentioned above. 'DDD' and 'Code' both contain a capital letter and are consecutive, so they should belong to the same string.
CodePudding user response:
You could add + to repeat the pattern. I simplified it a bit, since you used backslashes where they are not needed. This will result in the 6 capture groups you want:
((\S*[A-Z0-9:-]\w*)($|[ ,.]))
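To illustrate the simplification, here is a small sketch comparing the character class from the question with the simplified one. Inside square brackets, | is a literal pipe rather than alternation, and : does not need escaping, so the two classes differ only in whether they accept a pipe. The variable names are only for illustration.

```python
import re

# Character class as written in the question, and the simplified version.
original_class = re.compile(r'[A-Z|0-9|\:|\-]')
simplified_class = re.compile(r'[A-Z0-9:-]')

# Inside [...] the pipe is an ordinary character, so the original class
# also accepts a literal '|', which was probably not intended.
pipe_matches_original = bool(original_class.search('|'))
pipe_matches_simplified = bool(simplified_class.search('|'))

# For the characters the question cares about, both classes agree.
samples = ['A', '7', ':', '-', 'a', ' ']
agreement = all(
    bool(original_class.search(s)) == bool(simplified_class.search(s))
    for s in samples
)

print(pipe_matches_original, pipe_matches_simplified, agreement)
```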
CodePudding user response:
This doesn't split consecutive matches:
result = re.findall(r'(?:[\w0-9]*[A-Z0-9\-:]+[\w0-9]*\s*)+', text)
But you may have to strip the whitespace:
result = list(map(str.strip, result))
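For reference, here is a self-contained sketch of the same idea that avoids the stripping step by only allowing whitespace between qualifying words, never after the last one. It is one possible variant, not the only way to do it. It assumes a word consists of word characters, ':' and '-', and that "uppercase" means ASCII A-Z; TOKEN, PATTERN and extract_groups are names made up for this example.

```python
import re

# One qualifying word: word characters, ':' or '-', with at least one
# uppercase ASCII letter, digit, ':' or '-' somewhere in it (assumption).
TOKEN = r'[\w:-]*[A-Z0-9:-][\w:-]*'

# A group is one qualifying word followed by any number of further
# qualifying words, each separated by whitespace.
PATTERN = re.compile(rf'{TOKEN}(?:\s+{TOKEN})*')


def extract_groups(text):
    """Return runs of consecutive qualifying words as single strings."""
    # The pattern has no capturing groups, so findall returns full matches.
    return PATTERN.findall(text)


sentence = ('aBC has been contacting Maria and James where their '
            'DDD Code for system DB-54:ABB is 12343-4.')
groups = extract_groups(sentence)
print(groups)
# ['aBC', 'Maria', 'James', 'DDD Code', 'DB-54:ABB', '12343-4']
```

Because the whitespace is only matched when another qualifying word follows, no trailing spaces end up in the results, and the final period after 12343-4 is left out since '.' is not part of a word here.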