Regex - Find successive 'words' containing at least 1 capital letter, one digit or one spe

Time:10-10

I am trying to extract sequences of words containing at least 1 item of the following:

  • Uppercase character.
  • Digit
  • ':' or '-'

For example, given the following phrase:

  • aBC has been contacting Maria and James where their DDD Code for system DB-54:ABB is 12343-4.

I would like to extract the following items:

  • aBC
  • Maria
  • James
  • DDD Code
  • DB-54:ABB
  • 12343-4

So far, I have the following code:

import re
re.findall(r'((\S*[A-Z|0-9|\:|\-]\w*)([\, |\.])?)', 'aBC has been contacting Maria and ere our DDD Code for system DB-54:ABB is 12343-4.')

Which returns:

[('aBC ', 'aBC', ' '),
 ('Maria ', 'Maria', ' '),
 ('DDD ', 'DDD', ' '),
 ('Code ', 'Code', ' '),
 ('DB-54:ABB ', 'DB-54:ABB', ' '),
 ('12343-4.', '12343-4', '.')]

This returns all of the desired items, except that it splits DDD and Code into separate matches. My goal is to group together consecutive words that each contain one of the items listed above. 'DDD' and 'Code' both contain a capital letter and are consecutive, so they should belong to the same string.

CodePudding user response:

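You could add + to repeat the pattern. I also simplified it a bit, since you used backslashes where they aren't needed. This gives the 6 matches you want; note that in Python the capture groups only keep the last repetition, so read the whole match (group 0) rather than the groups:

((\S*[A-Z0-9:-]\w*)($|[ ,.]))+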

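For what it's worth, here is a minimal sketch of how that repeated pattern could be used from Python. It assumes the quantifier after the outer group is +, and it uses re.finditer instead of re.findall so that the whole run is available as group(0). The helper name extract_runs is made up for illustration.

```python
import re

# Pattern from the answer, assuming a trailing + that repeats the whole group.
PATTERN = re.compile(r'((\S*[A-Z0-9:-]\w*)($|[ ,.]))+')

def extract_runs(text):
    """Return each run of consecutive qualifying words as one string."""
    # group(0) is the full run; the inner groups only hold the last repetition.
    # The match includes the trailing separator, so strip it off.
    return [m.group(0).rstrip(' ,.') for m in PATTERN.finditer(text)]

phrase = ('aBC has been contacting Maria and James where their '
          'DDD Code for system DB-54:ABB is 12343-4.')
print(extract_runs(phrase))
# ['aBC', 'Maria', 'James', 'DDD Code', 'DB-54:ABB', '12343-4']
```

With re.findall the same pattern would return tuples of groups, which is why 'DDD' would be lost from the 'DDD Code' run there.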

CodePudding user response:

This doesn't split consecutive matches (text is the input phrase):

result = re.findall(r'(?:[\w0-9]*[A-Z0-9\-:]+[\w0-9]*\s*)+', text)

But you may have to strip the whitespace (in Python 3, map returns a lazy iterator, so wrap it in list):

list(map(str.strip, result))
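Putting this answer together end to end, a rough sketch might look like the following. It assumes both quantifiers in the pattern are +, and the variable names text and cleaned are only for illustration. Because the group is non-capturing, re.findall returns the full matches directly.

```python
import re

text = ('aBC has been contacting Maria and James where their '
        'DDD Code for system DB-54:ABB is 12343-4.')

# Non-capturing group, repeated: each repetition is one qualifying word
# plus any whitespace after it (both + quantifiers are assumed here).
result = re.findall(r'(?:[\w0-9]*[A-Z0-9\-:]+[\w0-9]*\s*)+', text)

# Each run keeps the whitespace that followed it, so strip it off.
cleaned = list(map(str.strip, result))
print(cleaned)
# ['aBC', 'Maria', 'James', 'DDD Code', 'DB-54:ABB', '12343-4']
```

Unlike the \S-based pattern, this one never consumes the trailing '.', so only whitespace needs stripping.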