I would like to find all words in a text that have more than one uppercase letter. So far, I am only checking whether the last character is uppercase:
\b.*[A-Z]\b
It would be more precise to require that either the last letter is uppercase or the word contains at least two uppercase letters in total.
CodePudding user response:
You can use
re.findall(r'\b(?:[a-z]*[A-Z]){2}[a-zA-Z]*\b', text)
See the regex demo. Details:
\b - a word boundary
(?:[a-z]*[A-Z]){2} - two sequences of zero or more lowercase letters followed by an uppercase letter
[a-zA-Z]* - zero or more ASCII letters
\b - a word boundary
See the Python demo:
import re
text = "A VeRy LoNG SenTence Here"
print(re.findall(r'\b(?:[a-z]*[A-Z]){2}[a-zA-Z]*\b', text))
# => ['VeRy', 'LoNG', 'SenTence']
A fully Unicode-aware regex is possible with the PyPI regex library (install it in your terminal/console with pip install regex):
import regex
text = "Да, ЭтО ОченЬ ДЛинное предложение."
print(regex.findall(r'\b(?:\p{Ll}*\p{Lu}){2}\p{L}*\b', text))
# => ['ЭтО', 'ОченЬ', 'ДЛинное']
See this Python demo.
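If installing a third-party package is not an option, a rough stdlib-only sketch is to split the text into letter runs and count the uppercase characters with str.isupper. This is a different technique from the \p{Lu} pattern above, not a drop-in equivalent: the function name words_with_multiple_uppercase and its minimum parameter are made up for illustration, and the word splitting only approximates the \b behaviour (digits and underscores split a word here instead of preventing a match).

```python
import re

def words_with_multiple_uppercase(text, minimum=2):
    """Return words that contain at least `minimum` uppercase characters.

    Hypothetical helper for illustration, not part of any library.
    """
    # [^\W\d_]+ matches runs of word characters other than decimal digits
    # and the underscore, which is approximately "runs of Unicode letters".
    words = re.findall(r'[^\W\d_]+', text)
    # str.isupper is Unicode-aware, so Cyrillic capitals are counted too.
    return [w for w in words if sum(ch.isupper() for ch in w) >= minimum]

print(words_with_multiple_uppercase("Да, ЭтО ОченЬ ДЛинное предложение."))
# => ['ЭтО', 'ОченЬ', 'ДЛинное']
print(words_with_multiple_uppercase("A VeRy LoNG SenTence Here"))
# => ['VeRy', 'LoNG', 'SenTence']
```

Counting instead of matching also makes the threshold easy to change, for example minimum=3.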
CodePudding user response:
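\b(\w*[A-Z]\w*[A-Z]\w*|\w*[A-Z])\b
Explanation: this matches either of two things. The first alternative is a word with at least two uppercase letters: zero or more word characters (\w), an uppercase letter, zero or more word characters, a second uppercase letter, and finally zero or more word characters. The second alternative is a word that ends in an uppercase letter. It is adapted from your regex, with .* replaced by \w*, because .* is greedy and not limited to word characters, so it can run across several words up to the last uppercase letter on the line. Note that \w equals [A-Za-z0-9_] only in ASCII mode (re.ASCII in Python 3); by default it also matches non-ASCII letters and digits.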
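A quick Python check of this two-alternative idea, as a sketch rather than a definitive version. It makes three assumptions of its own: the second alternative uses \w* so that a match stays inside one word, the group is non-capturing so that re.findall returns whole matches, and re.ASCII is set so that \w really means [A-Za-z0-9_]. The sample text is invented for the demo.

```python
import re

# Either a word with at least two uppercase letters anywhere in it,
# or a word that ends in an uppercase letter.
PATTERN = re.compile(r'\b(?:\w*[A-Z]\w*[A-Z]\w*|\w*[A-Z])\b', re.ASCII)

text = "A VeRy LoNG SenTence Here, plus endS and plain words"
matches = PATTERN.findall(text)
print(matches)
# => ['A', 'VeRy', 'LoNG', 'SenTence', 'endS']
```

The single-letter word 'A' is included because it ends in an uppercase letter, while 'Here' is skipped: it has only one uppercase letter and does not end in one.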