I am trying to match words that start with a capital letter, whether or not the word contains accented or other special characters. Currently, my pattern only matches words made of unaccented letters.
I don't need numbers or hyphens, just letters, including accented or special ones.
pattern = r"\b[A-Z][a-z]*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
name = re.findall(pattern, name)
name = " ".join(name)
CodePudding user response:
You need to run pip install regex in your console and then use:
import regex
pattern = r"\b\p{Lu}\p{Ll}*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
name = regex.findall(pattern, name)
name = " ".join(name)
Here,
\b - a word boundary
\p{Lu} - an uppercase letter
\p{Ll}* - zero or more lowercase letters.
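To see what the Lu and Ll categories cover, here is a small sketch using the standard unicodedata module. The sample characters are chosen only for illustration and are not from the question.

```python
import unicodedata

# General category of a few sample characters:
# Lu = uppercase letter, Ll = lowercase letter,
# Nd = decimal digit, Pd = dash punctuation.
samples = ["É", "é", "A", "z", "7", "-"]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

Accented letters fall into the same Lu/Ll categories as plain ASCII letters, while digits and hyphens do not, which is why the pattern above skips them.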
CodePudding user response:
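If you want to use the core "re" module in Python, one option is to list, alongside A-Z, every non-ASCII uppercase letter you expect. In Python 3, str patterns are already matched in Unicode mode, so the re.UNICODE flag passed to findall() is optional and is kept here only to be explicit. Note that \S* matches any non-whitespace character, so digits and hyphens inside a word will also be captured.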
For example:
import re
s = "Ébc Ánna Apple Cámara Corazón Señor"
name = re.findall(r"(\b[A-ZÀÁÉ]\S*\b)", s, re.UNICODE)
print(name)
Output:
['Ébc', 'Ánna', 'Apple', 'Cámara', 'Corazón', 'Señor']
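If listing every accented capital by hand is impractical, another possibility with only the standard library is to match letter-only words and then filter them with str.isupper() and str.islower(). This is a minimal sketch, not the approach from either answer. The names LETTER_WORD, is_capitalized and capitalized_words are made up for illustration, and the sample string is an assumption.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# i.e. a letter (accented letters included, since str patterns are Unicode-aware).
LETTER_WORD = re.compile(r"\b[^\W\d_]+\b")


def is_capitalized(word):
    """True if the first letter is uppercase and the rest, if any, is lowercase."""
    return word[0].isupper() and (len(word) == 1 or word[1:].islower())


def capitalized_words(text):
    """Return letter-only words that start with an uppercase letter."""
    return [w for w in LETTER_WORD.findall(text) if is_capitalized(w)]


s = "Ébc Ánna apple Cámara 123 Corazón señor"
print(" ".join(capitalized_words(s)))
```

Because the regular expression only describes "a run of letters", the uppercase/lowercase decision is left to Python's own Unicode-aware string methods, so no list of accented characters has to be maintained.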