Home > database >  Regex pattern with accented characters
Regex pattern with accented characters

Time:08-09

I am trying to get the words that start with a capital letter regardless of whether it has a special character or not in the word. Currently, my pattern only gets capital letters without accents.

I don't need numbers or hyphens, just accents or special characters in the letters.

pattern = r"\b[A-Z][a-z]*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
name = re.findall(pattern, name)
name = " ".join(name)

CodePudding user response:

You need to pip install regex in your console and then use

import regex
pattern = r"\b\p{Lu}\p{Ll}*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
name = regex.findall(pattern, name)
name = " ".join(name)

Here,

  • \b - a word boundary
  • \p{Lu} - an uppercase letter
  • \p{Ll}* - zero or more lowercase letters.

CodePudding user response:

If you want to use the core "re" module in Python, one option is to add to the list all Unicode letters you expect which are not in the range A-Z. Also, add the re.UNICODE flag to findall() function to allow for UNICODE characters.

For example:

s = "Ébc Ánna Apple Cámara Corazón Señor"
name = re.findall(r"(\b[A-ZÀÁÉ]\S*\b)", s, re.UNICODE)
print(name)

Output:

['Ébc', 'Ánna', 'Apple', 'Cámara', 'Corazón', 'Señor']
  • Related