Odd little problem here,
I have this (random) sentence in Bengali : "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।"
I tried to run a regex on it (using the Python re library) like this:
- সুগঠিত ("token #4") :
re.search(r"\bসুগঠিত\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।") : <re.Match object; span=(19, 25), match='সুগঠিত'>
- কবিতা ("token #2"):
re.search(r"\bকবিতা\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।"): None
Any idea why this might be happening?
UPDATE (from answer suggestions below) :
- Check out the Diacritics used in Bengali (and other Indic languages)
CodePudding user response:
If you check which chars your কবিতা consists of (for example, with a Unicode character inspector), you will see that the last one is U+09BE, BENGALI VOWEL SIGN AA, which belongs to the Mc (Mark, Spacing Combining) Unicode category.
Note that Mc chars are not word chars in Python's re. Its \w matches Unicode letters, digits, and the underscore (in practice, the chars for which str.isalnum() is true, plus _); combining marks, whether Mn (Mark, Nonspacing) or Mc, are not included.
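The last \b in your regex comes right after the AA vowel sign, which is a non-word char. A word boundary needs a word char on exactly one side, so this \b can only match if a word char immediately follows AA. In your sentence a space follows (another non-word char), so there is no boundary and the search returns None. সুগঠিত ends with ত, an ordinary letter (Lo), so its trailing \b matches before the space.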
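The boundary behaviour described above can be checked with a small experiment using only the standard re module. This is a sketch, not a definitive test suite; the constant and variable names are chosen here for illustration.

```python
import re

AA = "\u09BE"  # BENGALI VOWEL SIGN AA, category Mc
TA = "\u09A4"  # BENGALI LETTER TA, category Lo
KOBITA = "\u0995\u09AC\u09BF\u09A4\u09BE"  # কবিতা, ends in AA

# Does re treat each character as a word character?
aa_is_word = re.fullmatch(r"\w", AA) is not None
ta_is_word = re.fullmatch(r"\w", TA) is not None

# Trailing \b after AA: a space follows (non-word on both sides of the position)
before_space = re.search(KOBITA + r"\b", KOBITA + " ")
# Trailing \b after AA: a letter follows (word char on the right side only)
before_letter = re.search(KOBITA + r"\b", KOBITA + TA)

print("AA is \\w:", aa_is_word, "| TA is \\w:", ta_is_word)
print("followed by space: ", before_space)
print("followed by letter:", before_letter)
```

The second search succeeding is the counter-intuitive part: re sees a "boundary" in the middle of what a reader would call one word.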
Thus, if you need combining marks to count as word chars for \b, use the PyPI regex library, where the issue is fixed:
Definition of 'word' character (issue #1693050)
The definition of a 'word' character has been expanded for Unicode. It conforms to the Unicode specification at http://www.unicode.org/reports/tr29/.
See the Python demo online:
import regex
print( regex.search(r"\bকবিতা\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।") )
# => <regex.Match object; span=(5, 10), match='কবিতা'>
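If installing a third-party package is not an option, one possible stdlib-only workaround is to drop \b and check the neighbouring characters yourself, treating any combining mark (category M*) as part of a word. This is only a rough sketch under that assumption; the helpers is_wordish and find_whole_word are hypothetical names, not part of re or regex, and the rule is simpler than the full UAX #29 word-boundary algorithm.

```python
import re
import unicodedata

def is_wordish(ch):
    """True for letters, digits, underscore, and any combining mark (Mn, Mc, Me)."""
    return ch == "_" or ch.isalnum() or unicodedata.category(ch).startswith("M")

def find_whole_word(word, text):
    """Return the first match of `word` not glued to a word-like neighbour, else None."""
    for m in re.finditer(re.escape(word), text):
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        if (not before or not is_wordish(before)) and (not after or not is_wordish(after)):
            return m
    return None

TINI = "\u09A4\u09BF\u09A8\u09BF"          # তিনি
KOBITA = "\u0995\u09AC\u09BF\u09A4\u09BE"  # কবিতা
text = TINI + " " + KOBITA + " x"

print(find_whole_word(KOBITA, text))       # whole word, followed by a space
print(find_whole_word(KOBITA[:-1], text))  # কবিত is followed by the AA sign
```

Unlike a plain \b, this also refuses to match a prefix such as কবিত when a vowel sign follows it.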