I'm trying to solve this with regex. I need to remove any words that contain the Arabic vowels [و ا ي], with two extra rules: a vowel at the beginning of a word should not count as a match, and a vowel that comes right after the prefix [ال] should not count either. How can I express these rules in a regex?
This is a sample of my file:
تفقدت نظارتي حين استيقظت صباحا فلم اجدها في مكانها وبحثت عنها في كل مكان دون ان اعثر لها علي اثر يا الهي كيف ساخرج اليوم من البيت واواجه النهار
وتناهي الي من الخارج صوت نقار الخشب فوق جذع شجره قريبه فاسرعت الي الباب وفتحته واذا ضوء النهار يبهر بصري فاغلقت عيني وهتفت ايها النقار اين انت
وحاولت عبثا ان افتح عيني وانا اقول عفوا لا استطيع ان افتح عيني ان الضوء يعميني
This is my code so far:
result_novowels = re.findall(r'\b(?:(?![اوي]))+\b', text_norm, re.I)
print(result_novowels)
This is the output:
['', '', '', '', '', '', '', '', '', '', '', '']
I'm not very experienced with regex and I've been unable to find anything about how to do this online, so it'd be awesome if someone with more experience could help me out. Thanks!
CodePudding user response:
EDIT: The simple solution proposed by the asker is:
\b[^اويى\W]+\b
Apparently the general form of a Unicode character also matches its other contextual forms (at least for Arabic in Python).
The pattern that follows is not complete, but you can add the wanted and unwanted Unicode code points to it.
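As a quick check of the simple pattern, here is a minimal sketch that runs it over part of the first sample line from the question. The names `NO_VOWEL_WORD`, `sample`, and `matches` are made up for illustration, and the input is assumed to be already normalized (no diacritics), as the `text_norm` name in the question suggests.

```python
import re

# Negated class: any word character except the vowels ا و ي (and ى).
# Note that digits and underscores also pass this class.
NO_VOWEL_WORD = re.compile(r'\b[^اويى\W]+\b')

# Part of the first sample line from the question (assumed normalized).
sample = 'تفقدت نظارتي حين استيقظت صباحا فلم'

matches = NO_VOWEL_WORD.findall(sample)
print(matches)  # ['تفقدت', 'فلم']
```

The two `\b` anchors force the whole word to consist of non-vowel characters, so a word containing a vowel anywhere is skipped rather than partially matched.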
\b((?:(?!(\u0627|\u064A|\u0648))(?:[\u0600-\u06FF]))+?)\b
So we first ensure that the left boundary of the word is either \s or the start of the string \A, and consume the \s if one exists. Then we capture a run of Unicode characters [\u0600-\u06FF] as long as they are not vowels (?!(\u0627|\u064A|...)), until we reach the end of the word, detected by either \s or the end of the string $.
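The idea described above can be sketched in Python roughly as follows. This is a variant, not the exact pattern from the answer: it uses `\b` for both word edges and only non-capturing groups, so that `re.findall` returns plain strings instead of tuples. The names `VOWELS`, `PATTERN`, `sample`, and `matches` are hypothetical, and the vowel list contains only the three code points named in the answer.

```python
import re

# The three vowels from the answer: ا (U+0627), ي (U+064A), و (U+0648).
# Extend this string with any other code points that should count as vowels.
VOWELS = '\u0627\u064A\u0648'

# For each character: refuse it if it is a vowel (negative lookahead),
# otherwise accept it if it lies in the Arabic block U+0600..U+06FF.
PATTERN = re.compile(r'\b(?:(?![' + VOWELS + r'])[\u0600-\u06FF])+\b')

# Part of the second sample line from the question.
sample = 'صوت نقار الخشب فوق جذع شجره قريبه'

matches = PATTERN.findall(sample)
print(matches)  # ['جذع', 'شجره']
```

Keeping the vowels in a separate string makes it easy to add more code points later without touching the rest of the pattern.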
add all vowels from here