Regex: How to match words without Arabic vowels?

Time:10-10

I'm trying to solve this with regex. I have to remove any words that contain the Arabic vowels [و ا ي], with two extra rules: a vowel at the beginning of a word should not count as a match, and neither should a vowel that comes right after the pattern [ال]. How can I apply these rules in a regex?

This is a sample of my file

تفقدت نظارتي  حين استيقظت صباحا  فلم اجدها في مكانها  وبحثت عنها في كل مكان  دون ان اعثر لها علي اثر  يا الهي  كيف ساخرج اليوم من البيت  واواجه النهار  
 وتناهي الي من الخارج  صوت نقار الخشب  فوق جذع شجره قريبه فاسرعت الي الباب  وفتحته  واذا ضوء النهار يبهر بصري  فاغلقت عيني  وهتفت  ايها النقار  اين انت  
 وحاولت عبثا ان افتح عيني  وانا اقول  عفوا  لا استطيع ان افتح عيني  ان الضوء يعميني

This is my code so far,

result_novowels = re.findall(r'\b(?:(?![اوي]))+\b', text_norm, re.I)
print(result_novowels)

This is the output

['', '', '', '', '', '', '', '', '', '', '', '']

I'm not very experienced with regex and I've been unable to find anything about how to do this online, so it'd be awesome if someone with more experience could help me out. Thanks!

CodePudding user response:

EDIT: The simple solution proposed by questioner is:

\b[^اويى\W]+\b
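
As a minimal sketch of how that pattern could be used from Python (the sample string and the names VOWELS, pattern and words_without_vowels are made up for illustration, and the vowel set is simply the one from the character class above):

```python
import re

# Letters treated as vowels here: alef, waw, yeh and alef maksura,
# the same set as in the character class above.
VOWELS = "اويى"

# One or more characters that are neither a vowel nor a non-word
# character, bounded on both sides by word boundaries.
pattern = re.compile(rf"\b[^{VOWELS}\W]+\b")

# Short illustrative sample taken from the first line of the question's text.
sample = "تفقدت نظارتي حين استيقظت صباحا فلم اجدها في مكانها"

words_without_vowels = pattern.findall(sample)
print(words_without_vowels)
```

Words such as تفقدت and فلم should be kept, while any word containing one of the listed letters is skipped as a whole, because \b cannot match in the middle of a word.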

Apparently, matching the general form of a Unicode character also captures its other contextual forms (at least for Arabic in Python).

The pattern below is not complete, but you can extend it with the Unicode code points you want to allow or exclude.

\b((?:(?!(\u0627|\u064A|\u0648))(?:[\u0600-\u06FF]))+?)\b

So we first anchor the left edge of the word (with \b here; matching \s or the start of the string \A would also work). Then we consume characters from the Arabic block [\u0600-\u06FF] one at a time, as long as each one is not a vowel, which the negative lookahead (?!(\u0627|\u064A|...)) checks, until we reach the end of the word (\b here, or equivalently \s or the end of the string $).
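
A rough sketch of running this pattern in Python, assuming the vowel set is just alef (U+0627), waw (U+0648) and yeh (U+064A). The sample string and variable names are illustrative only. finditer is used instead of findall because the pattern contains capturing groups, which would make findall return tuples:

```python
import re

# Each character must be in the Arabic block and must not be one of the
# assumed vowels: alef (U+0627), yeh (U+064A), waw (U+0648).
pattern = re.compile(
    r"\b((?:(?!(\u0627|\u064A|\u0648))(?:[\u0600-\u06FF]))+?)\b"
)

# Illustrative sample taken from the second line of the question's text.
sample = "صوت نقار الخشب فوق جذع شجره قريبه"

# group(0) is the whole matched word.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)
```

Here جذع and شجره should come back, since every other word in the sample contains at least one of the three letters.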

Add any other vowel code points you need to the negative lookahead.
