Extract range of Arabic letters in Python

I have a string of data that contains digits, English text, and Arabic text, and I want to extract only the Arabic letters. The structure is a little difficult:

@service:Card Issuance
البريد يوصل 
لي البطاقة ؟
09/21/2022 @ 2:43 PM
Open conversation
#CMP_Cards_Lost
@closureReason:لم تعد هناك حاجة إلى البطاقة
@cardchoice:البطاقة 4
@cardchoice:البطاقة 2
2 more...
عادي اطلب البطاقة 
عن طريق البريد
09/21/2022 @ 2:43 PM
Open conversation
#FAQ_Request_Card_Delivery
@service:Card Delivery
شلون طريقه تحديث 
البيانات
09/21/2022 @ 2:43 PM
Open conversation
#NVG_SS_UpdatingData
@data:البيانات
احتاج احدث 
البيانات
 الشخصية
09/21/2022 @ 2:43 PM
Open conversation
#NVG_SS_UpdatingData
@data:البيانات
كيف احدث 
البيانات

I tried a few things like:

 print(' '.join(re.findall('[\u0600-\u06FF]+', str(n))))

but it doesn't work as I wanted.

The output that I want could be a list, a data frame, or another suitable structure.

"البريد يوصل لي البطاقة " , " لم تعد هناك حاجة الى البطاقة" , " شلون طريقة تحديث البيانات" , "احتاج احدث البيانات الشخصية "

and so on.

CodePudding user response：

[Note: I neither speak nor read Arabic, so my solution could be incomplete.]

Use unicodedata to keep only the characters whose Unicode name contains "ARABIC", plus any others you want to keep (here, the space).

import unicodedata

txt = """@service:Card Issuance
البريد يوصل 
لي البطاقة ؟
09/21/2022 @ 2:43 PM
Open conversation
#CMP_Cards_Lost
@closureReason:لم تعد هناك حاجة إلى البطاقة
@cardchoice:البطاقة 4
@cardchoice:البطاقة 2
2 more...
عادي اطلب البطاقة 
عن طريق البريد
09/21/2022 @ 2:43 PM
Open conversation
#FAQ_Request_Card_Delivery
@service:Card Delivery
شلون طريقه تحديث 
البيانات
09/21/2022 @ 2:43 PM
Open conversation
#NVG_SS_UpdatingData
@data:البيانات
احتاج احدث 
البيانات
 الشخصية
09/21/2022 @ 2:43 PM
Open conversation
#NVG_SS_UpdatingData
@data:البيانات
كيف احدث 
البيانات"""

# Additional characters to keep
keep = " "

origlines = txt.splitlines()
outlines = []
for ln in origlines:
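    # Keep a character if its Unicode name mentions ARABIC or it is in `keep`.
    # The "" default avoids a ValueError for unnamed characters such as tabs.
    cleaned = "".join(
        c for c in ln if "ARABIC" in unicodedata.name(c, "") or c in keep
    )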
    if cleaned and not cleaned.isspace():
        outlines.append(cleaned.strip())

for oln in outlines:
    print(oln)

This produces:

البريد يوصل
لي البطاقة ؟
لم تعد هناك حاجة إلى البطاقة
البطاقة
البطاقة
عادي اطلب البطاقة
عن طريق البريد
شلون طريقه تحديث
البيانات
البيانات
احتاج احدث
البيانات
الشخصية
البيانات
كيف احدث
البيانات
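
The output above has one entry per source line, whereas the question asks for each multi-line message as a single string. Here is a minimal sketch of one way to join them. It rests on assumptions of mine that the question does not confirm: a message is a run of consecutive free-text lines, any line without Arabic ends the run, and lines starting with @ or # are tags that become their own entry. The helper names arabic_only and group_messages are hypothetical, and the sample is a shortened excerpt of the data.

```python
import unicodedata


def arabic_only(line, keep=" "):
    """Keep characters whose Unicode name contains ARABIC, plus `keep`."""
    return "".join(
        c for c in line if "ARABIC" in unicodedata.name(c, "") or c in keep
    ).strip()


def group_messages(text):
    """Join runs of consecutive free-text Arabic lines into single strings."""
    messages, current = [], []
    for line in text.splitlines():
        cleaned = arabic_only(line)
        # Assumption: lines starting with @ or # are tags, not free text.
        is_tag = line.lstrip().startswith(("@", "#"))
        if cleaned and not is_tag:
            current.append(cleaned)  # continue the current run
            continue
        if current:  # any other line closes the run
            messages.append(" ".join(current))
            current = []
        if cleaned:  # a tag line that carries Arabic text gets its own entry
            messages.append(cleaned)
    if current:
        messages.append(" ".join(current))
    return messages


sample = """@service:Card Issuance
البريد يوصل 
لي البطاقة ؟
09/21/2022 @ 2:43 PM
@data:البيانات
احتاج احدث 
البيانات
 الشخصية"""

messages = group_messages(sample)
for msg in messages:
    print(msg)
```

If the tag values (such as the @data lines) are not wanted, dropping the last `if cleaned:` branch leaves only the joined free-text messages.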

CodePudding user response：

Your regex fails to include spaces.

print(re.findall(r'(?!\s)[\s\u0600-\u06FF]+', str(n)))

You haven't revealed what n is; I'm guessing you can probably take out the str() too.

The (?!\s) lookahead is a minor tweak to avoid having the match start on a newline or other whitespace.

Demo: https://ideone.com/P7MXEw
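
For a self-contained check of the regex approach, here is a small sketch. The sample string is a shortened excerpt of the question's data, and collapsing the inner whitespace with split/join is my own addition rather than part of the answer above. Because \s is inside the character class, a match can run across line breaks, so consecutive Arabic lines come back as one string.

```python
import re

# Shortened excerpt of the question's data (an assumption for illustration).
sample = "@service:Card Issuance\nالبريد يوصل \nلي البطاقة ؟\n09/21/2022 @ 2:43 PM"

# (?!\s) stops a match from starting on whitespace; the class then consumes
# characters from the Arabic block (U+0600 to U+06FF) and any whitespace
# between them, including newlines.
pattern = re.compile(r"(?!\s)[\s\u0600-\u06FF]+")

# Collapse runs of spaces and newlines inside each match to single spaces.
matches = [" ".join(m.split()) for m in pattern.findall(sample)]
print(matches)
```

Note that the range U+0600 to U+06FF also covers Arabic punctuation and Arabic-Indic digits, which is why the question mark ؟ is kept.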