How to clean non-Arabic letters from a text file in Python?

UPDATE: I am very new to Python. How can I clean a text of everything but Arabic letters? I tried a regex substitution, but without success.

This is my code

# load text
filename = '/content/drive/MyDrive/Colab Notebooks/ArabicKidsStories.txt'
file = open(filename,'rt')
text = file.read()
file.close()
import re
text = re.sub('([@A-Za-z0-9_]+)|[^\w\s]|#|http\S+', '', text) # cleaning up
print(text)

This is a sample of the output

 تفقدت نظارتي  حين استيقظت صباحا  فلم أجدها في مكانها  وبحثت عنها في كل مكان  دون أن أعثر لها على أثر  يا إلهي  كيف سأخرج اليوم من البيت  وأواجه النهار  
 وتناهى إلي من الخارج  صوت نقار الخشب  فوق جذع شجرة قريبة فأسرعت إلى الباب  وفتحته  وإذا ضوء النهار يبهر بصري  فأغلقت عيني  وهتفت  أيها النقار  أين أنت  
 وحاولت عبثا أن أفتح عيني  وأنا أقول  عفوا  لا أستطيع أن أفتح عيني  إن الضوء يعميني  
 فقال نقار الخشب  هذا طبيعي  يا عزيزتي  فأنت لم تضعي نظارتك الشمسية  
 وتراجعت قليلا  وقلت  لقد اختفت نظارتي  
 فتساءل نقار الخشب  اختفت  ماذا تقولين  
 وبدل أن أجيبه  قلت  أرجوك  ابحث لي عن نظارتي  إنني لا أستطيع الخروج من دونها  
 ولاذ نقار الخشب لحظة  ثم قال  حسن  ابقي أنت في البيت  وسأبحث لك أنا عنها  
 ومضى نقار الخشب  فأغلقت الباب والنافذة  وقبعت في الظلام  يا للغرابة  إنني أرى في الليل أيضا  أوه  كلا  إنني أحب النهار  وأحبذ أن أطير دوما في النور مع رفاقي  إنني لا أحب الليل  ولا أريد أن يكون الظلام عالمي  ترى أين اختفت هذه النظارة اللعينة  
    
 ـــــــــــــ 
 عاد نقار الخشب متعبا  قبل المساء  وقال لي  آسف  يا عزيزي  سألت عن نظارتك الطيور جميعا  لكن أحدا منهم لم يرها  
 فأطرقت برأسي برهة  ثم قلت  أشكرك  يا عزيزي  سأبحث عنها بنفسي ليلا  
 واتسعت عينا نقار الخشب دهشة  وقال  ليلا  
 وقبل أن أجيبه  مضى على عجل  وهو يقول  عفوا  صغاري ينتظرونني الآن  إلى اللقاء

Any help will be appreciated. Thanks in advance.

CodePudding user response：

Try this:

text = re.sub('[a-zA-Z0-9_]|#|http\S+', '', text)

I just removed the [^\w\s] part. The pattern now removes all ASCII letters, digits, and underscores without removing the Arabic text.
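One caveat with a pattern like this: alternatives are tried left to right, so the letters of a URL are consumed one by one by the character class before http\S+ ever gets a chance, and the leftover punctuation (such as ://.) stays in the text. A minimal sketch with the URL alternative moved to the front; the sample string is invented for illustration and is not taken from the asker's file:

```python
import re

# URL alternative first, so a whole link is removed before the
# single-character class can consume its letters one at a time.
url_first = re.compile(r'http\S+|[a-zA-Z0-9_]|#')

# Same alternatives in the order used in the answer, for comparison.
class_first = re.compile(r'[a-zA-Z0-9_]|#|http\S+')

# Made-up sample line (assumption, not from ArabicKidsStories.txt).
sample = 'Story_1 #kids تفقدت https://example.com نظارتي'

cleaned = url_first.sub('', sample)
leftover = class_first.sub('', sample)

print(cleaned.split())   # only the Arabic words remain
print(leftover.split())  # the URL punctuation survives here
```

If the text contains no links, the two orderings behave the same.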

CodePudding user response：

As far as I understand, you just want to remove the non-Arabic letters (so characters like 1, @, and ? will not be deleted).

If you want another character to be deleted, just add it to charstodelete.

If you have any questions, let me know.

charstodelete = 'azertyuiopqsdfghjklmwxcvbn'
filename = '/content/drive/MyDrive/Colab Notebooks/ArabicKidsStories.txt'
file = open(filename,'r')
text = file.read()
file.close()
output_text = ''

# It's all about this: copy every character except the listed letters
for char in text:
    if char in charstodelete or char in charstodelete.upper():
        continue
    else:
        output_text += char


outputfile = open('/content/drive/MyDrive/Colab Notebooks/output.txt','w')
outputfile.write(output_text)
outputfile.close()

EDIT: Regular expressions are a bit of a pain to get used to if you are a beginner, as you said. I recommend using code like this instead of a regex.
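The same idea can also be written without an explicit loop. This is only a sketch of an alternative, not what the code above does internally: it uses str.maketrans and str.translate from the standard library, with string.ascii_letters standing in for the hand-typed letter list. The sample string is made up for illustration.

```python
import string

# Map every ASCII letter (lower and upper case) to None,
# so translate() deletes it from the string.
delete_table = str.maketrans('', '', string.ascii_letters)

# Made-up sample (assumption); digits and symbols are left alone,
# just as in the loop version.
sample = 'abc تفقدت XYZ نظارتي 1 @ ?'

output_text = sample.translate(delete_table)
print(output_text)
```

To delete more characters, append them to the third argument of str.maketrans, for example string.ascii_letters + string.digits.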

CodePudding user response：

Your regular expression has two problems. First, the "@" symbol must be inside the [range], not outside it. Second, you have the wrong type of dash between 0 and 9. Here is a corrected expression that works:

'([@A-Za-z0-9_ـــــــــــــ]+)|[^\w\s]|#|http\S+'
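To illustrate the point about the dash: inside a character class only the plain ASCII hyphen creates a range. The sketch below assumes the pasted pattern contained an en dash (U+2013), which is a guess since the text shown here no longer reveals which character it was.

```python
import re

# Assumption: the "wrong dash" was an en dash (U+2013). Inside [...] it is
# just a literal character, so this class means "0", "–", or "9".
en_dash_class = re.compile('[0\u20139]')

# The ASCII hyphen-minus makes a real range covering all ten digits.
hyphen_class = re.compile('[0-9]')

digits = '0123456789'
en_dash_hits = en_dash_class.findall(digits)
hyphen_hits = hyphen_class.findall(digits)

print(en_dash_hits)
print(hyphen_hits)
```

Pasting code from a word processor or a web page is a common way for such look-alike dashes to sneak in.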

CodePudding user response：

Instead of trying to remove non-Arabic characters, we can find Arabic characters by their character codes. The Arabic Unicode block covers code points 0x0600 to 0x06FF.

Here is that as a regular expression to find all of the words:

import re

arabic_words = re.findall('[\u0600-\u06ff]+', input_text)

print(arabic_words)

The expression here uses a character range followed by "one or more" (the +). This should give you a list of the words.

If the text above is arranged into sentences, you could do something similar after splitting the text appropriately to keep the sentences together.
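If you want cleaned text rather than a list of words, the same range can be negated inside re.sub. A minimal sketch, assuming the goal is to keep Arabic characters and whitespace only. The helper name keep_arabic and the sample string are made up for illustration. Note that the block 0x0600 to 0x06FF also contains Arabic punctuation, Arabic-Indic digits, and the tatweel (U+0640) that makes up the "ـــــ" separator line in the sample output, so the sketch drops the tatweel explicitly.

```python
import re

# Anything that is neither in the Arabic block (U+0600-U+06FF) nor whitespace.
NON_ARABIC = re.compile(r'[^\u0600-\u06FF\s]+')
TATWEEL = '\u0640'  # elongation character used in separator lines


def keep_arabic(text):
    """Hypothetical helper: drop non-Arabic characters, then tidy spacing."""
    # Replace with a space so neighbouring words do not get glued together.
    text = NON_ARABIC.sub(' ', text)
    text = text.replace(TATWEEL, '')
    # Collapse runs of spaces/tabs left behind; newlines are kept as they are.
    return re.sub(r'[ \t]+', ' ', text).strip()


# Made-up sample (assumption, not from the asker's file).
sample = 'Hello 123 تفقدت نظارتي! #tag http://example.com حين استيقظت'
print(keep_arabic(sample))
```

Arabic punctuation such as the Arabic comma or question mark would still pass through; narrowing the range to letters only is possible but depends on which letters and diacritics your text uses.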