Extract Latin numbers from Arabic text

Time:04-27

I have a substring of Arabic text with Latin numerals in it such as:

text = قيمة بيع الدولار 550

I need to extract the number from the text, but I'm struggling to find a regex that works. I think the fact that the numbers run left to right while the letters run right to left is causing me issues. I'm admittedly not well versed in regex, so I'm hoping there's just a trick I'm missing. Here are a couple of things I've tried:

re.findall(r'قيمة بيع الدولار \d+', text)
re.findall(r'\d+ قيمة بيع الدولار', text)

Both of these return empty lists.

If I simply search with re.findall(r'\d+', text), it does successfully return a list of all the numbers in the text, so I'm pretty sure the issue has to do with searching for BOTH Arabic and Latin in the same string.

The full text of what I'm searching looks something like this below, so if I search ONLY for the numbers, it returns stuff I don't need/want. I also need to be able to differentiate between the numbers that are identified as "الدولار" versus "اليورو". There are NO newline characters in the text.

Text =    "ها هي قيم العملة يوم 4/2/2022 الساعة 9:00:
    
    قيمة بيع الدولار 550
    قيمة بيع اليورو 600
    قيمة شراء الدولار 700
    قيمة شراء اليورو 701"

x = re.findall(r'\d+', text)

returns

x = ['4', '2', '2022', '9', '00', '550', '600', '700', '701']

Edit: In this case, I do NOT want to have a list with 4, 2, 2022, 9, 00. I can usually count on the numbers I do want to be in the same order, but not always. I also need to be certain which number is associated with which set of text because the text contains information about what currency the number is for (roughly translated the first line is "the value to sell the dollar is 550")

CodePudding user response:

Definitely understand what you're saying with mixing right-to-left and left-to-right.

The below seems to work (highlighting bugged, but matches on the right are as expected). Since you want to differentiate them anyway, how about 2 separate regexes?

الدولار.(\d+)
اليورو.(\d+)

https://regex101.com/r/hXYNk2/1
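For reference, here is a minimal Python sketch of how those two patterns could be applied with re.findall. The sample string is rebuilt from the question with spaces instead of line breaks (the question says there are no newlines), and the variable names dollar_values and euro_values are just illustrative. Note that Python stores the string in logical (reading) order, so the Arabic word comes first in the pattern and the digits after it, regardless of how the editor displays the line.

```python
import re

# Sample text from the question, joined with single spaces
# (assumption: the real text uses ordinary spaces between the entries).
text = (
    "ها هي قيم العملة يوم 4/2/2022 الساعة 9:00: "
    "قيمة بيع الدولار 550 "
    "قيمة بيع اليورو 600 "
    "قيمة شراء الدولار 700 "
    "قيمة شراء اليورو 701"
)

# One pattern per currency; the "." matches the space before the number,
# and the capture group keeps only the digits.
dollar_values = re.findall(r'الدولار.(\d+)', text)
euro_values = re.findall(r'اليورو.(\d+)', text)

print(dollar_values)  # ['550', '700']
print(euro_values)    # ['600', '701']
```

Because the date and time are not preceded by either currency word, they are not picked up by these patterns.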

CodePudding user response:

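# Split the text into whitespace-separated "words"
s = Text.split()
# Keep only the words made up entirely of numeric characters
numl = [num for num in s if num.isnumeric()]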

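This splits the text on whitespace into a list of "words", then builds a list of only the words that are numbers, so you don't have to use regex. Tokens like 4/2/2022 and 9:00: are skipped because they contain non-numeric characters.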
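Since the question also needs to know which number belongs to which currency, the split approach could be extended along these lines. This is only a sketch under assumptions: it takes for granted that every wanted number is directly preceded by two words (the action, then the currency), as in the sample, and the names tokens and rates are made up for illustration.

```python
# Sample text from the question, joined with single spaces
# (assumption: the real text is separated by ordinary whitespace).
Text = (
    "ها هي قيم العملة يوم 4/2/2022 الساعة 9:00: "
    "قيمة بيع الدولار 550 "
    "قيمة بيع اليورو 600 "
    "قيمة شراء الدولار 700 "
    "قيمة شراء اليورو 701"
)

tokens = Text.split()

# Map (action, currency) -> number, using the two words before each number.
rates = {}
for i, tok in enumerate(tokens):
    if tok.isnumeric() and i >= 2:
        rates[(tokens[i - 2], tokens[i - 1])] = tok

for (action, currency), value in rates.items():
    print(action, currency, value)
```

With the sample text this yields four entries, for example the key ("بيع", "الدولار") maps to "550". Keep in mind that isnumeric() also accepts other Unicode numerals, such as Arabic-Indic digits, which may or may not be what you want.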
