Print part of the text

I have a variable called test that holds a string mixing two languages, for example:

test = "Hello World سلام دنیا"

I want to print only the parts that are written in Farsi.

I don't think I can use a regex, because the sentence is arbitrary and the number of words is unknown.

`a = "Hello سلام".replace("H","").replace("e","").replace("l","").replace("o","")

print(a)`

CodePudding user response：

You can use the numeric code of each character to tell English letters apart from other alphabets. ord(character) returns the character's Unicode code point. For the English alphabet this is the same as the ASCII value (below 128), while Farsi letters have much higher values.
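As a rough sketch of this idea, assuming that any word made up entirely of characters above the ASCII range counts as "not English" (this would also match Arabic, Chinese, and so on, so it is not a real Farsi test). The helper name non_ascii_words is made up for illustration:

```python
def non_ascii_words(text):
    """Return the words whose characters all lie outside the ASCII range (> 127)."""
    return [word for word in text.split() if all(ord(c) > 127 for c in word)]


test = "Hello World سلام دنیا"

# ord() gives the Unicode code point of a single character
print(ord("H"), ord("س"))  # 72 1587

# Join the surviving words back into one string
print(" ".join(non_ascii_words(test)))
```

Splitting on whitespace first means the number of words does not matter, and no regex is needed.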

CodePudding user response：

The challenge here is how best to determine whether a word consists entirely of characters that are valid in Farsi (Persian).

There are 106 valid characters in Persian. Many are common to other languages.

The characters can be represented by the following set:

farsi_characters = {32, 33, 36, 37, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 61, 62, 91, 92, 93, 94, 95, 124, 160, 169, 171, 187, 1545, 1548, 1563, 1567, 1569, 1570, 1571, 1572, 1574, 1575, 1576, 1577, 1578, 1579, 1580, 1581, 1582, 1583, 1584, 1585, 1586, 1587, 1588, 1589, 1590, 1591, 1592, 1593, 1594, 1601, 1602, 1604, 1605, 1606, 1607, 1608, 1611, 1612, 1613, 1617, 1620, 1642, 1643, 1644, 1662, 1670, 1688, 1705, 1711, 1740, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784, 1785, 8206, 8208, 8209, 8230, 8240, 8249, 8250, 8364, 8722}

This means that the problem can be partially solved as follows:

def is_farsi(word):
    return all(ord(c) in farsi_characters for c in word)

test = "Hello World سلام دنیا"

for word in test.split():
    if is_farsi(word):
        print(word)

Output:

سلام
دنیا

Note:

The problem here is ambiguity. What if we have:

test = "Hello 123 World سلام دنیا"

Then the output would be:

123
سلام
دنیا

Why? Because the Western Arabic numerals 0-9 (code points 48-57) are in the set: they are also used in Farsi, in addition to ۰۱۲۳۴۵۶۷۸۹

You could consider removing values lower than 1545 from the set. This will eliminate many of the characters that are common to other languages, including the digits 0-9 and the ASCII punctuation.
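A minimal sketch of that suggestion, reusing the same set and keeping only code points from 1545 upwards. The names strict_farsi and is_strictly_farsi are made up for illustration:

```python
farsi_characters = {
    32, 33, 36, 37, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
    53, 54, 55, 56, 57, 58, 60, 61, 62, 91, 92, 93, 94, 95, 124, 160, 169,
    171, 187, 1545, 1548, 1563, 1567, 1569, 1570, 1571, 1572, 1574, 1575,
    1576, 1577, 1578, 1579, 1580, 1581, 1582, 1583, 1584, 1585, 1586, 1587,
    1588, 1589, 1590, 1591, 1592, 1593, 1594, 1601, 1602, 1604, 1605, 1606,
    1607, 1608, 1611, 1612, 1613, 1617, 1620, 1642, 1643, 1644, 1662, 1670,
    1688, 1705, 1711, 1740, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783,
    1784, 1785, 8206, 8208, 8209, 8230, 8240, 8249, 8250, 8364, 8722,
}

# Drop everything below 1545: ASCII digits, ASCII punctuation, Latin-1 symbols
strict_farsi = {cp for cp in farsi_characters if cp >= 1545}


def is_strictly_farsi(word):
    """True if every character of word is in the reduced set."""
    return all(ord(c) in strict_farsi for c in word)


test = "Hello 123 World سلام دنیا"

for word in test.split():
    if is_strictly_farsi(word):
        print(word)
```

With this reduced set, 123 is no longer printed. The trade-off is that a Farsi word with ASCII punctuation attached, such as a trailing "!", would now be rejected as well.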