Finding unknown non-English characters in a text file (Python)

Suppose we load a text file with:

file = open('my_file.txt',mode='r')
stg = file.read()

This file contains some unknown non-English characters, which may take different forms such as Á, î, Ç, etc. How can I extract these characters together with their locations in the text file? The output should be a list of these characters with their locations (line numbers).

CodePudding user response：

Assuming you want to find every character that is not an English letter, digit, punctuation mark, or backslash, you can use the following code to get the position (character offset in the string) and value of each match:

import re, string
[(match.start(0), match.group()) for match in re.finditer(f'[^a-zA-Z0-9{re.escape(string.punctuation)}]', stg)]

Using the example string

ÁbxcsdasîîîîîîîîîîîîÇÇadasda/.1.32131.!#@%$%&*^()|\}}"?>:{}?><<"

It will return

[(0, 'Á'), (8, 'î'), (9, 'î'), (10, 'î'), (11, 'î'), (12, 'î'), (13, 'î'), (14, 'î'), (15, 'î'), (16, 'î'), (17, 'î'), (18, 'î'), (19, 'î'), (20, 'Ç'), (21, 'Ç')]
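
The question asks for line numbers rather than offsets into the whole string. A minimal sketch of one way to get them is to count the newlines that precede each match. The helper name `find_non_english` and the sample text are hypothetical, and space, tab, and newline characters are added to the allowed set (an assumption beyond the pattern above) so that ordinary whitespace is not reported.

```python
import re
import string

# Allowed: ASCII letters, digits, punctuation, plus space, tab, CR and LF.
# The whitespace characters are an addition to the pattern shown above.
PATTERN = re.compile(f"[^a-zA-Z0-9{re.escape(string.punctuation)} \\t\\r\\n]")

def find_non_english(text):
    """Return (line_number, character) pairs, with 1-based line numbers."""
    results = []
    for match in PATTERN.finditer(text):
        # The line number is one more than the count of newlines before the match.
        line_number = text.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group()))
    return results

sample = "abc\nÁ b\nxyz Ç î\n"
print(find_non_english(sample))  # [(2, 'Á'), (3, 'Ç'), (3, 'î')]
```

Counting newlines for every match rescans the text each time, so for very large files with many matches it may be preferable to iterate line by line instead.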

CodePudding user response：

This is the code I used in one of my projects. It treats ASCII punctuation and special characters as English, so only non-ASCII characters are reported.

def isEnglishChar(s):
    # A character counts as English here if it is plain ASCII.
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

with open('test.txt', mode='r') as file:
    lines = file.readlines()

for index, value in enumerate(lines):
    for ch in value:
        if not isEnglishChar(ch):
            # Print the character and its 1-based line number.
            print(ch, index + 1)
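
On Python 3.7 or later, `str.isascii()` performs the same check without the encode/decode round trip. The following is a minimal sketch under that assumption; the function name `non_ascii_chars` and the sample lines are illustrative and not part of the answer's project code.

```python
def non_ascii_chars(lines):
    """Yield (character, line_number) for every non-ASCII character."""
    for line_number, line in enumerate(lines, start=1):
        for ch in line:
            # str.isascii() is True only for code points 0 through 127.
            if not ch.isascii():
                yield ch, line_number

# A list of strings stands in for the result of file.readlines().
sample_lines = ["plain ascii\n", "café naïve\n"]
for ch, line_number in non_ascii_chars(sample_lines):
    print(ch, line_number)
```

Because the function is a generator, it can also be fed an open file object directly, which avoids reading the whole file into memory.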

CodePudding user response：

ASCII characters have Unicode code points between 0 and 127. Any character with a code point greater than 127 is not ASCII.

with open(filename) as fp:
    for lineno, line in enumerate(fp, start=1):
        for ch in line:
            if ord(ch) > 127:
                print(lineno, ch)
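
Since the characters are described as unknown, it can help to report what each one is as well as where it sits. A minimal sketch, assuming the standard-library `unicodedata` module is acceptable here: the function name `describe_non_ascii`, the `"UNKNOWN"` default, and the sample text are hypothetical choices for illustration.

```python
import io
import unicodedata

def describe_non_ascii(fp):
    """Return (line_number, column, character, unicode_name) tuples."""
    found = []
    for lineno, line in enumerate(fp, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # Some code points have no name, so a default is supplied
                # instead of letting unicodedata.name raise ValueError.
                found.append((lineno, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return found

# io.StringIO stands in for an opened text file here.
sample = io.StringIO("first line\nforms like Á and î\n")
for row in describe_non_ascii(sample):
    print(row)
```

Both the line number and the column are 1-based, matching the `start=1` convention used in the loop above.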

CodePudding user response：

with open("testfile.txt", 'w') as f_out:
    test_text= '''
    This file contains some non-English unknown characters. 
    These characters may have different forms like Á, 
    î, Ç, etc. How can I extract these characters with their location in the text file
    '''
    f_out.write(test_text)
with open("testfile.txt") as fp:
    for lineno, line in enumerate(fp, start=1):
        ch_count = 0  # 1-based position of the character within the line
        for ch in line:
            ch_count += 1
            if ord(ch) > 127:
                print(f'{lineno=}\tCharacter Number={ch_count}\t {ch=}')

Output

lineno=3    Character Number=52  ch='Á'
lineno=4    Character Number=5   ch='î'
lineno=4    Character Number=8   ch='Ç'