Why are these two non-English strings (with exactly the same appearance) different in Python?

Time:09-03

I'm reading a CSV file with pandas' pd.read_csv. I want to drop every column except Mã NPP. However, the string Mã NPP that I type on the keyboard is not equal to the one in the dataframe's column names, so I cannot subset the dataframe. I guess the problem lies in how non-English characters are encoded.

Could you explain how to make the string I type on the keyboard identical to the one from the dataframe?

import pandas as pd

# Read only the header and the first data row.
url = 'https://raw.githubusercontent.com/leanhdung1994/data/main/DIST_INVOICE_RETURN.csv'
df = pd.read_csv(url, encoding='utf8', header=0, nrows=1)
x = 'Mã NPP'        # typed on the keyboard
y = df.columns[0]   # read from the file
print(x, '\n', y)
print(x == y)

The result is

Mã NPP 
 Mã NPP
False

CodePudding user response:

You guessed right. Consider these two characters:

a = "\u00e3"    # one code point: precomposed ã
b = "a\u0303"   # two code points: a + combining tilde
print(a == b)
print(len(a), len(b))
print(list(a), list(b))

output:

False
1 2
['ã'] ['a', '̃']

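They look identical, but they are not equal to each other.

This happens because the two strings represent the same visible character with different sequences of Unicode code points, not because of different encodings: one uses a single precomposed character, the other a base letter followed by a combining mark.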

import unicodedata
a = "\u00e3"    # precomposed ã
b = "a\u0303"   # a + combining tilde
print([unicodedata.name(c) for c in a])
print([unicodedata.name(c) for c in b])

output:

['LATIN SMALL LETTER A WITH TILDE']
['LATIN SMALL LETTER A', 'COMBINING TILDE']

You can also see this in the Decomposition field at https://www.compart.com/en/unicode/U+00E3

To avoid this, apply Unicode normalization with unicodedata.normalize(form, unistr). The Python documentation describes it as follows:

Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
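As a rough sketch of how the forms differ (the variable names here are only illustrative): NFC composes characters where possible, NFD decomposes them, and the "K" forms additionally replace compatibility characters such as the "fi" ligature.

```python
import unicodedata

composed = "\u00e3"      # LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"   # LATIN SMALL LETTER A + COMBINING TILDE

# NFC composes where possible; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(len(nfc), len(nfd))                    # 1 2
print(nfc == composed, nfd == decomposed)    # True True

# The "K" forms also apply compatibility mappings, e.g. to the "fi" ligature.
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", ligature) == ligature)   # True
print(unicodedata.normalize("NFKC", ligature) == "fi")      # True
```

For comparing column names like the one in the question, either NFC or NFD should work, as long as both sides use the same form.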

After normalizing both strings to the same form, you can compare them without problems:

import unicodedata

a = "\u00e3"    # precomposed ã
b = "a\u0303"   # a + combining tilde
normalized_a = unicodedata.normalize("NFD", a)
normalized_b = unicodedata.normalize("NFD", b)
print(normalized_a, normalized_b)
print(normalized_a == normalized_b)

You may want to wrap this in a helper function of your own and use it instead of ==:

import unicodedata

a = "M\u00e3 NPP"    # precomposed ã
b = "Ma\u0303 NPP"   # a + combining tilde
print(list(a))
print(list(b))

def compare_strings(s1, s2):
    s1_normalized = unicodedata.normalize("NFD", s1)
    s2_normalized = unicodedata.normalize("NFD", s2)
    return s1_normalized == s2_normalized

print(a == b)
print(compare_strings(a, b))

output:

['M', 'ã', ' ', 'N', 'P', 'P']
['M', 'a', '̃', ' ', 'N', 'P', 'P']
False
True
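Applied to the original question, one option is to normalize the column names once and then look the column up by a normalized key. The sketch below uses a plain list in place of df.columns so that it runs without pandas; normalize_names and the sample column names are hypothetical, not taken from the CSV file.

```python
import unicodedata

def normalize_names(names, form="NFC"):
    """Return the names with each string converted to the given normal form."""
    return [unicodedata.normalize(form, name) for name in names]

# Stand-in for df.columns: the first name uses a + combining tilde.
columns = ["Ma\u0303 NPP", "Other"]

# The name as typed on the keyboard (precomposed), normalized the same way.
wanted = unicodedata.normalize("NFC", "M\u00e3 NPP")

normalized = normalize_names(columns)
print(wanted in columns)      # False
print(wanted in normalized)   # True
```

With a real dataframe, assigning the result back (df.columns = normalize_names(df.columns)) and then selecting df[[wanted]] should keep only that column, assuming the header differs from the typed string only by normalization form.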

Credit: this example comes from this article. Its author goes one step further and compares strings case-insensitively, in case you are interested.
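A case-insensitive variant could look roughly like the following. This is a sketch and not necessarily the article's code: compare_caseless is a name chosen here, and it combines NFD normalization with str.casefold().

```python
import unicodedata

def nfd(s):
    """Decompose s into base characters and combining marks."""
    return unicodedata.normalize("NFD", s)

def compare_caseless(s1, s2):
    # Normalize, casefold, then normalize again, because casefolding
    # can produce strings that are no longer in normal form.
    return nfd(nfd(s1).casefold()) == nfd(nfd(s2).casefold())

print(compare_caseless("M\u00e3 NPP", "ma\u0303 npp"))   # True
print(compare_caseless("M\u00e3 NPP", "Ma NPP"))         # False
```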
