I'm reading a CSV file with pandas' pd.read_csv. I want to remove all columns except Mã NPP. However, the string Mã NPP that I type on the keyboard is not the same as the one in the dataframe's column names, so I cannot subset the dataframe. I guess this problem is due to how non-English characters are encoded.
Could you explain how to make the string I type on the keyboard identical to the one in the dataframe?
import pandas as pd
df = pd.read_csv(r'https://raw.githubusercontent.com/leanhdung1994/data/main/DIST_INVOICE_RETURN.csv', encoding = 'utf8', header = 0, nrows = 1)
x = 'Mã NPP'
y = df.columns[0]
print(x, '\n', y)
print(x == y)
The result is
Mã NPP
Mã NPP
False
CodePudding user response:
You guessed right. Consider these two characters:
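a = "\u00e3"   # 'ã' as a single precomposed code point
b = "a\u0303"  # 'ã' as the letter 'a' followed by a combining tilde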
print(a == b)
print(len(a), len(b))
print(list(a), list(b))
output:
False
1 2
['ã'] ['a', '̃']
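They look identical, but they are not equal to each other.
This happens because the two strings use different sequences of Unicode code points to represent the same visible character: one is a single precomposed code point, the other is a base letter followed by a combining mark.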
import unicodedata
a = "\u00e3"   # precomposed form
b = "a\u0303"  # decomposed form
print([unicodedata.name(c) for c in a])
print([unicodedata.name(c) for c in b])
output:
['LATIN SMALL LETTER A WITH TILDE']
['LATIN SMALL LETTER A', 'COMBINING TILDE']
You can also see this in the Decomposition field at https://www.compart.com/en/unicode/U+00E3
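The same decomposition data can be read from Python's standard library. The following is a minimal sketch; the variable names are only illustrative:

```python
import unicodedata

precomposed = "\u00e3"  # LATIN SMALL LETTER A WITH TILDE

# decomposition() returns the mapping as space-separated hex code points
mapping = unicodedata.decomposition(precomposed)
print(mapping)  # 0061 0303

# Rebuild the two-character sequence from that mapping
rebuilt = "".join(chr(int(cp, 16)) for cp in mapping.split())
print([f"U+{ord(c):04X}" for c in rebuilt])  # ['U+0061', 'U+0303']
```

So U+00E3 is canonically equivalent to U+0061 followed by U+0303, which is exactly the pair of characters seen in the second string.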
To avoid this mismatch, you need to apply Unicode normalization. The Python documentation for unicodedata.normalize(form, unistr) describes it as follows:
Return the normal form form for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
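As a small illustration of the quoted passage, the sketch below applies each of the four forms to both spellings of the cedilla example and records the resulting lengths. The variable names are only illustrative:

```python
import unicodedata

composed = "\u00c7"     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

lengths = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    n1 = unicodedata.normalize(form, composed)
    n2 = unicodedata.normalize(form, decomposed)
    # Whatever the input spelling, both results are identical within a form
    lengths[form] = (len(n1), len(n2), n1 == n2)
    print(form, lengths[form])
```

The composing forms (NFC, NFKC) yield one code point and the decomposing forms (NFD, NFKD) yield two. For a comparison, the particular form matters less than applying the same form to both strings.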
Once both strings are normalized to the same form, you can compare them without problems:
import unicodedata
a = "\u00e3"   # precomposed form
b = "a\u0303"  # decomposed form
normalized_a = unicodedata.normalize("NFD", a)
normalized_b = unicodedata.normalize("NFD", b)
print(normalized_a, normalized_b)
print(normalized_a == normalized_b)
You may want to wrap this normalization in your own helper function and use it instead of ==:
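import unicodedata
a = "M\u00e3 NPP"   # 'ã' as a single precomposed code point
b = "Ma\u0303 NPP"  # 'ã' as 'a' followed by a combining tilde
print(list(a))
print(list(b))
def compare_strings(s1, s2):
    # Normalize both strings to the same form before comparing
    s1_normalized = unicodedata.normalize("NFD", s1)
    s2_normalized = unicodedata.normalize("NFD", s2)
    return s1_normalized == s2_normalized
print(a == b)
print(compare_strings(a, b))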
output:
['M', 'ã', ' ', 'N', 'P', 'P']
['M', 'a', '̃', ' ', 'N', 'P', 'P']
False
True
Credit: the compare_strings example is taken from this article. In case you are interested, the author goes one step further and does a case-insensitive comparison between strings.
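To tie this back to the original question, one possible approach is to look up the column label that matches the typed string after normalization. This is only a sketch: find_column is a hypothetical helper, not part of pandas, and the list of labels below stands in for df.columns.

```python
import unicodedata

def find_column(columns, wanted, form="NFC"):
    """Return the label in `columns` that equals `wanted` after normalization.

    Hypothetical helper for illustration; raises KeyError if nothing matches.
    """
    target = unicodedata.normalize(form, wanted)
    for col in columns:
        if unicodedata.normalize(form, col) == target:
            return col
    raise KeyError(wanted)

# Stand-in for df.columns: the first label uses the decomposed spelling
columns = ["Ma\u0303 NPP", "Other"]

# The typed string uses the precomposed spelling
match = find_column(columns, "M\u00e3 NPP")
print(match == columns[0])  # True
```

With the dataframe from the question, df[[find_column(df.columns, x)]] should then keep only that column. Alternatively, pandas offers df.columns.str.normalize("NFC"), which normalizes the labels themselves so that a typed string normalized to the same form compares equal.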