I came across an unexpected result, and I do not understand why it happens when I use collections.Counter.
I use Python 3.8:
from collections import Counter
counter = Counter()
counter["تمباکو"] = 1
print(counter.most_common())
Output:
[('تمباکو', 1)]
According to the documentation, it should return (keyword, count) pairs.
When I write the output of counter.most_common() to a CSV file, the order of the data also appears to change:
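import csv
with open("output.csv", "w", newline="", encoding="utf-8") as f:  # hypothetical file name
    writer = csv.writer(f)
    writer.writerows(counter.most_common())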
The rows appear as (count, keyword) pairs.
But when you run:
counter.most_common()[0][0]
the output is:
'تمباکو'
so everything looks fine, because the keyword comes first.
Something is wrong, and I do not understand what.
CodePudding user response:
To elaborate on my comment:
It's not Python, it's your input.
Here's a synthetic example with a string that includes U+202E RIGHT-TO-LEFT OVERRIDE (which, humorously, affects the rendering on that linked page too).
from collections import Counter
s = "\u202Ehello"
c = Counter()
c[s] = 1
for word, count in c.most_common():
    print(word, count)
When I run this, my terminal shows the line in reversed visual order:
1 olleh
since the U+202E character overrides the rendering order.
If I remove the U+202E character, I get
hello 1
as expected.
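To confirm that only the display is affected, here is a minimal sketch that inspects the stored tuple and the Unicode bidirectional class of each character. The helper name bidi_classes is made up for this illustration; unicodedata.bidirectional is the standard-library call it relies on.

```python
import unicodedata
from collections import Counter


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    # bidi_classes is a hypothetical helper name used only in this sketch.
    return [unicodedata.bidirectional(ch) for ch in text]


c = Counter()
c["\u202Ehello"] = 1
word, count = c.most_common()[0]

# The tuple is still (keyword, count); the types show which slot is which.
print(type(word).__name__, type(count).__name__)

# Classes such as RLO (override) or AL (Arabic letter) signal that a
# bidi-aware terminal may draw the text in a different visual order.
print(bidi_classes(word))
print(bidi_classes("تمباکو"))
```

The keyword from the question contains no override character at all; its letters are themselves right-to-left, which is enough for a bidi-aware display to reorder the line.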
A way to print strings that have such override characters in a "de-fanged" way is to use repr()
(with its own caveats, of course):
for word, count in c.most_common():
    print(repr(word), count)
prints out
'\u202ehello' 1
since the offending control character is escaped.
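One of those caveats is that repr() only escapes non-printable characters, so printable right-to-left letters like the ones in the question's keyword pass through unchanged. A rough sketch of an alternative, assuming that escaped output is acceptable for debugging, uses the built-in ascii() and checks the CSV field order through an in-memory buffer (the buffer stands in for the real file and is an assumption of this example):

```python
import csv
import io
from collections import Counter

counter = Counter()
counter["تمباکو"] = 1

# ascii() escapes every non-ASCII character, so nothing is left for the
# terminal to reorder.
for word, count in counter.most_common():
    print(ascii(word), count)

# Write to an in-memory buffer instead of a real file (an assumption made
# here so the sketch has no side effects).
buffer = io.StringIO()
csv.writer(buffer).writerows(counter.most_common())
csv_text = buffer.getvalue()

# Inspect the stored characters rather than their on-screen rendering.
print(ascii(csv_text))
first_field = csv_text.split(",")[0]
print(first_field == "تمباکو")
```

If the last line prints True, the keyword is stored first in each CSV row, and the swapped appearance comes from the viewer's right-to-left rendering rather than from the data.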