Python counter unexpected result for most_common(). Is it a bug in python?

I came across an unexpected result, and I do not understand why it happens when I use collections.Counter.

I use Python 3.8.

from collections import Counter
counter = Counter()
counter["تمباکو"] += 1
print(counter.most_common())

Output:

    [('تمباکو', 1)]

According to the documentation, it should return (keyword, count) pairs.

When I write the output of counter.most_common() to a CSV file, the order of the data also appears to change:

import csv

# f is a file object already opened for writing
writer = csv.writer(f)
writer.writerows(counter.most_common())

The rows come out as (count, keyword) pairs.

But when you run:

counter.most_common()[0][0]

it will output:

'تمباکو'

and everything looks fine, because the keyword comes first.

Something is wrong and I do not understand it.

CodePudding user response：

To elaborate on my comment:

It's not Python, it's your input.
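One way to check this is to look at the data in a form the terminal cannot reorder. The following is only a sketch: it reuses the word from the question, writes the CSV rows to an in-memory buffer (an assumption made here so that no file is needed), and prints everything through ascii(), which escapes all non-ASCII characters.

```python
import csv
import io
from collections import Counter

counter = Counter()
counter["تمباکو"] += 1

pairs = counter.most_common()
first_key, first_count = pairs[0]

# ascii() escapes every non-ASCII character, so the display cannot
# rearrange the text: the key is shown first, then the count.
print(ascii(pairs))

# Write the rows to an in-memory buffer instead of a real file,
# then inspect the raw text that csv.writer produced.
buffer = io.StringIO()
csv.writer(buffer).writerows(pairs)
csv_text = buffer.getvalue()
print(ascii(csv_text))
```

In both printouts the escaped keyword comes before the count, which suggests that the stored order is (keyword, count) and only the on-screen layout differs.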

Here's a synthetic example with a string that includes U+202E RIGHT-TO-LEFT OVERRIDE (which, humorously, affects the rendering on that linked page too).

from collections import Counter

s = "\u202Ehello"
c = Counter()
c[s] += 1

for word, count in c.most_common():
    print(word, count)

When I run this, my terminal shows

‮hello 1

since the U+202E character overrides the rendering order.

If I remove the 202E character, I get

hello 1

as expected.

A way to print strings that have such override characters in a "de-fanged" way is to use repr() (with its own caveats, of course):

for word, count in c.most_common():
    print(repr(word), count)

prints out

'\u202ehello' 1

since the offending control character is escaped.
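If you want to find or drop such characters before counting, a minimal sketch using the standard unicodedata module could look like this. The helper names find_bidi_controls and strip_bidi_controls and the BIDI_CONTROL_CLASSES set are made up for this example. It only covers the explicit embedding, override, and isolate controls, not the text direction that right-to-left scripts carry on their own.

```python
import unicodedata
from collections import Counter

# Bidirectional classes of the explicit embedding, override and isolate
# controls (illustrative selection; U+202E has class "RLO").
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF",
                        "LRI", "RLI", "FSI", "PDI"}


def find_bidi_controls(text):
    """Return (index, code point, name) for each bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]


def strip_bidi_controls(text):
    """Return text with the explicit bidi control characters removed."""
    return "".join(
        ch for ch in text
        if unicodedata.bidirectional(ch) not in BIDI_CONTROL_CLASSES
    )


s = "\u202Ehello"
print(find_bidi_controls(s))

c = Counter()
c[strip_bidi_controls(s)] += 1
for word, count in c.most_common():
    print(word, count)
```

Note that stripping changes the keys, so "\u202Ehello" and "hello" would be counted together. Whether that is what you want depends on your data.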