Pandas `value_counts()` and `unique()` result in different category orders

Time:04-14

For a given column, the pandas value_counts() function counts the number of occurrences of each value in that column. The unique() function, on the other hand, returns the distinct values that occur at least once.

Now, just to give an example, take the mushroom dataset in the UCI Repository.

When I list the unique values in a particular column

df["class"].unique()

I get the output:

array(['p', 'e'], dtype=object)

However, when I count the number of occurrences

df["class"].value_counts()

I get the output:

e    4208
p    3916
Name: class, dtype: int64

Here, we can observe that the categories appear in different orders: the first output starts with 'p', whereas the second starts with 'e'. I do not understand why there is such a mismatch, as one would typically expect the same order for consistency. Is there an explanation for this, and is there a good practice for fixing it? What comes to mind initially is that I can count the occurrences with value_counts() and then, instead of calling unique(), take the index of the result. Namely:

val_counts = df["class"].value_counts()
val_unique = np.array(val_counts.index)
val_unique

Output:

array(['e', 'p'], dtype=object)

CodePudding user response:

pd.unique, np.unique, value_counts and groupby all have slightly different ordering rules. You can choose whichever one gives you the ordering you want.

import pandas as pd
import numpy as np

df = pd.DataFrame({'class': ['z', 'z', 'a', 'a', 'a', 'f', 'f', 'f', 'a', 'f', 'f']})

pd.unique

does not sort; the output is ordered by first appearance

df['class'].unique()
#array(['z', 'a', 'f'], dtype=object)

np.unique

sorts the values

np.unique(df['class'])
#array(['a', 'f', 'z'], dtype=object)

value_counts

sorts by descending count by default; pass sort=False to order by first appearance instead

df['class'].value_counts()
#f    5
#a    4
#z    2
#Name: class, dtype: int64

df['class'].value_counts(sort=False)
#z    2
#a    4
#f    5
#Name: class, dtype: int64

groupby size

sorts by label by default; pass sort=False to order by first occurrence instead

# Sorts output based on grouping keys (i.e. labels)
df.groupby('class').size()
#class
#a    4
#f    5
#z    2
#dtype: int64

# Output ordered by occurrence of grouping keys
df.groupby('class', sort=False).size()
#class
#z    2
#a    4
#f    5
#dtype: int64

In your case, you want either value_counts with sort=False or groupby size with sort=False.
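As a quick sanity check, here is a minimal sketch that compares the orderings on the toy frame from this answer. The variable names (unique_order, group_order, label_sorted) are only illustrative, and the value_counts().sort_index() line is an extra option for the case where you would rather match the sorted order of np.unique.

```python
import pandas as pd

df = pd.DataFrame({'class': ['z', 'z', 'a', 'a', 'a', 'f', 'f', 'f', 'a', 'f', 'f']})

# Order of first appearance, as returned by unique()
unique_order = list(df['class'].unique())

# groupby with sort=False keeps the groups in order of first appearance
group_sizes = df.groupby('class', sort=False).size()
group_order = list(group_sizes.index)

# Sorting the counts by label instead gives the same order as np.unique
label_sorted = df['class'].value_counts().sort_index()

print(unique_order == group_order)
print(list(label_sorted.index))
```

If both lists agree on your own data, the counts can be read off in the same order that unique() reports the categories.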

CodePudding user response:

value_counts() sorts by count in descending order by default. You can simply call

df['class'].value_counts().loc[df['class'].unique()]

to rearrange the result into the ordering from .unique().
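If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: counts_in_unique_order is a hypothetical name, and it uses reindex rather than .loc and drops missing values before calling unique(), on the assumption that you want to skip NaN the same way value_counts() does by default.

```python
import pandas as pd

def counts_in_unique_order(series: pd.Series) -> pd.Series:
    """Count each value, ordered by first appearance (hypothetical helper)."""
    counts = series.value_counts()             # sorted by descending count
    first_seen = series.dropna().unique()      # order of first appearance, NaN excluded
    return counts.reindex(first_seen)          # reorder the counts to match

df = pd.DataFrame({'class': ['z', 'z', 'a', 'a', 'a', 'f', 'f', 'f', 'a', 'f', 'f']})
ordered = counts_in_unique_order(df['class'])
print(ordered)
```

On the toy frame above this yields the counts in the order z, a, f, which matches df['class'].unique().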
