Sort values that have the most distinct values in Pandas df

Time:11-13

Hello, I have the following pandas DataFrame and I'm trying to sort the hotels by how many distinct values they have for a certain field (i.e. "source_name"):

import pandas as pd

df_test=pd.DataFrame()
df_test['hotel_name']=['hotel1','hotel1','hotel2','hotel2','hotel2','hotel3','hotel3','hotel4','hotel4','hotel4']
df_test['source_name']=['source1','source2','source1','source2','source3','source1','source1','source1','source2','source3']

At the moment, I'm doing the following, but it just counts the number of rows for each hotel name, whereas I want to sort by the number of distinct "source_name" values for each "hotel_name".

output = df_test.groupby(['hotel_name']).apply(lambda x: x.to_json(orient='records'))

s = output.str.count("source_name").sort_values(ascending=False).index

My expected output would be a dict with the keys being in the following order:

hotel2, hotel4, hotel1, hotel3

CodePudding user response:

Use value_counts:

>>> df_test.value_counts('hotel_name').index.tolist()
['hotel2', 'hotel4', 'hotel1', 'hotel3']
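
One caveat worth checking: value_counts counts rows per hotel, not distinct sources, so the two measures can differ when a hotel repeats a source. A minimal sketch comparing the two on the question's data (the variable names row_counts and distinct_counts are only illustrative):

```python
import pandas as pd

# Sample data from the question
df_test = pd.DataFrame({
    'hotel_name': ['hotel1', 'hotel1', 'hotel2', 'hotel2', 'hotel2',
                   'hotel3', 'hotel3', 'hotel4', 'hotel4', 'hotel4'],
    'source_name': ['source1', 'source2', 'source1', 'source2', 'source3',
                    'source1', 'source1', 'source1', 'source2', 'source3'],
})

# Number of rows per hotel (what value_counts measures)
row_counts = df_test['hotel_name'].value_counts()

# Number of distinct source_name values per hotel
distinct_counts = df_test.groupby('hotel_name')['source_name'].nunique()

# hotel3 has two rows, but both point at the same source
print(row_counts['hotel3'], distinct_counts['hotel3'])
```

Here hotel3 has two rows but only one distinct source, so it would tie with hotel1 under value_counts while ranking below it under nunique.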

CodePudding user response:

If you need the rows sorted in descending order of those counts, use:

# sort by the number of rows per hotel
df = df_test.sort_values('hotel_name',key=lambda x: x.map(x.value_counts()),ascending=False)

# sort by the number of distinct source_name values per hotel
s = df_test.groupby('hotel_name')['source_name'].nunique()
df = df_test.sort_values('hotel_name', key=lambda x: x.map(s), ascending=False)

print (df)
  hotel_name source_name
2     hotel2     source1
3     hotel2     source2
4     hotel2     source3
7     hotel4     source1
8     hotel4     source2
9     hotel4     source3
0     hotel1     source1
1     hotel1     source2
5     hotel3     source1
6     hotel3     source1

Then pass sort=False to groupby so that it keeps the order of the sorted DataFrame instead of re-sorting the group keys:

out = df.groupby(['hotel_name'], sort=False).apply(lambda x: x.to_json(orient='records')).to_dict()
                                                  
print (out)
{'hotel2': '[{"hotel_name":"hotel2","source_name":"source1"},{"hotel_name":"hotel2","source_name":"source2"},{"hotel_name":"hotel2","source_name":"source3"}]', 'hotel4': '[{"hotel_name":"hotel4","source_name":"source1"},{"hotel_name":"hotel4","source_name":"source2"},{"hotel_name":"hotel4","source_name":"source3"}]', 'hotel1': '[{"hotel_name":"hotel1","source_name":"source1"},{"hotel_name":"hotel1","source_name":"source2"}]', 'hotel3': '[{"hotel_name":"hotel3","source_name":"source1"},{"hotel_name":"hotel3","source_name":"source1"}]'}
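
Because the sort key is the distinct count, hotels that tie (hotel2 and hotel4 here) share the same key value, and the default sort kind (quicksort) is not guaranteed to be stable. A hedged sketch, assuming you want the rows of tied hotels to keep their original relative order, is to pass kind='stable' (df_sorted and hotel_order are names introduced only for this example):

```python
import pandas as pd

# Sample data from the question
df_test = pd.DataFrame({
    'hotel_name': ['hotel1', 'hotel1', 'hotel2', 'hotel2', 'hotel2',
                   'hotel3', 'hotel3', 'hotel4', 'hotel4', 'hotel4'],
    'source_name': ['source1', 'source2', 'source1', 'source2', 'source3',
                    'source1', 'source1', 'source1', 'source2', 'source3'],
})

# Distinct source_name count per hotel
s = df_test.groupby('hotel_name')['source_name'].nunique()

# Stable sort: rows with equal keys keep their original relative order,
# so the rows of each hotel stay together
df_sorted = df_test.sort_values(
    'hotel_name', key=lambda x: x.map(s), ascending=False, kind='stable'
)

# Hotels in order of first appearance (the order groupby(sort=False) would use)
hotel_order = list(dict.fromkeys(df_sorted['hotel_name']))
print(hotel_order)
```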

CodePudding user response:

If you want a dictionary as output, with the hotel as the key and the number of unique sources as the value, you can use OrderedDict:

output = df_test.groupby(['hotel_name'])['source_name'].nunique().sort_values(ascending=False)
from collections import OrderedDict
OrderedDict(output.to_dict())

Which outputs:

OrderedDict([('hotel4', 3), ('hotel2', 3), ('hotel1', 2), ('hotel3', 1)])

If you only want the hotel values:

output.index

Outputs:

Index(['hotel4', 'hotel2', 'hotel1', 'hotel3'], dtype='object', name='hotel_name')
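
As a side note, since Python 3.7 a plain dict also preserves insertion order, so OrderedDict is optional on recent versions. A small sketch under that assumption (the name counts is only illustrative):

```python
import pandas as pd

# Sample data from the question
df_test = pd.DataFrame({
    'hotel_name': ['hotel1', 'hotel1', 'hotel2', 'hotel2', 'hotel2',
                   'hotel3', 'hotel3', 'hotel4', 'hotel4', 'hotel4'],
    'source_name': ['source1', 'source2', 'source1', 'source2', 'source3',
                    'source1', 'source1', 'source1', 'source2', 'source3'],
})

output = (
    df_test.groupby('hotel_name')['source_name']
    .nunique()
    .sort_values(ascending=False)
)

# Series.to_dict() returns a plain dict whose keys follow the Series order
counts = output.to_dict()
print(list(counts))
```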