Hello, I have the following pandas DataFrame and I'm trying to find a way to sort the hotels by how many distinct values they have for a certain field (i.e. "source_name"):
import pandas as pd
df_test=pd.DataFrame()
df_test['hotel_name']=['hotel1','hotel1','hotel2','hotel2','hotel2','hotel3','hotel3','hotel4','hotel4','hotel4']
df_test['source_name']=['source1','source2','source1','source2','source3','source1','source1','source1','source2','source3']
At the moment I'm doing the following, but it just counts the number of rows for each hotel name, whereas I want to sort by the number of distinct "source_name" values for each "hotel_name".
output = df_test.groupby(['hotel_name']).apply(lambda x: x.to_json(orient='records'))
s = output.str.count("source_name").sort_values(ascending=False).index
My expected output would be a dict with the keys being in the following order:
hotel2, hotel4, hotel1, hotel3
CodePudding user response:
Use value_counts:
>>> df_test.value_counts('hotel_name').index.tolist()
['hotel2', 'hotel4', 'hotel1', 'hotel3']
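Note that value_counts counts rows per hotel rather than distinct sources, so it can disagree with what the question asks for once a hotel repeats a source. A minimal sketch of the difference, reusing the question's data (the names row_counts and distinct_counts are only illustrative):

```python
import pandas as pd

df_test = pd.DataFrame({
    "hotel_name": ["hotel1", "hotel1", "hotel2", "hotel2", "hotel2",
                   "hotel3", "hotel3", "hotel4", "hotel4", "hotel4"],
    "source_name": ["source1", "source2", "source1", "source2", "source3",
                    "source1", "source1", "source1", "source2", "source3"],
})

# Rows per hotel: this is what value_counts measures.
row_counts = df_test["hotel_name"].value_counts()

# Distinct sources per hotel: this is what the question asks for.
distinct_counts = df_test.groupby("hotel_name")["source_name"].nunique()

# hotel3 has two rows but only one distinct source.
print(row_counts["hotel3"], distinct_counts["hotel3"])
```

For this sample the two measures only differ for hotel3, which is last either way, so the resulting order looks the same.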
CodePudding user response:
If you need the rows sorted in descending order of the counts, use:
#sort by counts
df = df_test.sort_values('hotel_name',key=lambda x: x.map(x.value_counts()),ascending=False)
#sort by unique counts
s = df_test.groupby('hotel_name')['source_name'].nunique()
df = df_test.sort_values('hotel_name', key=lambda x: x.map(s), ascending=False)
print (df)
hotel_name source_name
2 hotel2 source1
3 hotel2 source2
4 hotel2 source3
7 hotel4 source1
8 hotel4 source2
9 hotel4 source3
0 hotel1 source1
1 hotel1 source2
5 hotel3 source1
6 hotel3 source1
Then add sort=False to groupby so that it keeps the order of the sorted DataFrame instead of re-sorting the group keys:
out = df.groupby(['hotel_name'], sort=False).apply(lambda x: x.to_json(orient='records')).to_dict()
print (out)
{'hotel2': '[{"hotel_name":"hotel2","source_name":"source1"},{"hotel_name":"hotel2","source_name":"source2"},{"hotel_name":"hotel2","source_name":"source3"}]', 'hotel4': '[{"hotel_name":"hotel4","source_name":"source1"},{"hotel_name":"hotel4","source_name":"source2"},{"hotel_name":"hotel4","source_name":"source3"}]', 'hotel1': '[{"hotel_name":"hotel1","source_name":"source1"},{"hotel_name":"hotel1","source_name":"source2"}]', 'hotel3': '[{"hotel_name":"hotel3","source_name":"source1"},{"hotel_name":"hotel3","source_name":"source1"}]'}
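The values above are JSON strings. If plain Python objects are more convenient, a possible variant (a sketch, not part of the original answer) iterates over the groups and calls to_dict(orient='records') instead. The kind="mergesort" argument is an assumption added here to ask for a stable sort:

```python
import pandas as pd

df_test = pd.DataFrame({
    "hotel_name": ["hotel1", "hotel1", "hotel2", "hotel2", "hotel2",
                   "hotel3", "hotel3", "hotel4", "hotel4", "hotel4"],
    "source_name": ["source1", "source2", "source1", "source2", "source3",
                    "source1", "source1", "source1", "source2", "source3"],
})

# Distinct sources per hotel, used as the sort key.
s = df_test.groupby("hotel_name")["source_name"].nunique()

# Sort rows by that count, largest first.
df = df_test.sort_values(
    "hotel_name", key=lambda x: x.map(s), ascending=False, kind="mergesort"
)

# sort=False keeps the groups in order of first appearance in df.
out = {
    hotel: group.to_dict(orient="records")
    for hotel, group in df.groupby("hotel_name", sort=False)
}
print(list(out))
```

Each value is then a list of row dictionaries rather than a string that still has to be parsed.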
CodePudding user response:
If you want a dictionary as output, with the hotel as key and the number of unique sources as value, you can use OrderedDict (since Python 3.7 a plain dict also preserves insertion order):
output = df_test.groupby(['hotel_name'])['source_name'].nunique().sort_values(ascending=False)
from collections import OrderedDict
OrderedDict(output.to_dict())
Which outputs:
OrderedDict([('hotel4', 3), ('hotel2', 3), ('hotel1', 2), ('hotel3', 1)])
If you only want the hotel names:
output.index
Outputs:
Index(['hotel4', 'hotel2', 'hotel1', 'hotel3'], dtype='object', name='hotel_name')
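In the output above hotel4 comes before hotel2, while the question expects hotel2 first. Both have three distinct sources, so their relative order depends on how ties are resolved. A minimal sketch, assuming ties should fall back to alphabetical hotel name (the question does not state this; the n_sources column name is only illustrative):

```python
import pandas as pd

df_test = pd.DataFrame({
    "hotel_name": ["hotel1", "hotel1", "hotel2", "hotel2", "hotel2",
                   "hotel3", "hotel3", "hotel4", "hotel4", "hotel4"],
    "source_name": ["source1", "source2", "source1", "source2", "source3",
                    "source1", "source1", "source1", "source2", "source3"],
})

# Distinct sources per hotel, as a two-column frame.
counts = (
    df_test.groupby("hotel_name")["source_name"]
    .nunique()
    .reset_index(name="n_sources")
)

# Count descending, then hotel name ascending to break ties explicitly.
counts = counts.sort_values(["n_sources", "hotel_name"], ascending=[False, True])

ordered_hotels = counts["hotel_name"].tolist()
# A plain dict keeps insertion order on Python 3.7+.
ordered_counts = dict(zip(counts["hotel_name"], counts["n_sources"]))
print(ordered_hotels)
```

Sorting on both columns makes the tie-break explicit instead of leaving it to the sort algorithm.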