I´m in front of a problem and I´ve been trying to solve it for several hours now. I tried to find different ways to proceed, but Panda seemed to be the most efficient...
I have a table with this format:
category1 subcat1 10
category1 subcat2 15
category1 subcat3 1
category1 subcat4 23
category2 subcat1 10
category2 subcat2 99
category2 subcat3 23
category2 subcat4 12
category2 subcat5 10
category2 subcat6 11
category2 subcat7 45
category2 subcat8 105
category3 subcat1 2
category3 subcat2 9
category3 subcat3 14
category3 subcat4 25
... ... ...
I´m trying to get a result from this form (max3 subcat values per category):
category1:
subcat1 - 10
subcat2 - 15
subcat3 - 10
category2:
subcat2 - 99
subcat7 - 45
subcat8 - 105
category3:
subcat2 - 9
subcat3 - 14
subcat4 - 25
actually, I wrote this code:
import pandas as pd
data_url = 'pathtomydata.csv'
score_rating = pd.read_csv(data_url, sep=';', usecols=['category','subcategory', 'value'])
score_rating.groupby(['category','subcategory', 'value'])
print(score_rating.to_string)
I can see the content of Score_rating, but I can´t get it ordered by category, and more importantly, I can´t get the max 3 values with their subcategory by category. I tried with .head(3), but it doesn´t delivers the max values, just the 3 first ones. I tried as well .top(3), but same issue. With .max(), I succeed to get ONE max value by category/Subcategory, but I need 3.
Has anybody some experience with a similar problem, or an idea, a workaround how I could get that result?
Many thanks in advance.
I´m fairly new here, hope for your understanding while reading this explaination. :)
CodePudding user response:
Here is work around, I still think there is a better way of doing it, but this should help you out for now:
result = pd.DataFrame()
for item in df['category'].unique():
result = result.append(df[df.category == item].nlargest(3, 'value'))
Basically you use the nlargest
method on a sub-dataframe for each category, in a kind of manual groupby function, because it is not supported on groupby objects.
CodePudding user response:
As for your demo data:
df = pd.DataFrame({'col1': ['category1', 'category1','category1', 'category1', 'category2', 'category2','category2', 'category2', 'category2', 'category2','category2', 'category2', 'category3', 'category3','category3', 'category3'],
'col2': ['subcat1', 'subcat2', 'subcat3', 'subcat4', 'subcat1', 'subcat2', 'subcat3', 'subcat4', 'subcat5', 'subcat6', 'subcat7', 'subcat8', 'subcat1', 'subcat2', 'subcat3', 'subcat4'],
'col3': [10, 15, 1, 23, 10, 99, 23, 12, 10, 11, 45, 105, 2, 9, 14, 25]})
print(df)
col1 col2 col3
0 category1 subcat1 10
1 category1 subcat2 15
2 category1 subcat3 1
3 category1 subcat4 23
4 category2 subcat1 10
5 category2 subcat2 99
6 category2 subcat3 23
7 category2 subcat4 12
8 category2 subcat5 10
9 category2 subcat6 11
10 category2 subcat7 45
11 category2 subcat8 105
12 category3 subcat1 2
13 category3 subcat2 9
14 category3 subcat3 14
15 category3 subcat4 25
Here is the solution offered by joris in another question:
df_agg = df.groupby(['col1', 'col2']).agg({'col3': sum})
a = df_agg['col3'].groupby('col1', group_keys=False).nlargest(3)
print(a)
col1 col2
category1 subcat4 23
subcat2 15
subcat1 10
category2 subcat8 105
subcat2 99
subcat7 45
category3 subcat4 25
subcat3 14
subcat2 9