Home > front end >  pandas return top n categorical field for each type for each year
pandas return top n categorical field for each type for each year

Time:03-10

Thanks for taking the time to go through this! :) I have a data frame that looks like this

z=pd.DataFrame.from_dict({'id':[1,2,3,4,5,6,7,8,9,10],
                        'name':['ax','bx','ayx','byz','ck','cl','akl','bwo','cop','bab'],
                        'value':[45,32,64,12,90,121,34,56,78,76],
                        'type':['x','y','x','y','z','z','y','z','x','z'],
                         'year':[1997,1198,1999,1999,1998,1997,1997,1998,1997,1997]})

I can find the values of top 2 types for each year using

z.groupby(['year','type'])['value'].nlargest(2)

How can I find the Top 2 names for each type of each year in the dataset that has highest value?

this is how the output will resemble

year category value name
1997   a       45    ax
       a       34    akl
       b       76    bab
       b       NULL  NULL
       c       121   cl
       c       78    cop
        

value for 1997 -> b is null since there is only one value for b in dataset but we need top 2

CodePudding user response:

I believe you're looking for z.groupby(['year','type'])[['year', 'value']].value_counts() which returns a hash table of the counts. You can grab the top to records from that.

You may have to call .reset_index() on your dataframe to unindex your key.

CodePudding user response:

Based on your example DataFrame:

z.sort_values(["year","type","value"], ascending=[1,1,0])\
.groupby(["year","type"]).head(2).reset_index()

 index id name value type year
0   1   2   bx  32   y  1198
1   8   9   cop 78   x  1997
2   0   1   ax  45   x  1997
3   6   7   akl 34   y  1997
4   5   6   cl  121  z  1997
5   9   10  bab 76   z  1997
6   4   5   ck  90   z  1998
7   7   8   bwo 56   z  1998
8   2   3   ayx 64   x  1999
9   3   4   byz 12   y  1999

Now for each year and type you have top 2 names. For top-n you may change value inside head() method.

  • Related