pandas groupby().head(n) where n is a function of group label

Time:10-08

I have a dataframe that I would like to group by a column, taking the head of each group, but with the depth of the head given by a function of the group label. If the depth were the same for every group, I could simply do df.groupby('label').head(n). I can imagine a solution that iterates through df['label'].unique(), slices the dataframe and builds a new one, but I'm in a performance-sensitive context, so I'd like to avoid that kind of iteration if possible.

Here's an example dataframe:

  label   values
0  apple       7
1  apple       5
2  apple       4
3    car       9
4    car       6
5    dog       5
6    dog       3
7    dog       2
8    dog       1

And here is the code for my example setup:

import pandas as pd
df = pd.DataFrame({'label': ['apple', 'apple', 'apple', 'car', 'car', 'dog', 'dog', 'dog', 'dog'],
                   'values': [7, 5, 4, 9, 6, 5, 3, 2, 1]})
def depth(label):
    if label == 'apple': return 1
    elif label == 'car': return 2
    elif label == 'dog': return 3

My desired output is a dataframe in which the number of rows kept from each group is determined by that function:

   label  values
0  apple       7
3    car       9
4    car       6
5    dog       5
6    dog       3
7    dog       2
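
For reference, the iteration-based approach described in the question might look roughly like the sketch below. It is only a baseline for comparison, not a recommended solution; the names pieces and baseline are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'label': ['apple', 'apple', 'apple', 'car', 'car', 'dog', 'dog', 'dog', 'dog'],
                   'values': [7, 5, 4, 9, 6, 5, 3, 2, 1]})

def depth(label):
    if label == 'apple': return 1
    elif label == 'car': return 2
    elif label == 'dog': return 3

# One slice per distinct label, each cut to its own depth,
# then concatenated back into a single dataframe.
pieces = [df[df['label'] == label].head(depth(label))
          for label in df['label'].unique()]
baseline = pd.concat(pieces)
print(baseline)
```

This builds one boolean mask and one intermediate dataframe per label, which is the per-group overhead the question hopes to avoid.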

CodePudding user response:

I would use a dictionary here and look up each group's label through <group>.name inside groupby.apply:

depth = {'apple': 1, 'car': 2, 'dog': 3}

out = (df.groupby('label', group_keys=False)
         .apply(lambda g: g.head(depth.get(g.name, 0)))
       )

NB. if you really need a function, you can do the same with a function call. Make sure to return a value in every case.

Alternative option with groupby.cumcount and boolean indexing:

out = df[df['label'].map(depth).gt(df.groupby('label').cumcount())]

output:

   label  values
0  apple       7
3    car       9
4    car       6
5    dog       5
6    dog       3
7    dog       2
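
To make the cumcount one-liner easier to follow, here is the same idea split into named steps. This is a minimal sketch; the intermediate names position, limit and mask are illustrative, and the fillna(0) is an added assumption so that labels missing from the dictionary are dropped rather than compared against NaN.

```python
import pandas as pd

df = pd.DataFrame({'label': ['apple', 'apple', 'apple', 'car', 'car', 'dog', 'dog', 'dog', 'dog'],
                   'values': [7, 5, 4, 9, 6, 5, 3, 2, 1]})
depth = {'apple': 1, 'car': 2, 'dog': 3}

# Position of each row inside its group: 0, 1, 2, ... restarting per label.
position = df.groupby('label').cumcount()

# Number of rows allowed for each row's label. Labels missing from the
# mapping become NaN, so they are filled with 0 (assumption: drop them).
limit = df['label'].map(depth).fillna(0)

# Keep a row only while its position is still below the limit for its label.
mask = position < limit
out = df[mask]
print(out)
```

Series.map also accepts a callable, so a depth function could be passed in place of the dictionary, provided it returns a number for every label.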

CodePudding user response:

Another possible solution, based on GroupBy.get_group, groupby.ngroups and groups.keys:

g = df.groupby('label')
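# head depth is the group's position + 1; this matches the requested depths
# only because the sorted labels (apple, car, dog) happen to map to 1, 2, 3
pd.concat([g.get_group(x[0]).head(x[1] + 1)
          for x in zip(g.groups.keys(), range(g.ngroups))])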

Output:

   label  values
0  apple       7
3    car       9
4    car       6
5    dog       5
6    dog       3
7    dog       2
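
The get_group approach derives each depth from the group's position, which fits this example but not an arbitrary mapping. A possible variant, sketched here under the assumption that the depths live in a dictionary named depth, looks each label up explicitly instead:

```python
import pandas as pd

df = pd.DataFrame({'label': ['apple', 'apple', 'apple', 'car', 'car', 'dog', 'dog', 'dog', 'dog'],
                   'values': [7, 5, 4, 9, 6, 5, 3, 2, 1]})
depth = {'apple': 1, 'car': 2, 'dog': 3}

g = df.groupby('label')

# Iterating over g.groups yields the group labels; each group is cut to
# the depth registered for its label (0 rows if the label is unknown).
out = pd.concat([g.get_group(label).head(depth.get(label, 0))
                 for label in g.groups])
print(out)
```

This still loops over the groups in Python, so it may not satisfy the performance constraint from the question.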