Home > front end >  How to get max of counts for groupby
How to get max of counts for groupby

Time:01-21

I have a dataframe. I want to group by rows on some columns and then form a count column and then get the max of counts and create a column for it and attach it to dataframe.

I tried:

    df["max_pred"] = df.groupby(['fid','prefix','pred_text1'], 
                             sort=False)["pred_text1"].transform("max")

However it lists a row with max repeat for pred_text1, but I want the number of reparation for it

For example:

A  B  C
a  d  b
a  d  b
a  d  b
a  d  a
a  d  a
b  b  c
b  b  c
b  b  d

If I group the rows by A and B and then count C and get its max for each group and store that in new column F, I expect:

A  B  F   E
a  d  3   b
a  d  3   b
a  d  3   b
a  d  3   b
a  d  3   b
b  b  2   c
b  b  2   c
b  b  2   c

Also, E shows the most frequent item specified by max value

CodePudding user response:

You can use groupby.transform with value_counts:

df['F'] = (df.groupby(['A', 'B'])['C']
             .transform(lambda g: g.value_counts(sort=False).max())
          )

Variant with collections.Counter:

from collections import Counter

df['F'] = (df.groupby(['A', 'B'])['C']
             .transform(lambda g: max(Counter(g).values()))
          )

Output:

   A  B  C  F
0  a  d  b  3
1  a  d  b  3
2  a  d  b  3
3  a  d  a  3
4  a  d  a  3
5  b  b  c  2
6  b  b  c  2
7  b  b  d  2

update

I would use a merge here:

cols = ['A', 'B']
out = df.merge(df.value_counts().groupby(cols).head(1)
                 .reset_index(name='F').rename(columns={'C': 'E'})
               )

Output:

   A  B  C  E  F
0  a  d  b  b  3
1  a  d  b  b  3
2  a  d  b  b  3
3  a  d  a  b  3
4  a  d  a  b  3
5  b  b  c  c  2
6  b  b  c  c  2
7  b  b  d  c  2

CodePudding user response:

Another option is with get_dummies; for large enough data, I'd expect @mozway's solution to scale better:

temp = (pd
       .get_dummies(df, columns = ['C'], prefix="",prefix_sep="")
       .groupby(['A','B'])
       .transform('sum'))
(df
.drop(columns='C')
.assign(F=temp.max(1), E = temp.idxmax(1))
)
   A  B  F  E
0  a  d  3  b
1  a  d  3  b
2  a  d  3  b
3  a  d  3  b
4  a  d  3  b
5  b  b  2  c
6  b  b  2  c
7  b  b  2  c
  • Related