Count occurrences by group and add a column of percentages in python-CodePudding

My problem: Given a data frame I want to count occurrences of "is_bot" by "processed" as a grouping variable and then calculate the percentages using "is_bot" as the grouping variable.

My data: My data looks like this:

    is_bot  processed
0   No Bot  2
1   self_declared   3
2   self_declared   1
3   No Bot  3
4   No Bot  2
5   No Bot  1
6   No Bot  3
7   No Bot  1
8   No Bot  1
9   No Bot  2

So far: I have successfully calculated the occurrences. Then I managed to calculate the percentages. But I can't manage to do it within the same data frame.

foo = df.groupby(["processed", "is_bot"]).size()
foo

processed  is_bot       
1          No Bot           3
           self_declared    1
2          No Bot           3
3          No Bot           2
           self_declared    1
dtype: int64

foo.groupby("is_bot").transform(lambda x: 100*x/x.sum())

processed  is_bot       
1          No Bot           37.5
           self_declared    50.0
2          No Bot           37.5
3          No Bot           25.0
           self_declared    50.0
dtype: float64

My Data: This is the dict for my raw data:

df = {'is_bot': {0: 'No Bot', 1: 'self_declared', 2: 'self_declared', 3: 'No Bot', 4: 'No Bot', 5: 'No Bot', 6: 'No Bot', 7: 'No Bot', 8: 'No Bot', 9: 'No Bot'}, 'processed': {0: 2, 1: 3, 2: 1, 3: 3, 4: 2, 5: 1, 6: 3, 7: 1, 8: 1, 9: 2}}

Extra: I am able to do it in R using dplyr:

df %>% 
  group_by(processed) %>% 
  count(is_bot) %>% 
  ungroup() %>% 
  group_by(is_bot) %>% 
  mutate(perc = n/sum(n)*100)

CodePudding user response：

You need grouping by both columns first, get counts and then aggregate again but only one column. Here are possible alterative solutions:

df1 = (df.groupby(["processed", "is_bot"])
         .size()
         .div(df['is_bot'].value_counts(), level=1)
         .mul(100))
print (df1)

Or join both:

df1 = (df.groupby(["processed", "is_bot"]).size()
         .groupby("is_bot").transform(lambda x: 100*x/x.sum()))

Or use crosstab with normalize parameter, but it count also not match values like 0, so need replace them and reshape by DataFrame.stack:

df1 = (pd.crosstab(df['processed'], df['is_bot'], normalize=1)
         .replace(0, np.nan)
         .stack()
         .mul(100))
print (df1)
processed  is_bot       
1          No Bot           37.5
           self_declared    50.0
2          No Bot           37.5
3          No Bot           25.0
           self_declared    50.0
dtype: float64

For both columns use:

s = df.groupby(["processed", "is_bot"]).size()
s1 = s.groupby("is_bot").transform(lambda x: 100*x/x.sum())

df2 = pd.concat([s, s1], axis=1, keys=('count','perc'))

Or:

df2 = (df.groupby(["processed", "is_bot"])
         .size()
         .to_frame('count')
         .assign(perc = lambda x: x.groupby("is_bot").transform(lambda x: 100*x/x.sum())))

print (df2)
                         count  perc
processed is_bot                    
1         No Bot             3  37.5
          self_declared      1  50.0
2         No Bot             3  37.5
3         No Bot             2  25.0
          self_declared      1  50.0