My problem: Given a data frame I want to count occurrences of "is_bot" by "processed" as a grouping variable and then calculate the percentages using "is_bot" as the grouping variable.
My data: My data looks like this:
is_bot processed
0 No Bot 2
1 self_declared 3
2 self_declared 1
3 No Bot 3
4 No Bot 2
5 No Bot 1
6 No Bot 3
7 No Bot 1
8 No Bot 1
9 No Bot 2
So far: I have successfully calculated the occurrences. Then I managed to calculate the percentages. But I can't manage to do it within the same data frame.
foo = df.groupby(["processed", "is_bot"]).size()
foo
processed is_bot
1 No Bot 3
self_declared 1
2 No Bot 3
3 No Bot 2
self_declared 1
dtype: int64
foo.groupby("is_bot").transform(lambda x: 100*x/x.sum())
processed is_bot
1 No Bot 37.5
self_declared 50.0
2 No Bot 37.5
3 No Bot 25.0
self_declared 50.0
dtype: float64
My Data: This is the dict for my raw data:
df = {'is_bot': {0: 'No Bot', 1: 'self_declared', 2: 'self_declared', 3: 'No Bot', 4: 'No Bot', 5: 'No Bot', 6: 'No Bot', 7: 'No Bot', 8: 'No Bot', 9: 'No Bot'}, 'processed': {0: 2, 1: 3, 2: 1, 3: 3, 4: 2, 5: 1, 6: 3, 7: 1, 8: 1, 9: 2}}
Extra: I am able to do it in R
using dplyr
:
df %>%
group_by(processed) %>%
count(is_bot) %>%
ungroup() %>%
group_by(is_bot) %>%
mutate(perc = n/sum(n)*100)
CodePudding user response:
You need grouping by both columns first, get counts and then aggregate again but only one column. Here are possible alterative solutions:
df1 = (df.groupby(["processed", "is_bot"])
.size()
.div(df['is_bot'].value_counts(), level=1)
.mul(100))
print (df1)
Or join both:
df1 = (df.groupby(["processed", "is_bot"]).size()
.groupby("is_bot").transform(lambda x: 100*x/x.sum()))
Or use crosstab
with normalize
parameter, but it count also not match values like 0
, so need replace them and reshape by DataFrame.stack
:
df1 = (pd.crosstab(df['processed'], df['is_bot'], normalize=1)
.replace(0, np.nan)
.stack()
.mul(100))
print (df1)
processed is_bot
1 No Bot 37.5
self_declared 50.0
2 No Bot 37.5
3 No Bot 25.0
self_declared 50.0
dtype: float64
For both columns use:
s = df.groupby(["processed", "is_bot"]).size()
s1 = s.groupby("is_bot").transform(lambda x: 100*x/x.sum())
df2 = pd.concat([s, s1], axis=1, keys=('count','perc'))
Or:
df2 = (df.groupby(["processed", "is_bot"])
.size()
.to_frame('count')
.assign(perc = lambda x: x.groupby("is_bot").transform(lambda x: 100*x/x.sum())))
print (df2)
count perc
processed is_bot
1 No Bot 3 37.5
self_declared 1 50.0
2 No Bot 3 37.5
3 No Bot 2 25.0
self_declared 1 50.0