I have a df with this structure:
id  a1_l1  a2_l1  a3_l1  a1_l2  a2_l2  a3_l2
 1      1      5      3      1      2      3
 2      1      5      3      1      2      3
 3      2      5      3      5      5      3
 4      5      5      3      5      5      3
 5      5      5      2
 6      5      5      2
 7      5      5      2
 8      2      5      2
 9      3      5      1
10      3      5      1
I want to summarize in a table such that I get:
     l1   l2
a1  0.4  0.5
a2    1  0.5
a3    0    0
What I'm doing is counting how many times 5 was present and dividing by the number of valid responses in each column, so that, for example:
a1, l1 equals .4 because there are 4 values of 5 divided by 10 responses, and a2, l2 equals .5 because there are 2 values of 5 divided by the 4 valid responses in that column.
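For reference, the frame can be rebuilt like this (a sketch, treating the blank l2 cells as NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':    range(1, 11),
    'a1_l1': [1, 1, 2, 5, 5, 5, 5, 2, 3, 3],
    'a2_l1': [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    'a3_l1': [3, 3, 3, 3, 2, 2, 2, 2, 1, 1],
    'a1_l2': [1, 1, 5, 5] + [np.nan] * 6,
    'a2_l2': [2, 2, 5, 5] + [np.nan] * 6,
    'a3_l2': [3, 3, 3, 3] + [np.nan] * 6,
})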
Thanks!
CodePudding user response:
Try with pd.wide_to_long:
s = pd.wide_to_long(df, ['a1', 'a2', 'a3'], i='id', j='level', sep='_', suffix='\\w+')
# count the 5s per level, then divide by the number of valid (non-NA) responses per column
out = s.eq(5).groupby(level=1).sum()
out = out.T.div(s.notna().groupby(level=1).sum().T)
out
level   l1   l2
a1     0.4  0.5
a2     1.0  0.5
a3     0.0  0.0
CodePudding user response:
- set the index to id with set_index('id') (you could drop the id column with drop('id', axis=1) instead, since it isn't needed)
- calculate the share of 5s in each column using (x == 5).sum() / x.notna().sum()
- reset the index to get a column we can work with
- generate the 'a' and 'l' columns by splitting the index
- pivot
tmp = df.set_index('id').apply(lambda x: x.eq(5).sum()/x.notna().sum()).reset_index()
tmp[['a', 'l']] = tmp['index'].str.split('_', expand=True)
>>>
   index    0   a   l
0  a1_l1  0.4  a1  l1
1  a2_l1  1.0  a2  l1
2  a3_l1  0.0  a3  l1
3  a1_l2  0.5  a1  l2
4  a2_l2  0.5  a2  l2
5  a3_l2  0.0  a3  l2
(tmp.drop('index', axis=1)
    .pivot(index='a', columns='l')
    .droplevel(0, axis=1)
    .rename_axis(None)
    .rename_axis(None, axis=1))
>>>
     l1   l2
a1  0.4  0.5
a2  1.0  0.5
a3  0.0  0.0
CodePudding user response:
You can reshape to a dataframe with a MultiIndex on the columns, then divide the count of values equal to 5 by the count of non-NA values per column. Finally, unstack:
df2 = df.set_index('id')
# split 'a1_l1' into ('a1', 'l1') and use the two parts as column levels
df2.columns = pd.MultiIndex.from_arrays(zip(*df2.columns.map(lambda x: x.split('_'))))
df2 = (df2.eq(5).sum()/df2.notna().sum()).unstack()
output:
     l1   l2
a1  0.4  0.5
a2  1.0  0.5
a3  0.0  0.0
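If you prefer, here is an equivalent and slightly more compact way to build the column MultiIndex (a sketch giving the same output; splitting an Index with expand=True yields a MultiIndex):
df2 = df.set_index('id')
# splitting the column labels on '_' with expand=True produces a two-level MultiIndex
df2.columns = df2.columns.str.split('_', expand=True)
df2 = (df2.eq(5).sum()/df2.notna().sum()).unstack()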
CodePudding user response:
I'm very impressed by all these cool solutions, so I upvoted all of them, but they all look basically equally complex. So I decided to run some benchmarks to help the OP make a decision.
| Solution | Average time to run |
|---|---|
| mozway | 2.1 ms |
| Asish M. | 4.5 ms |
| BENY | 13.3 ms |
Congrats, @mozway!
Details:
# mozway
2.15 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Asish M.
4.51 ms ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.49 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.48 ms ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.51 ms ± 29.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.48 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# BENY
13.5 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.2 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
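For completeness, a rough sketch of how such timings could be reproduced with the standard library instead of IPython's %timeit (the setup here is an assumption; it reuses the sample df from the question and the MultiIndex solution above):
import timeit

def mozway_solution(df):
    df2 = df.set_index('id')
    df2.columns = pd.MultiIndex.from_arrays(zip(*df2.columns.map(lambda x: x.split('_'))))
    return (df2.eq(5).sum()/df2.notna().sum()).unstack()

# average seconds per call over 100 runs, converted to milliseconds
per_loop = timeit.timeit(lambda: mozway_solution(df), number=100) / 100
print(f'{per_loop * 1000:.2f} ms per loop')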