Combined aggregate based on valid values


I have a df with this structure:

id  a1_l1  a2_l1  a3_l1  a1_l2  a2_l2  a3_l2
1   1      5      3      1      2      3
2   1      5      3      1      2      3
3   2      5      3      5      5      3
4   5      5      3      5      5      3
5   5      5      2      NaN    NaN    NaN
6   5      5      2      NaN    NaN    NaN
7   5      5      2      NaN    NaN    NaN
8   2      5      2      NaN    NaN    NaN
9   3      5      1      NaN    NaN    NaN
10  3      5      1      NaN    NaN    NaN

I want to summarize in a table such that I get:

    l1  l2
a1  0.4 0.5
a2  1   0.5
a3  0   0

What I'm doing is counting how many times 5 appears in a column and dividing by the number of valid (non-missing) responses in that column, so that for example:

a1, l1 is equal to .4 as I have four 5s divided by 10 valid responses, and a2, l2 equals .5 as I have two 5s divided by the 4 valid responses in that column.
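
For reference, the frame can be rebuilt like this (the blank l2 cells are taken as NaN):

import numpy as np
import pandas as pd

# reconstruction of the example frame; blanks in the l2 columns assumed NaN
df = pd.DataFrame({
    'id': range(1, 11),
    'a1_l1': [1, 1, 2, 5, 5, 5, 5, 2, 3, 3],
    'a2_l1': [5] * 10,
    'a3_l1': [3, 3, 3, 3, 2, 2, 2, 2, 1, 1],
    'a1_l2': [1, 1, 5, 5] + [np.nan] * 6,
    'a2_l2': [2, 2, 5, 5] + [np.nan] * 6,
    'a3_l2': [3, 3, 3, 3] + [np.nan] * 6,
})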

Thanks!

CodePudding user response:

Try pd.wide_to_long to stack the l1/l2 suffixes into a row level:

import pandas as pd

# reshape to long: one row per (id, level), with value columns a1/a2/a3
s = pd.wide_to_long(df, ['a1', 'a2', 'a3'], i='id', j='level', sep='_', suffix='\\w+')
# number of 5s per level and column, divided by valid (non-NaN) responses
out = s.eq(5).groupby(level=1).sum()
out = out.T.div(s.groupby(level=1).count().T)
out
level   l1   l2
a1     0.4  0.5
a2     1.0  0.5
a3     0.0  0.0
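
A quick sanity check on the denominator: size() counts every row in each level group, while count() counts only the non-NaN entries per column, which is what "valid responses" means for l2 here.

s.groupby(level=1).size()   # l1: 10, l2: 10  -> wrong denominator for l2
s.groupby(level=1).count()  # per column: l1 -> 10, l2 -> 4 valid responses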

CodePudding user response:

You can also use drop('id', axis=1) instead of set_index('id').

  1. calculate the ratio for each column using (x == 5).sum() / x.notna().sum()
  2. reset the index to get the column names as a regular column
  3. generate the 'a' and 'l' columns by splitting that column on '_'
  4. pivot into the final shape
tmp = df.set_index('id').apply(lambda x: x.eq(5).sum()/x.notna().sum()).reset_index()
tmp[['a', 'l']] = tmp['index'].str.split('_', expand=True)


>>>
   index    0   a   l
0  a1_l1  0.4  a1  l1
1  a2_l1  1.0  a2  l1
2  a3_l1  0.0  a3  l1
3  a1_l2  0.5  a1  l2
4  a2_l2  0.5  a2  l2
5  a3_l2  0.0  a3  l2

(tmp.drop('index', axis=1)
    .pivot(index='a', columns='l')
    .droplevel(0, axis=1)
    .rename_axis(None)
    .rename_axis(None, axis=1))

>>>
     l1   l2
a1  0.4  0.5
a2  1.0  0.5
a3  0.0  0.0
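
As a small variation (my suggestion, not part of the answer above), passing values= to pivot avoids the extra column level and the droplevel call:

out = (
    tmp.rename(columns={0: 'pct'})              # give the ratio column a name
       .pivot(index='a', columns='l', values='pct')
       .rename_axis(None)                       # drop the 'a' index label
       .rename_axis(None, axis=1)               # drop the 'l' columns label
)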

CodePudding user response:

You can reshape to a dataframe with MultiIndex columns, then divide the count of values equal to 5 by the count of non-NA values per column. Finally, unstack:

df2 = df.set_index('id')
df2.columns = pd.MultiIndex.from_arrays(zip(*df2.columns.map(lambda x: x.split('_'))))
df2 = (df2.eq(5).sum()/df2.notna().sum()).unstack()

output:

     l1   l2
a1  0.4  0.5
a2  1.0  0.5
a3  0.0  0.0
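
For intuition, the same two-level columns can be built with pd.MultiIndex.from_tuples; the explicit level names below are my addition (they will show up as axis labels in the output):

df2 = df.set_index('id')
# split 'a1_l1' -> ('a1', 'l1') and name the two levels
df2.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')) for c in df2.columns], names=['a', 'level']
)
# count of 5s over count of valid answers per column, then pivot the
# 'level' part of the index out into columns
df2 = (df2.eq(5).sum() / df2.notna().sum()).unstack('level')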

CodePudding user response:

I'm very impressed by all these cool solutions, so I upvoted all of them, but they all look basically equally complex. So I decided to run some benchmarks to help the OP make a decision.
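
For reference, here is the kind of harness used (a sketch; wrapping each answer in a function and timing it on the 10-row frame from the question are my assumptions):

import pandas as pd

# each answer wrapped as a function, shown here for mozway's solution
def mozway(df):
    df2 = df.set_index('id')
    df2.columns = pd.MultiIndex.from_arrays(
        zip(*df2.columns.map(lambda x: x.split('_')))
    )
    return (df2.eq(5).sum() / df2.notna().sum()).unstack()

# in IPython/Jupyter, each candidate is timed the same way:
# %timeit mozway(df)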

Solution    Average time to run
mozway      2.1 ms
Asish M.    4.5 ms
BENY        13.3 ms

Congrats, @mozway!


Details:

# mozway
2.15 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Asish M.
4.51 ms ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.49 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.48 ms ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.51 ms ± 29.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.48 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# BENY
13.5 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.3 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.2 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)