I have some dataframe with a column variable that contains several levels, along with corresponding values in another col. I'd like to calculate the differences between each observation across each pair of levels (the observations are related). I can do it by pivoting wide, getting each pair using itertools, making a new column as the difference of each pair, then pivoting back to long. But is there a more efficient way without pivoting around? Here is my code:
import numpy as np
import pandas as pd
import itertools
n = 10
np.random.seed(0)
df = pd.DataFrame(
{
'obs': np.arange(1, n 1),
'a': np.random.normal(0, 2.5, n),
'b': np.random.normal(5, 2.5, n),
'c': np.random.normal(0, 2.5, n),
'd': np.random.normal(2, 2.5, n)
}
).melt(id_vars='obs')
df.head(15)
obs variable value
0 1 a 4.410131
1 2 a 1.000393
2 3 a 2.446845
3 4 a 5.602233
4 5 a 4.668895
5 6 a -2.443195
6 7 a 2.375221
7 8 a -0.378393
8 9 a -0.258047
9 10 a 1.026496
10 1 b 5.360109
11 2 b 8.635684
12 3 b 6.902594
13 4 b 5.304188
14 5 b 6.109658
# My approach
groupings = df['variable'].unique() # gets each unique "group"
# Pivot wide
wide = df.pivot_table(index='obs', values='value', columns='variable')
# Get all perms
pairs = list(itertools.permutations(groupings, 2))
# Calculate the differences
for a, b in pairs:
wide[f"{a} MINUS {b}"] = wide[a] - wide[b]
# Pivot back to long
difference_data = wide[[c for c in wide.columns if 'MINUS' in c]].reset_index().melt(id_vars='obs')
difference_data.head(15)
obs variable value
0 1 a MINUS b -0.949978
1 2 a MINUS b -7.635291
2 3 a MINUS b -4.455749
3 4 a MINUS b 0.298045
4 5 a MINUS b -1.440763
5 6 a MINUS b -8.277381
6 7 a MINUS b -6.359977
7 8 a MINUS b -4.865497
8 9 a MINUS b -6.040716
9 10 a MINUS b -1.838264
10 1 a MINUS c 10.792605
11 2 a MINUS c -0.633653
12 3 a MINUS c 0.285754
13 4 a MINUS c 7.457646
14 5 a MINUS c -1.005492
CodePudding user response:
You can use itertools.combinations
(or permutations
) directly on the groupby
and concat
:
groups = df.set_index('obs').groupby('variable')['value']
from itertools import combinations
out = (pd
.concat({f'{k1} MINUS {k2}': g1-g2
for (k1, g1), (k2, g2) in combinations(groups, 2)},
names=['variable']
)
.reset_index()
)
output:
variable obs value
0 a MINUS b 1 -0.949978
1 a MINUS b 2 -7.635291
2 a MINUS b 3 -4.455749
3 a MINUS b 4 0.298045
4 a MINUS b 5 -1.440763
5 a MINUS b 6 -8.277381
6 a MINUS b 7 -6.359977
7 a MINUS b 8 -4.865497
8 a MINUS b 9 -6.040716
9 a MINUS b 10 -1.838264
10 a MINUS c 1 10.792605
11 a MINUS c 2 -0.633653
12 a MINUS c 3 0.285754
13 a MINUS c 4 7.457646
14 a MINUS c 5 -1.005492
15 a MINUS c 6 1.192719
16 a MINUS c 7 2.260825
17 a MINUS c 8 0.089567
18 a MINUS c 9 -4.089995
19 a MINUS c 10 -2.646901
20 a MINUS d 1 2.022762
21 a MINUS d 2 -1.945013
22 a MINUS d 3 2.666309
23 a MINUS d 4 8.554224
24 a MINUS d 5 3.538675
25 a MINUS d 6 -4.834067
26 a MINUS d 7 -2.700506
27 a MINUS d 8 -5.384343
28 a MINUS d 9 -1.289730
29 a MINUS d 10 -0.217747
30 b MINUS c 1 11.742583
31 b MINUS c 2 7.001637
32 b MINUS c 3 4.741504
33 b MINUS c 4 7.159600
34 b MINUS c 5 0.435272
35 b MINUS c 6 9.470100
36 b MINUS c 7 8.620801
37 b MINUS c 8 4.955064
38 b MINUS c 9 1.950721
39 b MINUS c 10 -0.808636
40 b MINUS d 1 2.972740
41 b MINUS d 2 5.690277
42 b MINUS d 3 7.122059
43 b MINUS d 4 8.256179
44 b MINUS d 5 4.979438
45 b MINUS d 6 3.443313
46 b MINUS d 7 3.659471
47 b MINUS d 8 -0.518845
48 b MINUS d 9 4.750986
49 b MINUS d 10 1.620518
50 c MINUS d 1 -8.769843
51 c MINUS d 2 -1.311360
52 c MINUS d 3 2.380555
53 c MINUS d 4 1.096579
54 c MINUS d 5 4.544167
55 c MINUS d 6 -6.026787
56 c MINUS d 7 -4.961330
57 c MINUS d 8 -5.473909
58 c MINUS d 9 2.800265
59 c MINUS d 10 2.429154