I need to group a frame by key. For each group there could be :
- one couple of id, where 'max registered' is a unique value I need to keep
- two couples of id : id1-id2 and id2-id1 where I need to keep the max between their 'max registered' or one of them if their 'max registered' are equal and keep only one of the couples (because id1-id2 and id2-id1 should be considered as one couple, because we don't care about the order of the ids in a couple)
- more than two couples of id : it could be combinations of case 1 = one couple, and case 2 = two couples. They need to be treated like case 1 and case 2 inside the same group of key.
Here is the original dataframe :
df = pd.DataFrame({
'first': ['A', 'B', 'A1', 'B1', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'],
'second': ['B', 'A', 'B1', 'A1', 'D', 'C', 'F', 'E', 'H', 'G', 'J', 'L'],
'key': ['AB', 'AB', 'AB', 'AB', 'CD', 'CD', 'EF', 'EF', 'GH', 'GH', 'IJ', 'KL'],
'max registered': [10, 5, 10, 5, 'NaN', 15, 10, 5, 'NaN', 'NaN', 'NaN', 15]
})
df
first second key max registered
0 A B AB 10
1 B A AB 5
2 A1 B1 AB 10
3 B1 A1 AB 5
4 C D CD NaN
5 D C CD 15
6 E F EF 10
7 F E EF 5
8 G H GH NaN
9 H G GH NaN
10 I J IJ NaN
11 K L KL 15
Here is what dataframe should look like once it as been grouped and (here comes my problem) aggregated/filtered/transformed/applied ? I don't know how to do it after grouping my data and what solution I should opt for.
df = pd.DataFrame({
'first': ['A', 'A1', 'D', 'E', 'G', 'I', 'K'],
'second': ['B', 'B1', 'C', 'F', 'H', 'J', 'L'],
'key': ['AB', 'AB', 'CD', 'EF', 'GH', 'IJ', 'KL'],
'max registered': [10, 10, 15, 10, 'NaN', 'NaN', 15]
})
df
first second key max registered
0 A B AB 10
1 A1 B1 AB 10
2 D C CD 15
3 E F EF 10
4 G H GH NaN
5 I J IJ NaN
6 K L KL 15
I'm watching tutorials about groupby() and pandas documentation since 2 days without finding any clue of the logic behind it and the way I should do this. My problem is (as I see it) more complicated and not really related to what's treated in those tutorials (for example this one that I watched several times)
CodePudding user response:
Create ordered group from first
and second
columns. key
is useless here since your want all max for each subgroup (max for (A,B) and max for (A1,B1)) then sort values by max registered
by descending order. Finally group by this virtual groups and keep the first value (the max):
out = df.assign(group=df[['first', 'second']].apply(frozenset, axis=1)) \
.sort_values('max registered', ascending=False) \
.groupby('group').head(1).sort_index()
print(out)
first second key max registered group
0 A B AB 10.0 (A, B)
2 A1 B1 AB 10.0 (B1, A1)
5 D C CD 15.0 (C, D)
6 E F EF 10.0 (E, F)
8 G H GH NaN (G, H)
10 I J IJ NaN (J, I)
11 K L KL 15.0 (K, L)