I'm trying to find a way to group the following DataFrame such:
- I
df.groupby("Strike")
, but limit the groupsize to 2. - In the case of groupsize > 2 (as in the example below), I would want the largest ( ) grouped with the largest (-) based on the Size column, and so on until there are no more pairs left.
- Lastly, for the grouping, each ( ) needs an opposing (-). [preventing index 3 & 6 from being grouped]
Note:
- The order may not be as clean as in the example below, so relative positioning won't work as an approach.
- The below example is a subset of a larger dataframe fwiw.
Index | Size | Strike | Structure |
---|---|---|---|
0 | 170 | 1.8 | ( ) |
1 | 140 | 2.8 | ( ) |
2 | 62 | 2.8 | (-) |
3 | 170 | 1.82 | ( ) |
4 | 170 | 2.82 | ( ) |
5 | 120 | 2.82 | (-) |
6 | 170 | 1.82 | ( ) |
7 | 140 | 2.82 | ( ) |
8 | 62 | 2.82 | (-) |
Desired output:
Index | Size | Strike | Structure | Group |
---|---|---|---|---|
0 | 170 | 1.8 | ( ) | 1 |
1 | 140 | 2.8 | ( ) | 2 |
2 | 62 | 2.8 | (-) | 2 |
3 | 170 | 1.82 | ( ) | 3 |
4 | 170 | 2.82 | ( ) | 5 |
5 | 120 | 2.82 | (-) | 5 |
6 | 170 | 1.82 | ( ) | 4 |
7 | 140 | 2.82 | ( ) | 6 |
8 | 62 | 2.82 | (-) | 6 |
I was trying to use ["Size"].rank() as my approach, but I got stuck trying to then re-assign group numbers based on the descending "Size"
Thx!
CodePudding user response:
This ended up being somewhat complicated, so I am not entirely sure the logic is correct, but here is an attempt. I will break it down with explanation, but the full code is in a block below.
Note this approach depends on the column Index
being present - if it is actually the DataFrame index, you will need to do something like df.reset_index()
.
First, separate the data into the "plus" and "minus" segments:
plus = df[df['Structure'] == '( )'].copy()
minus = df[df['Structure'] == '(-)'].copy()
Second, assign a descending grouped rank to each, based on the Strike
column:
plus['Rank'] = plus.groupby("Strike")["Size"].rank("first", ascending=False)
minus['Rank'] = minus.groupby("Strike")["Size"].rank("first", ascending=False)
Third, merge the two subsets, based on Strike
and Rank
.
cols= ['Strike', 'Rank', 'Index']
merge = plus[cols].merge(minus[cols], on=['Strike', 'Rank'], how='outer')
The merge needs to be an outer merge, such that the rows from both plus
and minus
are all included. This should pair items with the same Strike
and Rank
, and all pairs will be a ( )
and (-)
observation paired together. Observations without a pair will still be included, but will have NaN
values for their missing counterpart. The Index
column is included in the merge, as this will be necessary assigning the group labels back to the original DataFrame.
It is probably helpful to show what this merge
looks like:
Strike Rank Index_x Index_y
0 1.80 1.0 0 NaN
1 2.80 1.0 1 2.0
2 1.82 1.0 3 NaN
3 2.82 1.0 4 5.0
4 1.82 2.0 6 NaN
5 2.82 2.0 7 8.0
Here, the index (i.e. merge.index
, the pandas DataFrame index object) has the group labels. The Index_x
and Index_y
are products of the merge, but they provide a linking back to the rows of the original DataFrame. The Strike
and Rank
columns show what each group is capturing.
Fourth, create a Series for linking the Index
column values to their group labels. This concatenates all the values from Index_x
and Index_y
, and takes the corresponding group assignment based on the merge.index
.
mapper = pd.concat([merge['Index_x'], merge['Index_y']]).dropna()
mapper = pd.Series(mapper.index, mapper)
df['Group'] = df['Index'].map(mapper) 1
Here is the final result:
>>> df
Index Size Strike Structure Group
0 0 170 1.80 ( ) 1
1 1 140 2.80 ( ) 2
2 2 62 2.80 (-) 2
3 3 170 1.82 ( ) 3
4 4 170 2.82 ( ) 4
5 5 120 2.82 (-) 4
6 6 170 1.82 ( ) 5
7 7 140 2.82 ( ) 6
8 8 62 2.82 (-) 6
There is a slight difference from your expected output in the order of group labels - but the group assignments are the same. There may be an additional sort which can be added to get your exact output, but for now I'm assuming that's not required.
Thanks for the challenging question!
Full code:
plus = df[df['Structure'] == '( )'].copy()
minus = df[df['Structure'] == '(-)'].copy()
plus['Rank'] = plus.groupby("Strike")["Size"].rank("first", ascending=False)
minus['Rank'] = minus.groupby("Strike")["Size"].rank("first", ascending=False)
cols= ['Strike', 'Rank', 'Index']
merge = plus[cols].merge(minus[cols], on=['Strike', 'Rank'], how='outer')
mapper = pd.concat([merge['Index_x'], merge['Index_y']]).dropna()
mapper = pd.Series(data=mapper.index, index=mapper)
df['Group'] = df['Index'].map(mapper) 1