Pandas: Groupby with group size limited, based on rank of items-CodePudding

I'm trying to find a way to group the following DataFrame such:

I df.groupby("Strike"), but limit the groupsize to 2.
In the case of groupsize > 2 (as in the example below), I would want the largest ( ) grouped with the largest (-) based on the Size column, and so on until there are no more pairs left.
Lastly, for the grouping, each ( ) needs an opposing (-). [preventing index 3 & 6 from being grouped]

Note:

The order may not be as clean as in the example below, so relative positioning won't work as an approach.
The below example is a subset of a larger dataframe fwiw.

Index	Size	Strike	Structure
0	170	1.8	( )
1	140	2.8	( )
2	62	2.8	(-)
3	170	1.82	( )
4	170	2.82	( )
5	120	2.82	(-)
6	170	1.82	( )
7	140	2.82	( )
8	62	2.82	(-)

Desired output:

Index	Size	Strike	Structure	Group
0	170	1.8	( )	1
1	140	2.8	( )	2
2	62	2.8	(-)	2
3	170	1.82	( )	3
4	170	2.82	( )	5
5	120	2.82	(-)	5
6	170	1.82	( )	4
7	140	2.82	( )	6
8	62	2.82	(-)	6

I was trying to use ["Size"].rank() as my approach, but I got stuck trying to then re-assign group numbers based on the descending "Size"

Thx!

CodePudding user response：

This ended up being somewhat complicated, so I am not entirely sure the logic is correct, but here is an attempt. I will break it down with explanation, but the full code is in a block below.

Note this approach depends on the column Index being present - if it is actually the DataFrame index, you will need to do something like df.reset_index().

First, separate the data into the "plus" and "minus" segments:

plus = df[df['Structure'] == '( )'].copy()
minus = df[df['Structure'] == '(-)'].copy()

Second, assign a descending grouped rank to each, based on the Strike column:

plus['Rank'] = plus.groupby("Strike")["Size"].rank("first", ascending=False)
minus['Rank'] = minus.groupby("Strike")["Size"].rank("first", ascending=False)

Third, merge the two subsets, based on Strike and Rank.

cols= ['Strike', 'Rank', 'Index']
merge = plus[cols].merge(minus[cols], on=['Strike', 'Rank'], how='outer')

The merge needs to be an outer merge, such that the rows from both plus and minus are all included. This should pair items with the same Strike and Rank, and all pairs will be a ( ) and (-) observation paired together. Observations without a pair will still be included, but will have NaN values for their missing counterpart. The Index column is included in the merge, as this will be necessary assigning the group labels back to the original DataFrame.

It is probably helpful to show what this merge looks like:

   Strike  Rank  Index_x  Index_y
0    1.80   1.0        0      NaN
1    2.80   1.0        1      2.0
2    1.82   1.0        3      NaN
3    2.82   1.0        4      5.0
4    1.82   2.0        6      NaN
5    2.82   2.0        7      8.0

Here, the index (i.e. merge.index, the pandas DataFrame index object) has the group labels. The Index_x and Index_y are products of the merge, but they provide a linking back to the rows of the original DataFrame. The Strike and Rank columns show what each group is capturing.

Fourth, create a Series for linking the Index column values to their group labels. This concatenates all the values from Index_x and Index_y, and takes the corresponding group assignment based on the merge.index.

mapper = pd.concat([merge['Index_x'], merge['Index_y']]).dropna()
mapper = pd.Series(mapper.index, mapper)
df['Group'] = df['Index'].map(mapper)   1

Here is the final result:

>>> df 
   Index  Size  Strike Structure  Group
0      0   170    1.80       ( )      1
1      1   140    2.80       ( )      2
2      2    62    2.80       (-)      2
3      3   170    1.82       ( )      3
4      4   170    2.82       ( )      4
5      5   120    2.82       (-)      4
6      6   170    1.82       ( )      5
7      7   140    2.82       ( )      6
8      8    62    2.82       (-)      6

There is a slight difference from your expected output in the order of group labels - but the group assignments are the same. There may be an additional sort which can be added to get your exact output, but for now I'm assuming that's not required.

Thanks for the challenging question!

Full code:

plus = df[df['Structure'] == '( )'].copy()
minus = df[df['Structure'] == '(-)'].copy()

plus['Rank'] = plus.groupby("Strike")["Size"].rank("first", ascending=False)
minus['Rank'] = minus.groupby("Strike")["Size"].rank("first", ascending=False)

cols= ['Strike', 'Rank', 'Index']
merge = plus[cols].merge(minus[cols], on=['Strike', 'Rank'], how='outer')

mapper = pd.concat([merge['Index_x'], merge['Index_y']]).dropna()
mapper = pd.Series(data=mapper.index, index=mapper)
df['Group'] = df['Index'].map(mapper)   1