select item with bigger value of cumulated sum-CodePudding

I have a dataframe subdata:

PageId  VolumePred  ConversionPred  OSBrowser
1005581 8.0               7.0       (11, 16)
1016529 175.0            85.0       (11, 16)
1016529 4.0               4.0       (11, 16)
1016529 4.0               4.0       (12, 14)
1016529 29.0             19.0       (12, 16)

I try to select only the PageId that the cumulated sum of VolumePred <450. We should select at first pageId that has the bigger value of ConversionPred.

For this, I started to sort the dataframe by ConversionPred:

subdata.groupby('OSBrowser')
subdata.sort_values(by='ConversionPred', ascending=False)

Then, I calculated the sum of ConversionPred and VolumePred for each group of OSBrowser:

h=subdata.groupby(['PageId', 'OSBrowser'])[['ConversionPred', 'VolumePred']].agg('sum')

Then:

h=h.sort_values(['ConversionPred'],ascending=False).groupby('OSBrowser')
df = h[h['VolumePred'].cumsum() < 450]

But I got this error :

KeyError: 'Columns not found: False, True

The expected output :

PageId  VolumePred  ConversionPred  OSBrowser
1005581 8.0               7.0       (11, 16)
1016529 179.0            89.0       (11, 16)
1016529 4.0               4.0       (12, 14)
1016529 29.0             19.0       (12, 16)

So we at first select this:

1016529 179.0            89.0       (11, 16)

Then:

1005581 8.0               7.0       (11, 16)

because the sum of volume : 179 8 is less than 450, so they will be selected

CodePudding user response：

If I understand correctly, try:

h = subdata.groupby(["PageId", "OSBrowser"]).sum().reset_index().sort_values(["OSBrowser","ConversionPred"],ascending=[True,False])
output = h[h.groupby("OSBrowser")["VolumePred"].transform("cumsum").lt(450)].reset_index(drop=True)

>>> output
    PageId OSBrowser  VolumePred  ConversionPred
0  1016529  (11, 16)       179.0            89.0
1  1005581  (11, 16)         8.0             7.0
2  1016529  (12, 14)         4.0             4.0
3  1016529  (12, 16)        29.0            19.0