I have a dataframe subdata
:
PageId VolumePred ConversionPred OSBrowser
1005581 8.0 7.0 (11, 16)
1016529 175.0 85.0 (11, 16)
1016529 4.0 4.0 (11, 16)
1016529 4.0 4.0 (12, 14)
1016529 29.0 19.0 (12, 16)
I try to select only the PageId
that the cumulated sum of VolumePred
<450. We should select at first pageId
that has the bigger value of ConversionPred
.
For this, I started to sort the dataframe by ConversionPred
:
subdata.groupby('OSBrowser')
subdata.sort_values(by='ConversionPred', ascending=False)
Then, I calculated the sum of ConversionPred
and VolumePred
for each group of OSBrowser:
h=subdata.groupby(['PageId', 'OSBrowser'])[['ConversionPred', 'VolumePred']].agg('sum')
Then:
h=h.sort_values(['ConversionPred'],ascending=False).groupby('OSBrowser')
df = h[h['VolumePred'].cumsum() < 450]
But I got this error :
KeyError: 'Columns not found: False, True
The expected output :
PageId VolumePred ConversionPred OSBrowser
1005581 8.0 7.0 (11, 16)
1016529 179.0 89.0 (11, 16)
1016529 4.0 4.0 (12, 14)
1016529 29.0 19.0 (12, 16)
So we at first select this:
1016529 179.0 89.0 (11, 16)
Then:
1005581 8.0 7.0 (11, 16)
because the sum of volume : 179 8 is less than 450, so they will be selected
CodePudding user response:
If I understand correctly, try:
h = subdata.groupby(["PageId", "OSBrowser"]).sum().reset_index().sort_values(["OSBrowser","ConversionPred"],ascending=[True,False])
output = h[h.groupby("OSBrowser")["VolumePred"].transform("cumsum").lt(450)].reset_index(drop=True)
>>> output
PageId OSBrowser VolumePred ConversionPred
0 1016529 (11, 16) 179.0 89.0
1 1005581 (11, 16) 8.0 7.0
2 1016529 (12, 14) 4.0 4.0
3 1016529 (12, 16) 29.0 19.0