I want to split all rows into two groups that have similar means.
I have a dataframe of about 50 rows (though it could grow to several thousand) with a column of interest called 'value'.
                     value     total  bucket
300048137           3.0741    3.0741       0
352969997           2.1024    5.1765       0
abc13.com           4.5237    9.7002       0
abc7.com            5.8202   15.5204       0
abcnews.go.com      6.7270   22.2474       0
........
www.legacy.com     12.6609  263.0797       1
www.math-aids.com  10.9832  274.0629       1
So far I tried a cumulative sum (which is what the total column holds) and then split the rows at the point where the total reaches half of the overall sum, based on an existing solution.
import numpy as np

# running total of 'value', then split where it passes half of the grand total
test['total'] = test['value'].cumsum()
df_sum = test['value'].sum() // 2
test['bucket'] = np.where(test['total'] <= df_sum, 0, 1)
If I group the rows and take the average of each group, the difference between the two means is quite significant:
display(test.groupby('bucket')['value'].mean())
bucket
0 7.456262
1 10.773905
Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn't find a proper way to do it.
CodePudding user response:
I am not sure I understand what you are trying to do, but possibly you want to group by quantiles of a column. If so:
test['bucket'] = pd.qcut(test['value'], q=2, labels=False)
which will set bucket=0 for the half of the rows with the smaller values, and 1 for the rest. By tweaking the q parameter you can have as many groups as you want (as long as it is <= the number of rows).
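For example (keeping the same test dataframe and an illustrative bucket4 column name), q=4 would give four roughly equal-sized buckets:

# hypothetical: split into quartiles instead of halves
test['bucket4'] = pd.qcut(test['value'], q=4, labels=False)
test.groupby('bucket4')['value'].mean()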
Edit: new attempt, now that I think I understand your aim better:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(100)})
# alternate consecutive values between the two groups
# (the values here are already sorted, so argsort matches the sort rank)
df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0 49
# 1 50
# Name: value, dtype: int64
df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
#group
# 0 49.5
# 1 49.0
# 2 50.0
# Name: value, dtype: float64
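If your real 'value' column is not already sorted (as in the sample dataframe), the same alternating idea can be expressed with rank instead of argsort; this is just a sketch under that assumption, applied to the test dataframe from the question:

# sketch: walk the values in increasing order and alternate them between two buckets,
# so each bucket gets every other value and the two means stay close
test['bucket'] = test['value'].rank(method='first').astype(int).mod(2)
test.groupby('bucket')['value'].mean()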