Home > Mobile >  Pandas - partition a dataframe into two groups with an approximate mean value
Pandas - partition a dataframe into two groups with an approximate mean value

Time:06-01

I want to split all rows into two groups that have similar means.

I have a dataframe of about 50 rows but this could go into several thousands with a column of interest called 'value'.

             value  total   bucket
300048137   3.0741  3.0741  0
352969997   2.1024  5.1765  0
abc13.com   4.5237  9.7002  0
abc7.com    5.8202  15.5204 0
abcnews.go.com  6.7270  22.2474 0
........
www.legacy.com  12.6609 263.0797    1
www.math-aids.com   10.9832 274.0629    1

So far I tried using cumulative sum for which total column was created then I essentially made the split based on where the mid-point of the total column is. Based on this solution.

test['total'] = test['value'].cumsum()
df_sum = test['value'].sum()//2 
test['bucket'] = np.where(test['total'] <= df_sum, 0,1)

If I try to group them and take the average for each group then the difference is quite significant

display(test.groupby('bucket')['value'].mean())

bucket
0     7.456262
1    10.773905

Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn't find a proper way to do it.

CodePudding user response:

I am not sure I understand what you are trying to do, but possibly you want to groupy by quantiles of a column. If so:

test['bucket'] = pd.qcut(test['value'], q=2, labels=False)

which will have bucket=0 for the half of rows with the lesser value values. And 1 for the rest. By tweakign the q parameter you can have as many groups as you want (as long as <= number of rows).

Edit: New attemp, now that I think I understand better your aim:

df = pd.DataFrame( {'value':pd.np.arange(100)})

df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0    49
# 1    50
# Name: value, dtype: int64
​
df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
#group
# 0    49.5
# 1    49.0
# 2    50.0
# Name: value, dtype: float64
  • Related