Home > Back-end >  Divide element by sum of groupby in dask without setting index for every column
Divide element by sum of groupby in dask without setting index for every column

Time:05-22

I have code implemented in pandas, but am having trouble converting to dask because I need to use set_index(), what is the best work around? Using dask because I need to scale this to much larger dataframes.

I am looking to return a dataframe where each element is divided by the column-wise sum of a group. Example dataframe that looks like this

df = [
    [1,4,2,1],
    [4,4,0,-1],
    [2,3,1,6],
    [-2,1,0,-1],
    [6,-3,-2,-1],
    [1,0,5,5],
]
df = pd.DataFrame(df)
lab_id = ['a','b','a','b','a','c']
df['lab_id'] = lab_id
df

    0    1    2    3    lab_id 
0   1    4    2    1    a
1   4    4    0   -1    b
2   2    3    1    6    a
3  -2    1    0   -1    b
4   6   -3   -2   -1    a
5   1    0    5    5    c

Currently in pandas I do a groupby by sum to return a dataframe:

sum_df = df.groupby('lab_id').sum()
sum_df

       0    1   2   3
lab_id              
a      9    4   1   6
b      2    5   0   -2
c      1    0   5   5

And then I set the index of the original data frame and divide by the sum dataframe:

df.set_index('lab_id')/sum_df


           0    1        2      3
lab_id              
a   0.111111    1.00     2.0    0.166667
a   0.222222    0.75     1.0    1.000000
a   0.666667    -0.75    -2.0   -0.166667
b   2.000000    0.80     NaN    0.500000
b   -1.000000   0.20     NaN    0.500000
c   1.000000    NaN      1.0    1.000000

The main problem is that I am having a huge issue setting index in dask, which explicitly mentions to avoid using set_index() and reset_index() methods. I simply can't find a way around doing so!

I have tried many arcane ways to set index outside of dask such as creating a new dataframe with the index already set and a row of dummy data and iteratively assigning the columns from the old dataframe (this is some of the worst code i've written).

CodePudding user response:

Try with transform

df.loc[:,[0,1,2,3]] = df/df.groupby('lab_id').transform('sum')[[0,1,2,3]]
df
Out[767]: 
          0     1    2         3 lab_id
0  0.111111  1.00  2.0  0.166667      a
1  2.000000  0.80  NaN  0.500000      b
2  0.222222  0.75  1.0  1.000000      a
3 -1.000000  0.20  NaN  0.500000      b
4  0.666667 -0.75 -2.0 -0.166667      a
5  1.000000   NaN  1.0  1.000000      c
  • Related