I have code implemented in pandas, but am having trouble converting to dask because I need to use set_index()
, what is the best work around? Using dask because I need to scale this to much larger dataframes.
I am looking to return a dataframe where each element is divided by the column-wise sum of a group. Example dataframe that looks like this
df = [
[1,4,2,1],
[4,4,0,-1],
[2,3,1,6],
[-2,1,0,-1],
[6,-3,-2,-1],
[1,0,5,5],
]
df = pd.DataFrame(df)
lab_id = ['a','b','a','b','a','c']
df['lab_id'] = lab_id
df
0 1 2 3 lab_id
0 1 4 2 1 a
1 4 4 0 -1 b
2 2 3 1 6 a
3 -2 1 0 -1 b
4 6 -3 -2 -1 a
5 1 0 5 5 c
Currently in pandas I do a groupby by sum to return a dataframe:
sum_df = df.groupby('lab_id').sum()
sum_df
0 1 2 3
lab_id
a 9 4 1 6
b 2 5 0 -2
c 1 0 5 5
And then I set the index of the original data frame and divide by the sum dataframe:
df.set_index('lab_id')/sum_df
0 1 2 3
lab_id
a 0.111111 1.00 2.0 0.166667
a 0.222222 0.75 1.0 1.000000
a 0.666667 -0.75 -2.0 -0.166667
b 2.000000 0.80 NaN 0.500000
b -1.000000 0.20 NaN 0.500000
c 1.000000 NaN 1.0 1.000000
The main problem is that I am having a huge issue setting index in dask, which explicitly mentions to avoid using set_index()
and reset_index()
methods. I simply can't find a way around doing so!
I have tried many arcane ways to set index outside of dask such as creating a new dataframe with the index already set and a row of dummy data and iteratively assigning the columns from the old dataframe (this is some of the worst code i've written).
CodePudding user response:
Try with transform
df.loc[:,[0,1,2,3]] = df/df.groupby('lab_id').transform('sum')[[0,1,2,3]]
df
Out[767]:
0 1 2 3 lab_id
0 0.111111 1.00 2.0 0.166667 a
1 2.000000 0.80 NaN 0.500000 b
2 0.222222 0.75 1.0 1.000000 a
3 -1.000000 0.20 NaN 0.500000 b
4 0.666667 -0.75 -2.0 -0.166667 a
5 1.000000 NaN 1.0 1.000000 c