Home > Software design >  Rescale pandas column based on a value within that column?
Rescale pandas column based on a value within that column?

Time:12-20

I'm trying to normalize a column of data to 1 based on an internal standard control across several batches of data. However, I'm struggling to do this natively in pandas and not splitting things into multiple chunks with for loops.

import pandas as pd

Test_Data = {"Sample":["Control","Test1","Test2","Test3","Test4","Control","Test1","Test2","Test3","Test4"],
           "Batch":["A","A","A","A","A","B","B","B","B","B"],
           "Input":[0.1,0.15,0.08,0.11,0.2,0.15,0.1,0.04,0.11,0.2],
           "Output":[0.1,0.6,0.08,0.22,0.01,0.08,0.22,0.02,0.13,0.004]}

DB = pd.DataFrame(Test_Data)
DB.loc[:,"Ratio"] = DB["Output"]/DB["Input"]

DB:
    Sample Batch  Input  Output     Ratio
0  Control     A   0.10   0.100  1.000000
1    Test1     A   0.15   0.600  4.000000
2    Test2     A   0.08   0.080  1.000000
3    Test3     A   0.11   0.220  2.000000
4    Test4     A   0.20   0.010  0.050000
5  Control     B   0.15   0.080  0.533333
6    Test1     B   0.10   0.220  2.200000
7    Test2     B   0.04   0.020  0.500000
8    Test3     B   0.11   0.130  1.181818
9    Test4     B   0.20   0.004  0.020000

My desired output would be to normalize each ratio per Batch based on the Control sample, effectively multiplying all the Batch "B" samples by 1.875.

DB:
    Sample Batch  Input  Output     Ratio  Norm_Ratio
0  Control     A   0.10   0.100  1.000000    1.000000
1    Test1     A   0.15   0.600  4.000000    4.000000
2    Test2     A   0.08   0.080  1.000000    1.000000
3    Test3     A   0.11   0.220  2.000000    2.000000
4    Test4     A   0.20   0.010  0.050000    0.050000
5  Control     B   0.15   0.080  0.533333    1.000000
6    Test1     B   0.10   0.220  2.200000    4.125000
7    Test2     B   0.04   0.020  0.500000    0.937500
8    Test3     B   0.11   0.130  1.181818    2.215909
9    Test4     B   0.20   0.004  0.020000    0.037500

I can do this by breaking up the dataframe using for loops and manually extracting the "Control" values, but this is slow and messy for large datasets.

CodePudding user response:

Use where and groupby.transform:

DB['Norm_Ratio'] = DB['Ratio'].div(
                    DB['Ratio'].where(DB['Sample'].eq('Control'))
                     .groupby(DB['Batch']).transform('first')
                    )

Output:

    Sample Batch  Input  Output     Ratio  Norm_Ratio
0  Control     A   0.10   0.100  1.000000    1.000000
1    Test1     A   0.15   0.600  4.000000    4.000000
2    Test2     A   0.08   0.080  1.000000    1.000000
3    Test3     A   0.11   0.220  2.000000    2.000000
4    Test4     A   0.20   0.010  0.050000    0.050000
5  Control     B   0.15   0.080  0.533333    1.000000
6    Test1     B   0.10   0.220  2.200000    4.125000
7    Test2     B   0.04   0.020  0.500000    0.937500
8    Test3     B   0.11   0.130  1.181818    2.215909
9    Test4     B   0.20   0.004  0.020000    0.037500
  • Related