Calculate the mean values of individual rows, based the value of other columns and subtract from oth-CodePudding

I have the following dataframe:

   Book_No  Replicate    Sample  Smell  Taste  Odour  Volatility                    Notes
0   12, 43          1   control    0.3   10.0     71           1                      NaN
1   12, 43          2   control    0.4    8.0     63           3                      NaN
2   12, 43          3   control    0.1    3.0     22           2                      NaN
3   19, 21          1   control    1.1    2.0     80           3                      NaN
4   19, 21          2   control    0.4    8.0      0           4                      NaN
5   19, 21          3   control    0.9    3.0      4           6                      NaN
6   19, 21          4   control    2.1    6.0     50           4                      NaN
7   11, 22          1   control    3.4    3.0     23           3                      NaN
8   12, 43          1  Sample A    1.1   11.2     75           7                      NaN
9   12, 43          2  Sample A    1.4    3.3     87           6  Temperature was too hot
10  12, 43          3  Sample A    0.7    7.4     91           5                      NaN
11  19, 21          1  Sample B    2.1    3.2     99           7                      NaN
12  19, 21          2  Sample B    2.2   11.3     76           8                      NaN
13  19, 21          3  Sample B    1.9    9.3     89           9     sample spilt by user
14  19, 21          1  Sample C    3.2    4.0    112          10                      NaN
15  19, 21          2  Sample C    2.1    5.0     96          15                      NaN
16  19, 21          3  Sample C    2.7    7.0    105          13             Was too cold
17  11, 22          1  Sample C    2.4    3.0    121          19                      NaN

I'd like to do two separate things. Firstly, I'd like to calculate the mean values for each 'smell', 'volatility', 'taste' and 'odour' columns of the 'Sample' Control, where the 'Book_No' is the same value. Then, subtract those mean values from the individual Sample A, Sample B and Sample C, where the 'Book_No' matches those of the control. The resulting dataframe should look something like this:

   Book_No  Replicate    Sample     Smell  Taste  Odour  Volatility                    Notes
0   12, 43          1   control  0.300000  10.00   71.0        1.00                      NaN
1   12, 43          2   control  0.400000   8.00   63.0        3.00                      NaN
2   12, 43          3   control  0.100000   3.00   22.0        2.00                      NaN
3   19, 21          1   control  1.100000   2.00   80.0        3.00                      NaN
4   19, 21          2   control  0.400000   8.00    0.0        4.00                      NaN
5   19, 21          3   control  0.900000   3.00    4.0        6.00                      NaN
6   19, 21          4   control  2.100000   6.00   50.0        4.00                      NaN
7   11, 22          1   control  3.400000   3.00   23.0        3.00                      NaN
8   12, 43          1  Sample A  0.833333   4.20   23.0        5.00                      NaN
9   12, 43          2  Sample A  1.133333  -3.70   35.0        4.00  Temperature was too hot
10  12, 43          3  Sample A  0.433333   0.40   39.0        3.00                      NaN
11  19, 21          1  Sample B  0.975000  -1.55   65.5        2.75                      NaN
12  19, 21          2  Sample B  1.075000   6.55   42.5        3.75                      NaN
13  19, 21          3  Sample B  0.775000   4.55   55.5        4.75     sample spilt by user
14  19, 21          1  Sample C -0.200000   1.00   89.0        7.00                      NaN
15  19, 21          2  Sample C -1.300000   2.00   73.0       12.00                      NaN
16  19, 21          3  Sample C -0.700000   4.00   82.0       10.00             Was too cold
17  11, 22          1  Sample C -1.000000   0.00   98.0       16.00                      NaN

I've tried the following codes, but neither seems to give me what I need, plus I'd need to copy and paste the code and change the column name for each column I'd like to apply it to:

df['Smell'] = df['Smell'] - df.groupby(['Book_No', 'Sample'])['Smell'].transform('mean')

and I've tried to apply a mask:

mask = df['Book_No'].unique()
df.loc[~mask, 'Smell'] = (df['Smell'] - df['Smell'].where(mask).groupby([df['Book_No'],df['Sample']]).transform('mean'))

Then, separately, I'd like to subtract the control values from the sample values, when the Book_No and replicate values match. The resulting dataframe should look something like this:

   Book_No  Replicate    Sample  Smell  Taste  Odour  Volatility               Unnamed: 7
0   12, 43          1   control    0.3   10.0     71           1                      NaN
1   12, 43          2   control    0.4    8.0     63           3                      NaN
2   12, 43          3   control    0.1    3.0     22           2                      NaN
3   19, 21          1   control    1.1    2.0     80           3                      NaN
4   19, 21          2   control    0.4    8.0      0           4                      NaN
5   19, 21          3   control    0.9    3.0      4           6                      NaN
6   19, 21          4   control    2.1    6.0     50           4                      NaN
7   11, 22          1   control    3.4    3.0     23           3                      NaN
8   12, 43          1  Sample A    0.8    1.2      4           6                      NaN
9   12, 43          2  Sample A    1.0   -4.7     24           3  Temperature was too hot
10  12, 43          3  Sample A    0.6    4.4     69           3                      NaN
11  19, 21          1  Sample B    1.0    1.2     19           4                      NaN
12  19, 21          2  Sample B    1.8    3.3     76           4                      NaN
13  19, 21          3  Sample B    1.0    6.3     85           3     sample spilt by user
14  19, 21          1  Sample C    2.1    2.0     32           7                      NaN
15  19, 21          2  Sample C    1.7   -3.0     96          11                      NaN
16  19, 21          3  Sample C    1.8    4.0    101           7             Was too cold
17  11, 22          1  Sample C   -1.0    0.0     98          16                      NaN

Could anyone kindly offer their assistance to help with these two scenarios?

Thank you in advance for any help

CodePudding user response：

Splitting into different columns and reordering:

# This may be useful to you in the future, plus, ints are better than strings:
df[['Book', 'No']] = df.Book_No.str.split(', ', expand=True).astype(int)
cols = df.columns.tolist()
df = df[cols[-2:]   cols[1:-2]]

You should only focus on one problem at a time in your questions, so I'll help with the first part.

# Set some vars so we don't have to type these over and over:
cols = ['Smell', 'Volatility', 'Taste', 'Odour']
mask = df.Sample.eq('control')
group = ['Book', 'No']

# Find your control values:
ctrl_means = df[mask].groupby(group)[cols].mean()

# Apply your desired change:
df.loc[~mask, cols] = (df[~mask].groupby(group)[cols]
                                .apply(lambda x: x.sub(ctrl_means.loc[x.name])))
print(df)

Output:

    Book  No  Replicate    Sample     Smell  Taste  Odour  Volatility                    Notes
0     12  43          1   control  0.300000  10.00   71.0        1.00                      NaN
1     12  43          2   control  0.400000   8.00   63.0        3.00                      NaN
2     12  43          3   control  0.100000   3.00   22.0        2.00                      NaN
3     19  21          1   control  1.100000   2.00   80.0        3.00                      NaN
4     19  21          2   control  0.400000   8.00    0.0        4.00                      NaN
5     19  21          3   control  0.900000   3.00    4.0        6.00                      NaN
6     19  21          4   control  2.100000   6.00   50.0        4.00                      NaN
7     11  22          1   control  3.400000   3.00   23.0        3.00                      NaN
8     12  43          1  Sample A  0.833333   4.20   23.0        5.00                      NaN
9     12  43          2  Sample A  1.133333  -3.70   35.0        4.00  Temperature was too hot
10    12  43          3  Sample A  0.433333   0.40   39.0        3.00                      NaN
11    19  21          1  Sample B  0.975000  -1.55   65.5        2.75                      NaN
12    19  21          2  Sample B  1.075000   6.55   42.5        3.75                      NaN
13    19  21          3  Sample B  0.775000   4.55   55.5        4.75     sample spilt by user
14    19  21          1  Sample C  2.075000  -0.75   78.5        5.75                      NaN
15    19  21          2  Sample C  0.975000   0.25   62.5       10.75                      NaN
16    19  21          3  Sample C  1.575000   2.25   71.5        8.75             Was too cold
17    11  22          1  Sample C -1.000000   0.00   98.0       16.00                      NaN

CodePudding user response：

First we get the mean of the control samples:

cols = ['Smell', 'Taste', 'Odour', 'Volatility']
control_means = df[df.Sample.eq('control')].groupby(['Book_No'])[cols].mean()

Then subtract it from the remaining samples to get the fixed sample data. To utilize pandas automatic alignment, we need to temporarily set the index:

new_idx = ['Book_No', df.index]
fixed_samples = (df.set_index(new_idx).loc[df.set_index(new_idx).Sample.ne('control'), cols]
                 - control_means).droplevel(0)

Finally simply assign them back into the dataframe:

df.loc[df.Sample.ne('control'), cols] = fixed_samples

Result:

   Book_No  Replicate    Sample     Smell  Taste  Odour  Volatility                    Notes
0   12, 43          1   control  0.300000  10.00   71.0        1.00                      NaN
1   12, 43          2   control  0.400000   8.00   63.0        3.00                      NaN
2   12, 43          3   control  0.100000   3.00   22.0        2.00                      NaN
3   19, 21          1   control  1.100000   2.00   80.0        3.00                      NaN
4   19, 21          2   control  0.400000   8.00    0.0        4.00                      NaN
5   19, 21          3   control  0.900000   3.00    4.0        6.00                      NaN
6   19, 21          4   control  2.100000   6.00   50.0        4.00                      NaN
7   11, 22          1   control  3.400000   3.00   23.0        3.00                      NaN
8   12, 43          1  Sample A  0.833333   4.20   23.0        5.00                      NaN
9   12, 43          2  Sample A  1.133333  -3.70   35.0        4.00  Temperature was too hot
10  12, 43          3  Sample A  0.433333   0.40   39.0        3.00                      NaN
11  19, 21          1  Sample B  0.975000  -1.55   65.5        2.75                      NaN
12  19, 21          2  Sample B  1.075000   6.55   42.5        3.75                      NaN
13  19, 21          3  Sample B  0.775000   4.55   55.5        4.75     sample spilt by user
14  19, 21          1  Sample C  2.075000  -0.75   78.5        5.75                      NaN
15  19, 21          2  Sample C  0.975000   0.25   62.5       10.75                      NaN
16  19, 21          3  Sample C  1.575000   2.25   71.5        8.75             Was too cold
17  11, 22          1  Sample C -1.000000   0.00   98.0       16.00                      NaN

If you want you can squeeze it into a one-liner, but this his hardly comprehensible:

cols = ['Smell', 'Taste', 'Odour', 'Volatility']
new_idx = ['Book_No', df.index]

df.loc[df.Sample.ne('control'), cols] = (
    df.set_index(new_idx).loc[df.set_index(new_idx).Sample.ne('control'), cols]
    - df[df.Sample.eq('control')].groupby(['Book_No'])[cols].mean()
    ).droplevel(0)