I have the following dataframe:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.3 10.0 71 1 NaN
1 12, 43 2 control 0.4 8.0 63 3 NaN
2 12, 43 3 control 0.1 3.0 22 2 NaN
3 19, 21 1 control 1.1 2.0 80 3 NaN
4 19, 21 2 control 0.4 8.0 0 4 NaN
5 19, 21 3 control 0.9 3.0 4 6 NaN
6 19, 21 4 control 2.1 6.0 50 4 NaN
7 11, 22 1 control 3.4 3.0 23 3 NaN
8 12, 43 1 Sample A 1.1 11.2 75 7 NaN
9 12, 43 2 Sample A 1.4 3.3 87 6 Temperature was too hot
10 12, 43 3 Sample A 0.7 7.4 91 5 NaN
11 19, 21 1 Sample B 2.1 3.2 99 7 NaN
12 19, 21 2 Sample B 2.2 11.3 76 8 NaN
13 19, 21 3 Sample B 1.9 9.3 89 9 sample spilt by user
14 19, 21 1 Sample C 3.2 4.0 112 10 NaN
15 19, 21 2 Sample C 2.1 5.0 96 15 NaN
16 19, 21 3 Sample C 2.7 7.0 105 13 Was too cold
17 11, 22 1 Sample C 2.4 3.0 121 19 NaN
I'd like to do two separate things. Firstly, I'd like to calculate the mean values for each 'smell', 'volatility', 'taste' and 'odour' columns of the 'Sample' Control, where the 'Book_No' is the same value. Then, subtract those mean values from the individual Sample A, Sample B and Sample C, where the 'Book_No' matches those of the control. The resulting dataframe should look something like this:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12, 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12, 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19, 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19, 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19, 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19, 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11, 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12, 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12, 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12, 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19, 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19, 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19, 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19, 21 1 Sample C -0.200000 1.00 89.0 7.00 NaN
15 19, 21 2 Sample C -1.300000 2.00 73.0 12.00 NaN
16 19, 21 3 Sample C -0.700000 4.00 82.0 10.00 Was too cold
17 11, 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
I've tried the following codes, but neither seems to give me what I need, plus I'd need to copy and paste the code and change the column name for each column I'd like to apply it to:
df['Smell'] = df['Smell'] - df.groupby(['Book_No', 'Sample'])['Smell'].transform('mean')
and I've tried to apply a mask:
mask = df['Book_No'].unique()
df.loc[~mask, 'Smell'] = (df['Smell'] - df['Smell'].where(mask).groupby([df['Book_No'],df['Sample']]).transform('mean'))
Then, separately, I'd like to subtract the control values from the sample values, when the Book_No and replicate values match. The resulting dataframe should look something like this:
Book_No Replicate Sample Smell Taste Odour Volatility Unnamed: 7
0 12, 43 1 control 0.3 10.0 71 1 NaN
1 12, 43 2 control 0.4 8.0 63 3 NaN
2 12, 43 3 control 0.1 3.0 22 2 NaN
3 19, 21 1 control 1.1 2.0 80 3 NaN
4 19, 21 2 control 0.4 8.0 0 4 NaN
5 19, 21 3 control 0.9 3.0 4 6 NaN
6 19, 21 4 control 2.1 6.0 50 4 NaN
7 11, 22 1 control 3.4 3.0 23 3 NaN
8 12, 43 1 Sample A 0.8 1.2 4 6 NaN
9 12, 43 2 Sample A 1.0 -4.7 24 3 Temperature was too hot
10 12, 43 3 Sample A 0.6 4.4 69 3 NaN
11 19, 21 1 Sample B 1.0 1.2 19 4 NaN
12 19, 21 2 Sample B 1.8 3.3 76 4 NaN
13 19, 21 3 Sample B 1.0 6.3 85 3 sample spilt by user
14 19, 21 1 Sample C 2.1 2.0 32 7 NaN
15 19, 21 2 Sample C 1.7 -3.0 96 11 NaN
16 19, 21 3 Sample C 1.8 4.0 101 7 Was too cold
17 11, 22 1 Sample C -1.0 0.0 98 16 NaN
Could anyone kindly offer their assistance to help with these two scenarios?
Thank you in advance for any help
CodePudding user response:
Splitting into different columns and reordering:
# This may be useful to you in the future, plus, ints are better than strings:
df[['Book', 'No']] = df.Book_No.str.split(', ', expand=True).astype(int)
cols = df.columns.tolist()
df = df[cols[-2:] cols[1:-2]]
You should only focus on one problem at a time in your questions, so I'll help with the first part.
# Set some vars so we don't have to type these over and over:
cols = ['Smell', 'Volatility', 'Taste', 'Odour']
mask = df.Sample.eq('control')
group = ['Book', 'No']
# Find your control values:
ctrl_means = df[mask].groupby(group)[cols].mean()
# Apply your desired change:
df.loc[~mask, cols] = (df[~mask].groupby(group)[cols]
.apply(lambda x: x.sub(ctrl_means.loc[x.name])))
print(df)
Output:
Book No Replicate Sample Smell Taste Odour Volatility Notes
0 12 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19 21 1 Sample C 2.075000 -0.75 78.5 5.75 NaN
15 19 21 2 Sample C 0.975000 0.25 62.5 10.75 NaN
16 19 21 3 Sample C 1.575000 2.25 71.5 8.75 Was too cold
17 11 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
CodePudding user response:
First we get the mean of the control samples:
cols = ['Smell', 'Taste', 'Odour', 'Volatility']
control_means = df[df.Sample.eq('control')].groupby(['Book_No'])[cols].mean()
Then subtract it from the remaining samples to get the fixed sample data. To utilize pandas automatic alignment, we need to temporarily set the index:
new_idx = ['Book_No', df.index]
fixed_samples = (df.set_index(new_idx).loc[df.set_index(new_idx).Sample.ne('control'), cols]
- control_means).droplevel(0)
Finally simply assign them back into the dataframe:
df.loc[df.Sample.ne('control'), cols] = fixed_samples
Result:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12, 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12, 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19, 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19, 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19, 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19, 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11, 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12, 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12, 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12, 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19, 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19, 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19, 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19, 21 1 Sample C 2.075000 -0.75 78.5 5.75 NaN
15 19, 21 2 Sample C 0.975000 0.25 62.5 10.75 NaN
16 19, 21 3 Sample C 1.575000 2.25 71.5 8.75 Was too cold
17 11, 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
If you want you can squeeze it into a one-liner, but this his hardly comprehensible:
cols = ['Smell', 'Taste', 'Odour', 'Volatility']
new_idx = ['Book_No', df.index]
df.loc[df.Sample.ne('control'), cols] = (
df.set_index(new_idx).loc[df.set_index(new_idx).Sample.ne('control'), cols]
- df[df.Sample.eq('control')].groupby(['Book_No'])[cols].mean()
).droplevel(0)