I have some data where I am trying to calculate the total measurement for all counts and sum of the measurements for counts 2, 3, and 4, for each lot item number of each lot. Ideally, I would have 2 extra columns on the original data that has the total measurement and measurements for counts 2, 3, and 4-- even though those values would be duplicated, they would be represented with each record. Here is an example of the dataset:
Date Sample Type Lot # Lot item # Count Measurement
0 2021-12-05 G ABS123-G 1 1 5.0
1 2021-12-05 G ABS123-G 1 2 3.0
2 2021-12-05 G ABS123-G 1 3 7.0
3 2021-12-05 G ABS123-G 1 4 25.1
4 2021-12-05 G ABS123-G 1 5 66.0
5 2021-12-05 G ABS123-G 1 6 54.0
6 2021-12-05 G ABS123-G 1 7 12.0
7 2021-12-05 G ABS123-G 1 8 0.0
8 2021-12-05 G ABS123-G 1 9 1.0
9 2021-12-05 G ABS123-G 1 10 5.0
10 2021-12-05 G ABS123-G 2 1 2.0
11 2021-12-05 G ABS123-G 2 2 4.0
12 2021-12-05 G ABS123-G 2 3 889.0
13 2021-12-05 G ABS123-G 2 4 12.4
14 2021-12-05 G ABS123-G 2 5 51.4
15 2021-12-05 G ABS123-G 2 6 12.0
16 2021-12-05 G ABS123-G 2 7 14.0
17 2021-12-05 G ABS123-G 2 8 2.0
18 2021-12-05 G ABS123-G 2 9 1.0
19 2021-12-05 G ABS123-G 2 10 0.1
20 2021-12-05 B ABS123-B 1 1 4.0
21 2021-12-05 B ABS123-B 1 2 58.0
22 2021-12-05 B ABS123-B 1 3 123.0
23 2021-12-05 B ABS123-B 1 4 12.5
24 2021-12-05 B ABS123-B 1 5 11.0
25 2021-12-05 B ABS123-B 1 6 135.5
26 2021-12-05 B ABS123-B 1 7 17.0
27 2021-12-05 B ABS123-B 1 8 1.0
28 2021-12-05 B ABS123-B 1 9 5.0
29 2021-12-05 B ABS123-B 1 10 0.3
My approach was to try and filter the counts to 2,3, 4, calculate the sum, and then join the df to the original based on the lot and lot item # and then do something similar with the total. However, I am running into an error when I trying to sum.
df2 = df.loc[(df['Count'] == 2) | (df['Count'] == 3) | (df['Count'] == 4)]
df2['Counts 2,3,4'] = df2.grouby(['Lot #, 'Lot item #'])['Measurement'].sum()
df2
TypeError: incompatible index of inserted column with frame index
The filter worked, but not the second part. First, I don't know what is causing the error, do I need to reset the index? Also, is this the correct approach? Any suggestions are welcomed.
CodePudding user response:
I was able to figure it out because it was a problem with the index. When I just removed the new column name and just ran the groupby, it worked. Then I reset the index to the groupby and I was able merge to the original df with no issues. The same with the total.
df2 = df.loc[(df['Count'] == 2) | (df['Count'] == 3) | (df['Count'] == 4)]
df3 = df2.groupby(['Lot #', 'Lot item #'])['Measurement'].sum()
df3 = df3.reset_index()
joined = pd.merge(df, df3, how='left', left_on=['Lot #', 'Lot item #'], right_on=['Lot #', 'Lot item #'])
I just think there must be a more elegant solution out there than this? But maybe not?
CodePudding user response:
We can use isin
to simplify the equality checks by defining a list of integer values. We can then use join
after groupby sum and specify the columns to join on. Lastly rename
the Series to the new column name:
# Specify columns to groupby and join back on
grp_cols = ['Lot #', 'Lot item #']
joined = df.join(
df[df['Count'].isin([2, 3, 4])] # Values to include
.groupby(grp_cols)['Measurement'].sum() # Take sum per group
.rename('Counts 2,3,4'), # Specify new column name
on=grp_cols,
)
joined
:
Date Sample Type Lot # Lot item # Count Measurement Counts 2,3,4
0 2021-12-05 G ABS123-G 1 1 5.0 35.1
1 2021-12-05 G ABS123-G 1 2 3.0 35.1
2 2021-12-05 G ABS123-G 1 3 7.0 35.1
3 2021-12-05 G ABS123-G 1 4 25.1 35.1
4 2021-12-05 G ABS123-G 1 5 66.0 35.1
5 2021-12-05 G ABS123-G 1 6 54.0 35.1
6 2021-12-05 G ABS123-G 1 7 12.0 35.1
7 2021-12-05 G ABS123-G 1 8 0.0 35.1
8 2021-12-05 G ABS123-G 1 9 1.0 35.1
9 2021-12-05 G ABS123-G 1 10 5.0 35.1
10 2021-12-05 G ABS123-G 2 1 2.0 905.4
11 2021-12-05 G ABS123-G 2 2 4.0 905.4
12 2021-12-05 G ABS123-G 2 3 889.0 905.4
13 2021-12-05 G ABS123-G 2 4 12.4 905.4
14 2021-12-05 G ABS123-G 2 5 51.4 905.4
15 2021-12-05 G ABS123-G 2 6 12.0 905.4
16 2021-12-05 G ABS123-G 2 7 14.0 905.4
17 2021-12-05 G ABS123-G 2 8 2.0 905.4
18 2021-12-05 G ABS123-G 2 9 1.0 905.4
19 2021-12-05 G ABS123-G 2 10 0.1 905.4
20 2021-12-05 B ABS123-B 1 1 4.0 193.5
21 2021-12-05 B ABS123-B 1 2 58.0 193.5
22 2021-12-05 B ABS123-B 1 3 123.0 193.5
23 2021-12-05 B ABS123-B 1 4 12.5 193.5
24 2021-12-05 B ABS123-B 1 5 11.0 193.5
25 2021-12-05 B ABS123-B 1 6 135.5 193.5
26 2021-12-05 B ABS123-B 1 7 17.0 193.5
27 2021-12-05 B ABS123-B 1 8 1.0 193.5
28 2021-12-05 B ABS123-B 1 9 5.0 193.5
29 2021-12-05 B ABS123-B 1 10 0.3 193.5
Sample DataFrame constructor:
import pandas as pd
df = pd.DataFrame({
'Date': pd.to_datetime(
['2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
'2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
'2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
'2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
'2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
'2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05']),
'Sample Type': ['G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G',
'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'B', 'B', 'B', 'B',
'B', 'B', 'B', 'B', 'B', 'B'],
'Lot #': ['ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B',
'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B'],
'Lot item #': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'Count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10],
'Measurement': [5.0, 3.0, 7.0, 25.1, 66.0, 54.0, 12.0, 0.0, 1.0, 5.0, 2.0,
4.0, 889.0, 12.4, 51.4, 12.0, 14.0, 2.0, 1.0, 0.1, 4.0,
58.0, 123.0, 12.5, 11.0, 135.5, 17.0, 1.0, 5.0, 0.3]
})