Running sums in pandas with row specification-CodePudding

I have some data where I am trying to calculate the total measurement for all counts and sum of the measurements for counts 2, 3, and 4, for each lot item number of each lot. Ideally, I would have 2 extra columns on the original data that has the total measurement and measurements for counts 2, 3, and 4-- even though those values would be duplicated, they would be represented with each record. Here is an example of the dataset:

Date    Sample Type Lot #   Lot item #  Count   Measurement
0   2021-12-05  G   ABS123-G    1   1   5.0
1   2021-12-05  G   ABS123-G    1   2   3.0
2   2021-12-05  G   ABS123-G    1   3   7.0
3   2021-12-05  G   ABS123-G    1   4   25.1
4   2021-12-05  G   ABS123-G    1   5   66.0
5   2021-12-05  G   ABS123-G    1   6   54.0
6   2021-12-05  G   ABS123-G    1   7   12.0
7   2021-12-05  G   ABS123-G    1   8   0.0
8   2021-12-05  G   ABS123-G    1   9   1.0
9   2021-12-05  G   ABS123-G    1   10  5.0
10  2021-12-05  G   ABS123-G    2   1   2.0
11  2021-12-05  G   ABS123-G    2   2   4.0
12  2021-12-05  G   ABS123-G    2   3   889.0
13  2021-12-05  G   ABS123-G    2   4   12.4
14  2021-12-05  G   ABS123-G    2   5   51.4
15  2021-12-05  G   ABS123-G    2   6   12.0
16  2021-12-05  G   ABS123-G    2   7   14.0
17  2021-12-05  G   ABS123-G    2   8   2.0
18  2021-12-05  G   ABS123-G    2   9   1.0
19  2021-12-05  G   ABS123-G    2   10  0.1
20  2021-12-05  B   ABS123-B    1   1   4.0
21  2021-12-05  B   ABS123-B    1   2   58.0
22  2021-12-05  B   ABS123-B    1   3   123.0
23  2021-12-05  B   ABS123-B    1   4   12.5
24  2021-12-05  B   ABS123-B    1   5   11.0
25  2021-12-05  B   ABS123-B    1   6   135.5
26  2021-12-05  B   ABS123-B    1   7   17.0
27  2021-12-05  B   ABS123-B    1   8   1.0
28  2021-12-05  B   ABS123-B    1   9   5.0
29  2021-12-05  B   ABS123-B    1   10  0.3

My approach was to try and filter the counts to 2,3, 4, calculate the sum, and then join the df to the original based on the lot and lot item # and then do something similar with the total. However, I am running into an error when I trying to sum.

df2 = df.loc[(df['Count'] == 2) | (df['Count'] == 3) | (df['Count'] == 4)]
df2['Counts 2,3,4'] = df2.grouby(['Lot #, 'Lot item #'])['Measurement'].sum()
df2

TypeError: incompatible index of inserted column with frame index

The filter worked, but not the second part. First, I don't know what is causing the error, do I need to reset the index? Also, is this the correct approach? Any suggestions are welcomed.

CodePudding user response：

I was able to figure it out because it was a problem with the index. When I just removed the new column name and just ran the groupby, it worked. Then I reset the index to the groupby and I was able merge to the original df with no issues. The same with the total.

df2 = df.loc[(df['Count'] == 2) | (df['Count'] == 3) | (df['Count'] == 4)]
df3 = df2.groupby(['Lot #', 'Lot item #'])['Measurement'].sum()
df3 = df3.reset_index()

joined = pd.merge(df, df3, how='left', left_on=['Lot #', 'Lot item #'], right_on=['Lot #', 'Lot item #'])

I just think there must be a more elegant solution out there than this? But maybe not?

CodePudding user response：

We can use isin to simplify the equality checks by defining a list of integer values. We can then use join after groupby sum and specify the columns to join on. Lastly rename the Series to the new column name:

# Specify columns to groupby and join back on
grp_cols = ['Lot #', 'Lot item #']
joined = df.join(
    df[df['Count'].isin([2, 3, 4])]  # Values to include
        .groupby(grp_cols)['Measurement'].sum()  # Take sum per group
        .rename('Counts 2,3,4'),  # Specify new column name
    on=grp_cols,
)

joined:

          Date Sample Type     Lot #  Lot item #  Count  Measurement  Counts 2,3,4
0   2021-12-05           G  ABS123-G           1      1          5.0          35.1
1   2021-12-05           G  ABS123-G           1      2          3.0          35.1
2   2021-12-05           G  ABS123-G           1      3          7.0          35.1
3   2021-12-05           G  ABS123-G           1      4         25.1          35.1
4   2021-12-05           G  ABS123-G           1      5         66.0          35.1
5   2021-12-05           G  ABS123-G           1      6         54.0          35.1
6   2021-12-05           G  ABS123-G           1      7         12.0          35.1
7   2021-12-05           G  ABS123-G           1      8          0.0          35.1
8   2021-12-05           G  ABS123-G           1      9          1.0          35.1
9   2021-12-05           G  ABS123-G           1     10          5.0          35.1
10  2021-12-05           G  ABS123-G           2      1          2.0         905.4
11  2021-12-05           G  ABS123-G           2      2          4.0         905.4
12  2021-12-05           G  ABS123-G           2      3        889.0         905.4
13  2021-12-05           G  ABS123-G           2      4         12.4         905.4
14  2021-12-05           G  ABS123-G           2      5         51.4         905.4
15  2021-12-05           G  ABS123-G           2      6         12.0         905.4
16  2021-12-05           G  ABS123-G           2      7         14.0         905.4
17  2021-12-05           G  ABS123-G           2      8          2.0         905.4
18  2021-12-05           G  ABS123-G           2      9          1.0         905.4
19  2021-12-05           G  ABS123-G           2     10          0.1         905.4
20  2021-12-05           B  ABS123-B           1      1          4.0         193.5
21  2021-12-05           B  ABS123-B           1      2         58.0         193.5
22  2021-12-05           B  ABS123-B           1      3        123.0         193.5
23  2021-12-05           B  ABS123-B           1      4         12.5         193.5
24  2021-12-05           B  ABS123-B           1      5         11.0         193.5
25  2021-12-05           B  ABS123-B           1      6        135.5         193.5
26  2021-12-05           B  ABS123-B           1      7         17.0         193.5
27  2021-12-05           B  ABS123-B           1      8          1.0         193.5
28  2021-12-05           B  ABS123-B           1      9          5.0         193.5
29  2021-12-05           B  ABS123-B           1     10          0.3         193.5

Sample DataFrame constructor:

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(
        ['2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05']),
    'Sample Type': ['G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G',
                    'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'B', 'B', 'B', 'B',
                    'B', 'B', 'B', 'B', 'B', 'B'],
    'Lot #': ['ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B',
              'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B'],
    'Lot item #': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                   1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1,
              2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Measurement': [5.0, 3.0, 7.0, 25.1, 66.0, 54.0, 12.0, 0.0, 1.0, 5.0, 2.0,
                    4.0, 889.0, 12.4, 51.4, 12.0, 14.0, 2.0, 1.0, 0.1, 4.0,
                    58.0, 123.0, 12.5, 11.0, 135.5, 17.0, 1.0, 5.0, 0.3]
})