groupby with str.contains in different column-CodePudding

df_batches = pd.DataFrame({"Item":['Can','Liquid','Label','Liquid2', 'Can', 'Sugar'],
                           "Unit_cost": [.14, .45, .17, 1, .16, 23],
                           "Product":['Prod1', 'Prod1', 'Prod1', 'Prod1', 'Prod1', 'Prod1'],
                           "Batch":['5,100','5,100','5,100','5,101','5,101','5,101'], 
                           "Total_cost": [452, 789.34, 11115.28, 3220.98, 3542, 512.34],
                           "Year": ['2019', '2019', '2019', '2019', '2019', '2019'],
                          })

I would like to create a df using these rules:

If any item in the batch str.contains('Label'), then get those batches in new groupby df.

Tried this:

label = df['Item'].str.contains('Label')
label_df = df[label].groupby(['Product', 'Batch', 'Item', 'Year'])[['Total_cost']].sum().reset_index()

But this only gets the str.contains('Label') rows.

I am assuming I need to use .transform() somewhere, but can't figure it out.

Expected output would be this:

df_output = pd.DataFrame({"Item":['Can','Liquid','Label'],
                           "Unit_cost": [.14, .45, .17],
                           "Product":['Prod1', 'Prod1', 'Prod1'],
                           "Batch":['5,100','5,100','5,100',], 
                           "Total_cost": [452, 789.34, 11115.28],
                           "Year": ['2019', '2019', '2019'],
                          })
label_df = df_output.groupby(['Product','Batch', 'Year', 'Item'])[['Total_cost']].sum().reset_index()

Basically, if the 'Batch' doesn't have and item containing label it is filtered out.

CodePudding user response：

IIUC you want to retrieve the Batch number:

s = df.loc[df['Item'].str.contains('Label'), "Batch"].iat[0]

print (df.loc[df["Batch"]==s])

     Item  Unit_cost Product  Batch  Total_cost  Year
0     Can       0.14   Prod1  5,100      452.00  2019
1  Liquid       0.45   Prod1  5,100      789.34  2019
2   Label       0.17   Prod1  5,100    11115.28  2019

If you have more than 1 Label:

s = df.loc[df['Item'].str.contains('Label'), "Batch"]

print (df.loc[df["Batch"].isin(s)])