Python DataFrame Multiple Column Groupby With Max Value From First Group-CodePudding

Data Frame is:

df=pd.DataFrame([[1,'a','c1',30,'s1','e1'],[1,'b','c1',60,'s1','e1'],[1,'b','c2',40,'s1','e1'],[2,'g','c1',40,'s2','e2'],[2,'g','c3',9,'s1','e1'],[3,'k','c2',20,'s1','e1'],[3,'k','c2',69,'s2','e1'],[3,'k','c1',29,'s1','e1'],[3,'f','c3',99,'s2','e1']], columns = ['Lot','Item','Code','Date','Shelf','Emp'])

     Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    1    b   c1    60    s1  e1
2    1    b   c2    40    s1  e1
3    2    g   c1    40    s2  e2
4    2    g   c3     9    s1  e1
5    3    k   c2    20    s1  e1
6    3    k   c2    69    s2  e1
7    3    k   c1    29    s1  e1
8    3    f   c3    99    s2  e1

I need to

Group By Lot and then Item
Get the First Item in each Lot
Finally Get the Max Date for each First Item in Each Lot

The OUTPUT should be:

     Lot Item  Code Date Shelf Emp
0    1    a    c1   30   s1    e1
3    2    g    c2   40   s2    e1
6    3    k    c2   69   s2    e1

Thank you !!!

CodePudding user response：

I have done this using a few intermediary steps (there may be a simpler way, but this is how I would do it).

Firstly, group your df by Lot and return the first item for each lot using df.groupby to create a temporary df:

temp_df1 = df.groupby('Lot', as_index=False).first()[['Lot', 'Item']]

   Lot Item
0    1    a
1    2    g
2    3    k

I then used df.merge() to merge the original df onto temp_df to get only the rows from df that contain the first item for each lot:

temp_df2 = df.merge(temp_df1, on=['Lot', 'Item'], how='inner')

     Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    2    g   c1    40    s2  e2
2    2    g   c3     9    s1  e1
3    3    k   c2    20    s1  e1
4    3    k   c2    69    s2  e1
5    3    k   c1    29    s1  e1

Then you can group by Lot and Item on this data frame to get your desired output

df_out = temp_df2.groupby(['Lot', 'Item'], as_index=False).max()

   Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    2    g   c3    40    s2  e2
2    3    k   c2    69    s2  e1

Full code:

temp_df1 = df.groupby('Lot', as_index=False).first()[['Lot', 'Item']]
temp_df2 = df.merge(temp_df1, on=['Lot', 'Item'], how='inner')
df_out = temp_df2.groupby(['Lot', 'Item'], as_index=False).max()

Or the below without creating temporary dfs:

df_out = (df.merge(df.groupby('Lot', as_index=False).first()[['Lot', 'Item']],
                   on=['Lot', 'Item'], how='inner')
          .groupby(['Lot', 'Item'], as_index=False).max())

CodePudding user response：

Another method:

df1 = df.groupby(['Lot']).first()['Item'].reset_index()
df1['Date'] = [df['Date'].loc[df['Item'] == x].max() for x in df1['Date']]
df1

Result:

index	Lot	Item	Date
0	1	a	30
1	2	g	40
2	3	k	69

Edit: even better using a "mask" and no temp df

m = df.groupby(['Lot']).first()['Item'].tolist()
df[df.Item.isin(m)].groupby(['Lot', 'Item'], as_index=False).max()

index	Lot	Item	Code	Date	Shelf	Emp
0	1	a	c1	30	s1	e1
1	2	g	c3	40	s2	e2
2	3	k	c2	69	s2	e1