Home > Enterprise >  Python DataFrame Multiple Column Groupby With Max Value From First Group
Python DataFrame Multiple Column Groupby With Max Value From First Group

Time:06-10

Data Frame is:

df=pd.DataFrame([[1,'a','c1',30,'s1','e1'],[1,'b','c1',60,'s1','e1'],[1,'b','c2',40,'s1','e1'],[2,'g','c1',40,'s2','e2'],[2,'g','c3',9,'s1','e1'],[3,'k','c2',20,'s1','e1'],[3,'k','c2',69,'s2','e1'],[3,'k','c1',29,'s1','e1'],[3,'f','c3',99,'s2','e1']], columns = ['Lot','Item','Code','Date','Shelf','Emp'])
     Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    1    b   c1    60    s1  e1
2    1    b   c2    40    s1  e1
3    2    g   c1    40    s2  e2
4    2    g   c3     9    s1  e1
5    3    k   c2    20    s1  e1
6    3    k   c2    69    s2  e1
7    3    k   c1    29    s1  e1
8    3    f   c3    99    s2  e1

I need to

  1. Group By Lot and then Item
  2. Get the First Item in each Lot
  3. Finally Get the Max Date for each First Item in Each Lot

The OUTPUT should be:

     Lot Item  Code Date Shelf Emp
0    1    a    c1   30   s1    e1
3    2    g    c2   40   s2    e1
6    3    k    c2   69   s2    e1

Thank you !!!

CodePudding user response:

I have done this using a few intermediary steps (there may be a simpler way, but this is how I would do it).

Firstly, group your df by Lot and return the first item for each lot using df.groupby to create a temporary df:

temp_df1 = df.groupby('Lot', as_index=False).first()[['Lot', 'Item']]

   Lot Item
0    1    a
1    2    g
2    3    k

I then used df.merge() to merge the original df onto temp_df to get only the rows from df that contain the first item for each lot:

temp_df2 = df.merge(temp_df1, on=['Lot', 'Item'], how='inner')

     Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    2    g   c1    40    s2  e2
2    2    g   c3     9    s1  e1
3    3    k   c2    20    s1  e1
4    3    k   c2    69    s2  e1
5    3    k   c1    29    s1  e1

Then you can group by Lot and Item on this data frame to get your desired output

df_out = temp_df2.groupby(['Lot', 'Item'], as_index=False).max()

   Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    2    g   c3    40    s2  e2
2    3    k   c2    69    s2  e1

Full code:

temp_df1 = df.groupby('Lot', as_index=False).first()[['Lot', 'Item']]
temp_df2 = df.merge(temp_df1, on=['Lot', 'Item'], how='inner')
df_out = temp_df2.groupby(['Lot', 'Item'], as_index=False).max()

Or the below without creating temporary dfs:

df_out = (df.merge(df.groupby('Lot', as_index=False).first()[['Lot', 'Item']],
                   on=['Lot', 'Item'], how='inner')
          .groupby(['Lot', 'Item'], as_index=False).max())

CodePudding user response:

Another method:

df1 = df.groupby(['Lot']).first()['Item'].reset_index()
df1['Date'] = [df['Date'].loc[df['Item'] == x].max() for x in df1['Date']]
df1

Result:

index Lot Item Date
0 1 a 30
1 2 g 40
2 3 k 69

Edit: even better using a "mask" and no temp df

m = df.groupby(['Lot']).first()['Item'].tolist()
df[df.Item.isin(m)].groupby(['Lot', 'Item'], as_index=False).max()
index Lot Item Code Date Shelf Emp
0 1 a c1 30 s1 e1
1 2 g c3 40 s2 e2
2 3 k c2 69 s2 e1
  • Related