Pandas group by and choose all rows except last one in group

Time:03-08

I have a pandas df as follows:

MATERIAL    DATE         HIGH    LOW
AAA       2022-01-01     10      0
AAA       2022-01-02     0       0
AAA       2022-01-03     5       2
BBB       2022-01-01     0       0
BBB       2022-01-02     10      5
BBB       2022-01-03     8       4

I want to group by MATERIAL, sort by DATE, and keep every row except the last one in each group. The expected result is:

MATERIAL    DATE         HIGH    LOW
AAA       2022-01-01     10      0
AAA       2022-01-02     0       0
BBB       2022-01-01     0       0
BBB       2022-01-02     10      5

I have tried df.sort_values('DATE').groupby('MATERIAL').head(-1), but this returns an empty df. The DATE column holds pandas datetime values. Thanks!
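
A note on the attempt above: as far as I recall, a negative n in GroupBy.head is supported from pandas 1.4 onward, where it selects all but the last |n| rows of each group. Earlier versions returned an empty frame, which matches the result described. A minimal sketch, assuming pandas 1.4 or later and with the sample data rebuilt by hand:

```python
import pandas as pd

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "MATERIAL": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "DATE": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"] * 2),
    "HIGH": [10, 0, 5, 0, 10, 8],
    "LOW": [0, 0, 2, 0, 5, 4],
})

# Assumes pandas >= 1.4: head(-1) keeps every row of a group except its last one.
result = df.sort_values(["MATERIAL", "DATE"]).groupby("MATERIAL").head(-1)
print(result)
```

On older versions, the answers below avoid the negative argument entirely.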

CodePudding user response:

Use Series.duplicated with keep='last', which flags every occurrence of a value except its last one, to select all rows except the last in each group:

df = df.sort_values(['MATERIAL','DATE'])
df = df[df['MATERIAL'].duplicated(keep='last')]
print (df)
  MATERIAL        DATE  HIGH  LOW
0      AAA  2022-01-01    10    0
1      AAA  2022-01-02     0    0
3      BBB  2022-01-01     0    0
4      BBB  2022-01-02    10    5
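
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The frame is rebuilt by hand from the question, and the names df_sorted, mask and result are only illustrative:

```python
import pandas as pd

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "MATERIAL": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "DATE": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"] * 2),
    "HIGH": [10, 0, 5, 0, 10, 8],
    "LOW": [0, 0, 2, 0, 5, 4],
})

# Sort so that the last occurrence of each MATERIAL is its latest DATE.
df_sorted = df.sort_values(["MATERIAL", "DATE"])

# duplicated(keep="last") is True for every occurrence of a value
# except the last one, so the mask drops one row per group.
mask = df_sorted["MATERIAL"].duplicated(keep="last")
result = df_sorted[mask]
print(result)
```

Note that a MATERIAL with only one row is dropped completely, since its single row is also its last one.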

A groupby solution is also possible: use GroupBy.cumcount with a descending count and keep all rows whose count is not 0:

df = df.sort_values(['MATERIAL','DATE'])
df = df[df.groupby('MATERIAL').cumcount(ascending=False).ne(0)]
print (df)
  MATERIAL        DATE  HIGH  LOW
0      AAA  2022-01-01    10    0
1      AAA  2022-01-02     0    0
3      BBB  2022-01-01     0    0
4      BBB  2022-01-02    10    5
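
The descending count also generalizes to dropping the last n rows of each group. A rough sketch of that, where drop_last_n is a hypothetical helper name and not a pandas function:

```python
import pandas as pd

def drop_last_n(frame, key, order, n=1):
    """Return frame without the last n rows of each key group (hypothetical helper)."""
    ordered = frame.sort_values([key, order])
    # cumcount(ascending=False) numbers the rows of each group
    # from len(group) - 1 down to 0, so the last row gets 0.
    from_end = ordered.groupby(key).cumcount(ascending=False)
    return ordered[from_end.ge(n)]

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "MATERIAL": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "DATE": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"] * 2),
    "HIGH": [10, 0, 5, 0, 10, 8],
    "LOW": [0, 0, 2, 0, 5, 4],
})

all_but_last = drop_last_n(df, "MATERIAL", "DATE")
all_but_last_two = drop_last_n(df, "MATERIAL", "DATE", n=2)
print(all_but_last)
print(all_but_last_two)
```

With n=1 this is the same filter as .ne(0) above, because the count is never negative.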

CodePudding user response:

You could use:

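(df.sort_values('DATE')  # sort first so the last row of each group is the latest date
   .groupby('MATERIAL', as_index=False, group_keys=False)
   .apply(lambda d: d.iloc[:len(d)-1])  # keep all but the last row of each group
)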

output:

  MATERIAL        DATE  HIGH  LOW
0      AAA  2022-01-01    10    0
1      AAA  2022-01-02     0    0
3      BBB  2022-01-01     0    0
4      BBB  2022-01-02    10    5

CodePudding user response:

Another way is to sort by dates first, then group and take every row except the last one using indexing:

>>> df.sort_values("DATE").groupby("MATERIAL").apply(lambda group_df: group_df.iloc[:-1])
           MATERIAL        DATE  HIGH  LOW
MATERIAL                                  
AAA      0      AAA  2022-01-01    10    0
         1      AAA  2022-01-02     0    0
BBB      3      BBB  2022-01-01     0    0
         4      BBB  2022-01-02    10    5

CodePudding user response:

Another way is to use groupby transform with nth and -1 to get the last DATE of each group, compare it with the DATE column, and select only the rows that do not match:

df = df.sort_values(['MATERIAL','DATE'])
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('nth',-1))
out = df[c].copy()

print(out)

  MATERIAL        DATE  HIGH  LOW
0      AAA  2022-01-01    10    0
1      AAA  2022-01-02     0    0
3      BBB  2022-01-01     0    0
4      BBB  2022-01-02    10    5

Side note: since you have a date column, you can also use transform with max or last. That only lets you exclude the last row, though; to exclude, say, the second-to-last row instead, you would need nth as shown above:

c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('max'))
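
One thing to keep in mind with this comparison is that it works on values, not positions: if several rows in a group share the latest DATE, all of them are removed. A small sketch to illustrate, with the sample data rebuilt by hand and df_ties as a made-up variant containing a repeated last date:

```python
import pandas as pd

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "MATERIAL": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "DATE": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"] * 2),
    "HIGH": [10, 0, 5, 0, 10, 8],
    "LOW": [0, 0, 2, 0, 5, 4],
})

# Broadcast the latest DATE of each group back to every row, then keep the rest.
last_date = df.groupby("MATERIAL")["DATE"].transform("max")
out = df[df["DATE"].ne(last_date)]

# Made-up variant: repeat the AAA row dated 2022-01-03 (it becomes index 6).
df_ties = pd.concat([df, df.iloc[[2]]], ignore_index=True)
last_date_ties = df_ties.groupby("MATERIAL")["DATE"].transform("max")
# Both AAA rows dated 2022-01-03 (index 2 and 6) are dropped here.
out_ties = df_ties[df_ties["DATE"].ne(last_date_ties)]
print(out)
print(out_ties)
```

If exactly one row per group should go regardless of ties, a position-based filter such as the cumcount approach is probably the safer choice.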