I have a pandas df as follows:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
AAA 2022-01-03 5 2
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
BBB 2022-01-03 8 4
I want to groupby
MATERIAL
and sort_values
by DATE
and choose all rows except last one in the group.
The resulting result should be:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
I have tried df.sort_values('DATE').groupby('MATERIAL').head(-1)
but this results in an empty df.
The DATE
is a pd.datetime
object.
Thanks!
CodePudding user response:
Use Series.duplicated
with keep='last'
for all values without last:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df['MATERIAL'].duplicated(keep='last')]
print (df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
With groupby solution is possible by GroupBy.cumcount
with descending count and filter all rows without 0
:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df.groupby('MATERIAL').cumcount(ascending=False).ne(0)]
print (df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
CodePudding user response:
You could use:
(df.groupby('MATERIAL', as_index=False, group_keys=False)
.apply(lambda d: d.iloc[:len(d)-1])
)
output:
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
CodePudding user response:
Another way is to sort by dates first, then group and take every row except the last one using indexing:
>>> df.sort_values("DATE").groupby("MATERIAL").apply(lambda group_df: group_df.iloc[:-1])
MATERIAL DATE HIGH LOW
MATERIAL
AAA 0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
BBB 3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
CodePudding user response:
Another way would be using groupby transform
with nth
as -1, and compare this with DATE column and only select rows which doesnot match this:
df = df.sort_values(['MATERIAL','DATE'])
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('nth',-1))
out = df[c].copy()
print(out)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Side note: Since you have a date column, you can also use transform with max
or last
but that would only limit you to the last row as opposed to the second last row for example for which you might need nth
as shown above:
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('max'))