I have a Pandas DataFrame with a meaningful index and various groups of repeating rows. Suppose it looks like this:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 3]], columns=["a", "b", "c"])
>>> df
   a  b  c
0  1  1  1
1  2  3  4
2  2  3  4
3  1  1  1
4  1  1  1
5  1  1  1
6  3  3  3
I am trying to remove the repeated rows (apart from the first one in each repeating batch), but keep the index of the last row from the batch.
The result I am looking for is this (i.e. a new "last" column containing the index of the last repeated row from the batch, which will be equal to the index if there is no repeat):
>>> df2
   last  a  b  c
0     0  1  1  1
1     2  2  3  4
3     5  1  1  1
6     6  3  3  3
Notice that the [1, 1, 1] entries appear twice and are treated as separate blocks.
I have tried various combinations of groupby, duplicated, etc., but without stumbling on the necessary formulation. This feels like it should be a fairly standard thing to want to do. Is there a straightforward way to achieve this for an arbitrary DataFrame?
Edit:
Note that I would like to preserve the original index from the first item in each batch, and add a new column called, say, last for the last index from the batch.
CodePudding user response:
So in your case
out = df[~df.shift().ne(df).cumsum().duplicated(keep='last')]
>>> out
   a  b  c
0  1  1  1
2  2  3  4
5  1  1  1
6  3  3  3
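A sketch of what that one-liner is doing, step by step (the intermediate names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 3]],
    columns=["a", "b", "c"],
)

# df.shift().ne(df) is True wherever a row differs from the one above it,
# so the column-wise cumulative sum assigns each consecutive block of
# repeated rows its own id.
block_ids = df.shift().ne(df).cumsum()

# Dropping rows whose block id is duplicated (keeping the last) leaves
# the final row of each block.
out = df[~block_ids.duplicated(keep='last')]
```

Because the ids count block boundaries rather than compare values globally, the two separate [1, 1, 1] blocks get different ids and are not merged.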
CodePudding user response:
One way of doing this, similar to BENY's approach but using pandas.DataFrame.diff:
df[~df.diff().cumsum().duplicated(keep='last')]
   a  b  c
0  1  1  1
2  2  3  4
5  1  1  1
6  3  3  3
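The intuition, sketched below with an illustrative intermediate name: diff() is all zeros inside a block of repeated rows, so the cumulative sum stays constant within a block and changes at each block boundary. One caveat worth noting is that diff() subtracts values, so this variant presumably needs numeric columns, whereas shift().ne() works for any dtype.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 3]],
    columns=["a", "b", "c"],
)

# Constant within each block of repeated rows, distinct across blocks
# (the first row is NaN, which forms its own group here).
blocks = df.diff().cumsum()
kept = df[~blocks.duplicated(keep='last')]
```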
CodePudding user response:
Thanks to @BENY and @jab for your answers, which were very close to what I needed. I added the extra last index column with some simple tweaks as follows:
last_indices = df[~df.diff().cumsum().duplicated(keep='last')].index
df2 = df[~df.diff().cumsum().duplicated(keep='first')].copy()  # copy() avoids a SettingWithCopyWarning on the insert
df2.insert(0, "last", last_indices)
This yields:
   last  a  b  c
0     0  1  1  1
1     2  2  3  4
3     5  1  1  1
6     6  3  3  3
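For reuse, the same tweak can be wrapped in a small helper. The function name below is just illustrative, and it swaps in the shift().ne() formulation from the first answer so that non-numeric columns should also work:

```python
import pandas as pd

def collapse_repeats(df):
    """Drop consecutive duplicate rows, keeping each block's first index
    and recording its last index in a new 'last' column."""
    # One id per consecutive block of identical rows.
    block_ids = df.shift().ne(df).cumsum()
    # Index of the last row in each block, in block order.
    last_indices = df.index[~block_ids.duplicated(keep='last')]
    # Keep the first row of each block; copy() avoids SettingWithCopyWarning.
    out = df[~block_ids.duplicated(keep='first')].copy()
    out.insert(0, "last", last_indices)
    return out

df = pd.DataFrame(
    [[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 3]],
    columns=["a", "b", "c"],
)
df2 = collapse_repeats(df)
```

Since both masks come from the same block ids, the first-row and last-row selections are guaranteed to line up one-to-one, which is what makes the positional insert safe.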