Group Pandas DataFrame repeated rows, preserving last index

Time:10-27

I have a Pandas DataFrame with a meaningful index and various groups of repeating rows. Suppose it looks like this:

>>> import pandas as pd

>>> df = pd.DataFrame([[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 3]], columns=["a", "b", "c"])

>>> df
   a  b  c
0  1  1  1
1  2  3  4
2  2  3  4
3  1  1  1
4  1  1  1
5  1  1  1
6  3  3  3

I am trying to remove the repeated rows (apart from the first one in each repeating batch), but keep the index of the last row from the batch.

The result I am looking for is this (i.e. a new "last" column containing the index of the last repeated row from the batch, which will be equal to the index if there is no repeat):

>>> df2
   last  a  b  c
0     0  1  1  1
1     2  2  3  4
3     5  1  1  1
6     6  3  3  3

Notice that the [1, 1, 1] entries appear twice, and are treated as separate blocks.

I have tried various combinations of groupby, duplicated, etc., but without stumbling on the necessary formulation. This feels like it should be a fairly standard thing to want to do. Is there a straightforward way to achieve this for an arbitrary DataFrame?

Edit:

Note that I would like to preserve the original index from the first items in the batch, and add a new column called, say, last for the last index from the batch.

CodePudding user response:

So in your case, compare each row with the one above it, label each run of identical rows with cumsum, and keep only the last row of each run:

>>> out = df[~df.shift().ne(df).cumsum().duplicated(keep='last')]
>>> out
   a  b  c
0  1  1  1
2  2  3  4
5  1  1  1
6  3  3  3
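Breaking that one-liner into steps may make it clearer (a sketch using the example frame from the question; `labels` and `out` are names introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1],
     [1, 1, 1], [1, 1, 1], [3, 3, 3]],
    columns=["a", "b", "c"],
)

# df.shift().ne(df) is True wherever a row differs from the row above it
# (the first row compares against NaN, so it always starts a new run).
# cumsum() then gives every run of consecutive identical rows the same label.
labels = df.shift().ne(df).cumsum()

# duplicated(keep='last') marks every row of a run except its last,
# so negating the mask keeps exactly one row per run: the last one.
out = df[~labels.duplicated(keep='last')]
print(out)
```

Note that this labels runs positionally, so the two separate [1, 1, 1] blocks get different labels and are kept apart, as required.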

CodePudding user response:

One way of doing this, similar to BENY's approach but using pandas.DataFrame.diff:

df[~df.diff().cumsum().duplicated(keep='last')]

   a  b  c
0  1  1  1
2  2  3  4
5  1  1  1
6  3  3  3
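A caveat worth noting (my own observation, not from the thread): `df.diff().cumsum()` just reconstructs the original values offset by the first row, so non-adjacent rows with equal values can collide and be treated as one run. A minimal counterexample, contrasted with the shift-comparison version:

```python
import pandas as pd

# No adjacent repeats at all, so nothing should be dropped.
df = pd.DataFrame({"a": [1, 2, 1, 2]})

# diff().cumsum() yields NaN, 1, 0, 1 -- rows 1 and 3 collide
# even though they are not adjacent, so row 1 is wrongly dropped.
out_diff = df[~df.diff().cumsum().duplicated(keep='last')]

# The shift comparison labels runs positionally (1, 2, 3, 4 here)
# and correctly keeps all four rows.
out_shift = df[~df.shift().ne(df).cumsum().duplicated(keep='last')]
```

For frames like the one in the question the two are equivalent, but the shift-based version is the safer choice in general.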

CodePudding user response:

Thanks to @BENY and @jab for your answers, which were very close to what I needed. I added the extra "last" index column with some simple tweaks as follows:

last_indices = df[~df.diff().cumsum().duplicated(keep='last')].index

df2 = df[~df.diff().cumsum().duplicated(keep='first')].copy()  # .copy() avoids a SettingWithCopyWarning on insert
df2.insert(0, "last", last_indices)

This yields:

   last  a  b  c
0     0  1  1  1
1     2  2  3  4
3     5  1  1  1
6     6  3  3  3
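For an arbitrary frame, the same result can also be built in one pass with groupby on the run labels, which avoids filtering twice (a sketch under the same assumptions as the answers above; `blocks`, `idx`, and `df2` are names introduced here):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1],
     [1, 1, 1], [1, 1, 1], [3, 3, 3]],
    columns=["a", "b", "c"],
)

# A new block starts whenever any column differs from the previous row.
blocks = df.ne(df.shift()).any(axis=1).cumsum()

# Group the original index by block to recover each block's first and
# last label; group the rows themselves to take each block's values once.
idx = df.index.to_series().groupby(blocks)
df2 = df.groupby(blocks).first()
df2.insert(0, "last", idx.last().to_numpy())
df2.index = idx.first().to_numpy()
print(df2)
```

This keeps the first index of each block as the row label and stores the last index in the new "last" column, matching the output shown above.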