Remove duplicate rows based on previous rows' values in a specific column


I have a dataframe similar to the following example:

import pandas as pd
data = pd.DataFrame(data={'col1': [1,2,3,4,5,6,7,8,9], 'col2': [1.55,1.55,1.55,1.8,1.9,1.9,1.9,2.1,2.1]})

In the second column, col2, there are several runs of duplicate values: 1.55 three times, 1.9 three times, and 2.1 twice. What I need to do is remove every row whose col2 value duplicates that of the row directly before it, so only the first row of each run is kept. In this example, those are the rows with col1 values 1, 4, 5, and 8, giving the following dataframe as my desired output:

clean_data = pd.DataFrame(data={'col1': [1,4,5,8], 'col2': [1.55,1.8,1.9,2.1]})

What is the best way to go about this for a dataframe which is much larger (in terms of rows) than this small example?

CodePudding user response:

You can use shift to compare each row's col2 value to the previous row's and keep only the rows where they differ:

# keep rows whose col2 differs from the row above; the first row compares
# against NaN, so it is always kept
data.loc[data['col2'] != data['col2'].shift(1)]
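
For reference, a minimal end-to-end sketch using the example data from the question (variable names match the question):

import pandas as pd

data = pd.DataFrame(data={'col1': [1,2,3,4,5,6,7,8,9],
                          'col2': [1.55,1.55,1.55,1.8,1.9,1.9,1.9,2.1,2.1]})

# shift(1) aligns each col2 value with the value from the previous row,
# so the mask is True exactly where a new run starts
clean_data = data.loc[data['col2'] != data['col2'].shift(1)]
print(clean_data)
#    col1  col2
# 0     1  1.55
# 3     4  1.80
# 4     5  1.90
# 7     8  2.10

Note that this is not the same as drop_duplicates: the shift comparison only removes consecutive repeats, so a value that reappears later after a different value in between would be kept, which is what the question asks for. Since the comparison is fully vectorized, it should scale well to much larger dataframes.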