I have a dataframe similar to the following.
import pandas as pd
data = pd.DataFrame({'ind': [111,222,333,444,555,666,777,888,999,000],
'col1': [1,2,2,2,3,4,5,5,6,7],
'col2': [9,2,2,2,9,9,5,5,9,9],
'col3': [11,2,2,2,11,11,5,5,11,11],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})
There is an index ind
, a number of columns col
1, 2 and 3, and some other column with a value val
. Within the three columns 1, 2 and 3 there are a number of rows which are the exact same as the previous row, for instance row with index 333 and 444 are the same as 222. My actual data set is larger but what I need to do is delete all rows which have the exact same value as the immediate previous row for a number of columns (col1
, col2
, col3
here).
This would give me a dataframe like this with indeces 333/444 and 888 removed:
data_clean = pd.DataFrame({'ind': [111,222,555,666,777,999,000],
'col1': [1,2,3,4,5,6,7],
'col2': [9,2,9,9,5,9,9],
'col3': [11,2,11,11,5,11,11],
'val': ['a', 'b', 'e', 'f', 'g', 'i', 'j']})
What is the best way to go about this for a larger dataframe?
CodePudding user response:
You can use shift
and any
for boolean indexing:
cols = ['col1', 'col2', 'col3']
out = data[data[cols].ne(data[cols].shift()).any(axis=1)]
# DeMorgan's equivalent:
# out = data[~data[cols].eq(data[cols].shift()).all(axis=1)]
Output:
ind col1 col2 col3 val
0 111 1 9 11 a
1 222 2 2 2 b
4 555 3 9 11 e
5 666 4 9 11 f
6 777 5 5 5 g
8 999 6 9 11 i
9 0 7 9 11 j
Intermediates
# shifted dataset
data[cols].shift()
col1 col2 col3
0 NaN NaN NaN
1 1.0 9.0 11.0
2 2.0 2.0 2.0
3 2.0 2.0 2.0
4 2.0 2.0 2.0
5 3.0 9.0 11.0
6 4.0 9.0 11.0
7 5.0 5.0 5.0
8 5.0 5.0 5.0
9 6.0 9.0 11.0
# comparison
data[cols].ne(data[cols].shift())
col1 col2 col3
0 True True True
1 True True True
2 False False False
3 False False False
4 True True True
5 True False False
6 True True True
7 False False False
8 True True True
9 True False False
# aggregation
data[cols].ne(data[cols].shift()).any(axis=1)
0 True
1 True
2 False
3 False
4 True
5 True
6 True
7 False
8 True
9 True
dtype: bool
CodePudding user response:
One way to do this would be to use the shift()
method to compare the values in each row with the previous row for the columns you're interested in, and then use the boolean indexing to select only the rows that have different values from the previous row. Here's an example:
# First, let's create a DataFrame with the columns we want to compare
compare_df = data[['col1', 'col2', 'col3']]
# Next, use the shift() method to compare each row with the previous row
comparison_result = compare_df.eq(compare_df.shift())
# Now we can use boolean indexing to select only the rows that have different values
# from the previous row
data_clean = data[~comparison_result.all(axis=1)]
This should give you a DataFrame that only contains rows that are different from the previous row for the specified columns.
Alternatively, you could use the drop_duplicates() method to remove duplicate rows based on the values in the specified columns. This method will remove all rows that have the same values for the specified columns, regardless of whether they are consecutive or not. Here's an example:
# Use the drop_duplicates() method to remove duplicate rows
data_clean = data.drop_duplicates(subset=['col1', 'col2', 'col3'])
This should give you the same result as the previous approach, but in a more concise and straightforward way.