Home > OS >  remove rows with duplicated values of two columns that are shifted in pandas data frame?
remove rows with duplicated values of two columns that are shifted in pandas data frame?

Time:11-01

I have a following dataframe:

import pandas as pd

data = [['1044', '1924'], ['1044', '1926'], ['1044', '1927'], ['1044', '1928'], ['1048', '1924'], ['1048', '1926'], ['1048', '1927'], ['1048', '1928'], ['1051', '1924'], ['1051', '1926'], ['1051', '1927'], ['1051', '1928'], ['1058', '']]  
df = pd.DataFrame(data, columns = ['Col1', 'Col2']) 

enter image description here

I would like to reduce the dataframe as shown here:

enter image description here

I have no clue how to do that, but it should be working as follows:

  • keep the first duplicate in the column "Date" - in this example row 0
  • drop the second duplicate in the column "Value" - in this example row 3

Hopefully it does exist a easier way how to perform it.

CodePudding user response:

You can build a new dataframe instead, with the unique values:

pd.DataFrame({key:pd.unique(value) for key, value in df.items()})
Out[252]: 
   Col1  Col2
0  1044  1924
1  1048  1926
2  1051  1927

CodePudding user response:

Here is another method that should work irrespective of the order, with an arbitrary number of columns, and which should be easy to customize if more complex selections are needed:

(df.assign(x=df.groupby('Col1').ngroup(),
           y=df.groupby('Col1').cumcount())
  .query('x==y') # this can be updated in case more complex selections are required
  .drop(['x','y'], axis=1)
)

Output:

   Col1  Col2
0  1044  1924
4  1048  1926
8  1051  1927

CodePudding user response:

I think, cumcount each column and check similarities between the outputs, Code below;

df =df.assign(Date_cumcount=df.groupby('Date').cumcount(),Value_cumcount = df.groupby('Value').cumcount())
out =df[df['Date_cumcount']==df['Value_cumcount']].drop(['Date_cumcount','Value_cumcount'],1)



     Date Value
0   1044  1924
5   1048  1926
10  1051  1927
12  1058  
  • Related