Home > OS >  How to keep a specific duplicate from among duplicates?
How to keep a specific duplicate from among duplicates?

Time:04-16

I have a .csv file.

time,open,high,low,close,Extremum,Fib 1,Fib 2,Fib 3,l100,LS3,SS3,Volume,Volume MA

lots of rows like:

2022-04-08T02:00:00 02:00,43.431,43.44,43.431,43.44,44.669,43.58332033414956,43.28818411430672,43.11250779297169,42.91223678664976,,,78.07,

And there are duplicates, like 4 of them, with differences at "Extremum" column Like this:

2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,41.589,42.64812186602502,42.93603848979882,43.10741743252131,43.30278942722496,,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954

Its sorted by 'time', axis=0 (its Column A, column 0 in calc sheet)

csvData.sort_values(by=["time"],axis=0,ascending=True,inplace=True,na_position='first')

Its 4 dupes at time 17:10:25, how to throw out the not matching one?

Here we have: 41.589, 43.6, 43.6, 43.6. The 41.589 is wrong, and need to be out, and need to keep only 1 copy of remaining 3 dupes (that drop.duplicated can do, but it can't give me 4 dupes to process, it can be set only in 3 ways: keep='first', keep='last' or keep=False, and I need not existent keep=True.. I need return of all 4 dupes, to check which one is that 1 of 4 that's bad to out, before I unique_seen them all, to reduce only to 1, with correct 43.6 in this case. Does someone got any ideas how to achieve that? Saw some ideas here at stack, but can't understand them enough to apply to my case, so I'm kindly asking for help.

CodePudding user response:

You can use twice duplicated with two different modes: keep=False and another of your choice. Then compute a boolean mask from those two for slicing.

Assuming this example dataset:

  date col  other
0    a   a      0
1    a   a      1
2    a   X      2   # unique
3    a   a      3
4    b   Y      4   # unique
5    b   b      5
6    b   b      6
7    b   b      7

You can use:

m1 = df.duplicated(subset=['date','col'])
m2 = df.duplicated(subset=['date','col'], keep=False)

df2 = df[m1!=m2]

Output:

  date col  other
0    a   a      0
5    b   b      5

Intermediates:

  date col  other     m1     m2  m1!=m2
0    a   a      0  False   True    True
1    a   a      1   True   True   False
2    a   X      2  False  False   False
3    a   a      3   True   True   False
4    b   Y      4  False  False   False
5    b   b      5  False   True    True
6    b   b      6   True   True   False
7    b   b      7   True   True   False
  • Related