I have a .csv file with this header:
time,open,high,low,close,Extremum,Fib 1,Fib 2,Fib 3,l100,LS3,SS3,Volume,Volume MA
and lots of rows like:
2022-04-08T02:00:00 02:00,43.431,43.44,43.431,43.44,44.669,43.58332033414956,43.28818411430672,43.11250779297169,42.91223678664976,,,78.07,
There are also duplicate rows, e.g. 4 of them, with differences in the "Extremum" column, like this:
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,41.589,42.64812186602502,42.93603848979882,43.10741743252131,43.30278942722496,,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
It's sorted by 'time', axis=0 (it's column A, column 0 in a Calc sheet):
csvData.sort_values(by=["time"], axis=0, ascending=True, inplace=True, na_position='first')
There are 4 dupes at time 17:10:25; how do I throw out the non-matching one?
Here we have Extremum values 41.589, 43.6, 43.6, 43.6. The 41.589 row is wrong and needs to go, and I need to keep only 1 copy of the remaining 3 dupes. drop_duplicates can do that last part, but it can't give me all 4 dupes to process: keep can only be set in 3 ways (keep='first', keep='last' or keep=False), and I would need a non-existent keep=True. I need all 4 dupes returned, so I can check which 1 of the 4 is the bad one to throw out, before I reduce the rest down to a single row with the correct 43.6. Does anyone have ideas on how to achieve that? I saw some ideas here on Stack, but I can't understand them well enough to apply them to my case, so I'm kindly asking for help.
CodePudding user response:
You can use duplicated twice, with two different keep modes: keep=False and another of your choice. Then combine the two resulting boolean masks into one for slicing.
Assuming this example dataset:
date col other
0 a a 0
1 a a 1
2 a X 2 # unique
3 a a 3
4 b Y 4 # unique
5 b b 5
6 b b 6
7 b b 7
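If you want to run this yourself, the example frame can be built like so (a hypothetical reconstruction; the column names and values follow the table above):

```python
import pandas as pd

# Hypothetical reconstruction of the example frame shown above.
df = pd.DataFrame({
    "date":  list("aaaabbbb"),
    "col":   ["a", "a", "X", "a", "Y", "b", "b", "b"],
    "other": range(8),
})
```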
You can use:
m1 = df.duplicated(subset=['date','col'])              # default keep='first': flags 2nd+ copies
m2 = df.duplicated(subset=['date','col'], keep=False)  # flags every member of a duplicate group
df2 = df[m1 != m2]                                     # keeps only the first row of each duplicate group
Output:
date col other
0 a a 0
5 b b 5
Intermediates:
date col other m1 m2 m1!=m2
0 a a 0 False True True
1 a a 1 True True False
2 a X 2 False False False
3 a a 3 True True False
4 b Y 4 False False False
5 b b 5 False True True
6 b b 6 True True False
7 b b 7 True True False
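The same idea answers the original question directly: with no subset, duplicated compares full rows, so the odd Extremum=41.589 row is not a duplicate of anything and gets dropped, while one copy of the three matching rows survives. A minimal runnable sketch (the mini-CSV is a hypothetical reduction of the question's four rows to a few columns; the principle is identical for the full file):

```python
import io
import pandas as pd

# Hypothetical reduction of the question's four rows at 17:10:25.
csv = io.StringIO("""time,open,close,Extremum
2022-04-07 17:10:25,41.622,41.625,43.6
2022-04-07 17:10:25,41.622,41.625,41.589
2022-04-07 17:10:25,41.622,41.625,43.6
2022-04-07 17:10:25,41.622,41.625,43.6
""")
df = pd.read_csv(csv)

m1 = df.duplicated()            # keep='first': flags 2nd+ copies of identical rows
m2 = df.duplicated(keep=False)  # flags every member of a duplicate group
df2 = df[m1 != m2]              # first copy of each duplicate group only

print(df2)  # one row left, with Extremum == 43.6
```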