Delete duplicates and sort tab by dates in Python

Time:07-25

I receive as input a table with two values per row: a date and an associated value. I am looking for one or more ways to remove the rows that appear more than once and then sort the remaining rows by date.
My table looks like this:

(datetime.timedelta(seconds=34781, microseconds=474000), 0.004936)
(datetime.timedelta(seconds=33586, microseconds=443000), 0.003214)
(datetime.timedelta(seconds=34781, microseconds=474000), 0.004936)
(datetime.timedelta(seconds=38306, microseconds=654000), 0.001765)
(datetime.timedelta(seconds=38306, microseconds=654000), 0.001765)
(datetime.timedelta(seconds=31245, microseconds=474000), 0.004938)
...

So far I have tried this:

import datetime

listData = [(datetime.timedelta(seconds=34781, microseconds=474000), 0.004936),
            (datetime.timedelta(seconds=33586, microseconds=443000), 0.003214),
            (datetime.timedelta(seconds=34781, microseconds=474000), 0.004936),
            (datetime.timedelta(seconds=38306, microseconds=654000), 0.001765),
            (datetime.timedelta(seconds=38306, microseconds=654000), 0.001765),
            (datetime.timedelta(seconds=31245, microseconds=474000), 0.004938)]

finalTab = []
for i in listData:
    if i not in finalTab:
        finalTab.append(i)

print(finalTab)

But this method only solves half of the problem (it removes the duplicates but does not sort), and I think it would take too much time: I have to process files in this format that are several gigabytes in size (~21,600,000 lines).

I need something like this as output:

(datetime.timedelta(seconds=31245, microseconds=474000), 0.004938)
(datetime.timedelta(seconds=33586, microseconds=443000), 0.003214)
(datetime.timedelta(seconds=34781, microseconds=474000), 0.004936)
(datetime.timedelta(seconds=38306, microseconds=654000), 0.001765)
...

CodePudding user response:

It was actually quite simple. Thanks to Timus, this code answers the question:

import datetime

listData = [(datetime.timedelta(seconds=34781, microseconds=474000), 0.004936),
            (datetime.timedelta(seconds=33586, microseconds=443000), 0.003214),
            (datetime.timedelta(seconds=34781, microseconds=474000), 0.004936),
            (datetime.timedelta(seconds=38306, microseconds=654000), 0.001765),
            (datetime.timedelta(seconds=38306, microseconds=654000), 0.001765),
            (datetime.timedelta(seconds=31245, microseconds=474000), 0.004938)]

finalTab = sorted(set(listData), key=lambda t: t[0])

print(finalTab)
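
For readers who want to reuse this, here is a minimal sketch of the same idea wrapped in a function. The name dedupe_and_sort and the sample variable are hypothetical, not from the original post. It assumes every row is a hashable tuple (which holds for a timedelta paired with a float) and that all rows fit in memory. Membership tests on a set are roughly constant time, whereas "i not in finalTab" on a list scans the whole list on each iteration, which is why the loop in the question gets slow on large inputs.

```python
import datetime


def dedupe_and_sort(rows):
    """Return the unique (timedelta, value) rows, ordered by the timedelta.

    Hypothetical helper: assumes each row is a hashable tuple and that
    the whole input fits in memory.
    """
    # set() drops exact duplicate tuples using hashing instead of
    # repeated linear scans of a list.
    unique_rows = set(rows)
    # Sort on the first element only; rows sharing the same time but
    # holding different values end up in no guaranteed relative order.
    return sorted(unique_rows, key=lambda row: row[0])


sample = [
    (datetime.timedelta(seconds=34781, microseconds=474000), 0.004936),
    (datetime.timedelta(seconds=33586, microseconds=443000), 0.003214),
    (datetime.timedelta(seconds=34781, microseconds=474000), 0.004936),
    (datetime.timedelta(seconds=38306, microseconds=654000), 0.001765),
    (datetime.timedelta(seconds=38306, microseconds=654000), 0.001765),
    (datetime.timedelta(seconds=31245, microseconds=474000), 0.004938),
]

result = dedupe_and_sort(sample)
for row in result:
    print(row)
```

If rows with the same time but different values should also come out in a fixed order, dropping the key argument (sorted(unique_rows)) compares the full tuples instead. With around 21,600,000 rows, holding everything in a set and a list at once may use a significant amount of memory, so it is worth measuring on a real file first.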