How to remove duplicates from list of strings based on timestamp

Time:07-18

I have the following list:

ls = ["2022-07-17 16:00:02 txt xyz", "2022-07-17 15:00:02 txt xyz", "2022-07-17 16:00:02 txt abc"]

I want to keep only one entry per unique text (xyz and abc), namely the one with the newest timestamp. This is my expected outcome:

ls = ["2022-07-17 16:00:02 txt xyz", "2022-07-17 16:00:02 txt abc"]

My approach was to use a dictionary sorted by value, but then I still don't know how to remove the older timestamp.

import datetime
import re

keep_message = {}
for i in range(len(ls)):
    timestamp_str = re.search(r"^(.*?) txt", ls[i]).group(1)
    timestamp = datetime.datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S")
    text = re.search(r"txt (.*?)$", ls[i]).group(1)
    keep_message[text + "_" + timestamp_str] = timestamp

keep_message_sorted = dict(sorted(keep_message.items(), key=lambda item: item[1]))

Is there a better solution?

CodePudding user response:

Use a dictionary to keep track of the most recent date per text:

d = {}
for x in ls:
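    # split each entry into timestamp and text (a regex would also work)
    ts, txt = x.split(' txt ', 1)
    # keep the entry if its text is new, or if it is more recent than the stored one
    if txt not in d or x > d[txt]:
        d[txt] = x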

out = list(d.values())

NB. I used a simple split to extract the text and compared the full strings. This works because each entry starts with the timestamp in zero-padded YYYY-MM-DD HH:MM:SS form, so for entries with the same text, string order matches chronological order. You can also use another extraction method (e.g. a regex) and compare only the datetime part.

Output:

['2022-07-17 16:00:02 txt xyz', '2022-07-17 16:00:02 txt abc']
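
The alternative mentioned in the note above (regex extraction, comparison on the parsed datetime only) could look roughly like this. It is a minimal sketch, not part of the original answer: the names LINE_RE and keep_newest are made up for illustration, and it assumes every entry has the form "YYYY-MM-DD HH:MM:SS txt <text>". Lines that do not match are skipped, which is also an assumption.

```python
import re
from datetime import datetime

# Hypothetical pattern: timestamp, the literal " txt ", then the message text.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) txt (.*)$")


def keep_newest(entries):
    """Keep one entry per text, choosing the one with the latest timestamp."""
    newest = {}  # text -> (parsed timestamp, original line)
    for line in entries:
        match = LINE_RE.match(line)
        if match is None:
            continue  # assumption: ignore lines that do not fit the format
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        text = match.group(2)
        # Store the line if the text is new or this timestamp is more recent.
        if text not in newest or ts > newest[text][0]:
            newest[text] = (ts, line)
    return [line for _, line in newest.values()]


ls = [
    "2022-07-17 16:00:02 txt xyz",
    "2022-07-17 15:00:02 txt xyz",
    "2022-07-17 16:00:02 txt abc",
]
out = keep_newest(ls)
print(out)
```

Parsing the timestamp costs a little more than comparing strings, but it no longer depends on the date format sorting correctly as text. The result keeps the order in which each text first appeared in the list.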