I have the following list:
ls = ["2022-07-17 16:00:02 txt xyz", "2022-07-17 15:00:02 txt xyz", "2022-07-17 16:00:02 txt abc"]
I want to keep one entry per unique text (xyz and abc), namely the one with the newest timestamp. This is my expected outcome:
ls = ["2022-07-17 16:00:02 txt xyz", "2022-07-17 16:00:02 txt abc"]
My approach was to build a dictionary and sort it by value, but I still don't know how to remove the older timestamp.
import datetime
import re
keep_message = {}
for i in range(len(ls)):
    # split each entry into its timestamp and its text
    timestamp_str = re.search(r"^(.*?) txt", ls[i]).group(1)
    timestamp = datetime.datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S")
    text = re.search(r"txt (.*?)$", ls[i]).group(1)
    keep_message[text + "_" + timestamp_str] = timestamp
keep_message_sorted = dict(sorted(keep_message.items(), key=lambda item: item[1]))
Is there a better solution?
CodePudding user response:
Use a dictionary to keep track of the most recent date per text:
d = {}
for x in ls:
    # get txt (NB. you can also use a regex)
    ts, txt = x.split(' txt ', 1)
    if txt not in d or x > d[txt]:
        d[txt] = x
out = list(d.values())
NB. I used a simple split to get the txt, and performed the comparison on the full string, since the date comes first and its format (YYYY-MM-DD HH:MM:SS) sorts chronologically when compared as a string. However, you can use another extraction method (e.g. a regex) and perform the comparison only on the datetime part.
Output:
['2022-07-17 16:00:02 txt xyz', '2022-07-17 16:00:02 txt abc']
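If you would rather compare parsed datetimes than raw strings (for example because the timestamp format might not sort correctly as text), the variant mentioned in the NB could look roughly like the sketch below. The function name keep_newest and the ENTRY_RE pattern are made up for illustration, and the code assumes every entry has the form "<timestamp> txt <text>" as in the question:

```python
import datetime
import re

# Hypothetical pattern: assumes "<YYYY-MM-DD HH:MM:SS> txt <text>".
ENTRY_RE = re.compile(r"^(.*?) txt (.*)$")


def keep_newest(entries):
    """Return one entry per text, keeping the one with the newest timestamp."""
    newest = {}  # text -> (parsed timestamp, original entry)
    for entry in entries:
        match = ENTRY_RE.match(entry)
        if match is None:
            continue  # skip entries that do not fit the assumed format
        ts = datetime.datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        text = match.group(2)
        # strict ">" means the first entry wins when timestamps are equal
        if text not in newest or ts > newest[text][0]:
            newest[text] = (ts, entry)
    return [entry for _, entry in newest.values()]


ls = [
    "2022-07-17 16:00:02 txt xyz",
    "2022-07-17 15:00:02 txt xyz",
    "2022-07-17 16:00:02 txt abc",
]
print(keep_newest(ls))
```

Like the dictionary approach above, this keeps the texts in order of first appearance, because Python dictionaries preserve insertion order.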