I want to aggregate the most recent document from each source. The input is a list of documents sorted by timestamp, most recent first. Is there a more concise way of constructing the output?
Input:
docs = [
    {
        "timestamp": "2022-10-11T16:00:00.000000",
        "source": "foo"
    },
    {
        "timestamp": "2022-10-10T16:00:00.000000",
        "source": "bar"
    },
    {
        "timestamp": "2022-10-09T16:00:00.000000",
        "source": "foo"
    }
]
Output:
result = [
    {
        "timestamp": "2022-10-11T16:00:00.000000",
        "source": "foo"
    },
    {
        "timestamp": "2022-10-10T16:00:00.000000",
        "source": "bar"
    }
]
My attempt with iteration:
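def most_recent_per_source(docs):
    # docs is sorted newest first, so the first document seen
    # for each source is the most recent one
    result = {}
    for doc in docs:
        if doc["source"] not in result:
            result[doc["source"]] = doc
    return list(result.values())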
CodePudding user response:
Your attempt does not sort anything. It keeps the document closest to the beginning of the list for each source, which is the most recent one only if the input really is sorted newest first.
Since the current timestamp format can be compared as plain strings (without parsing into a datetime object), a straightforward implementation that does not depend on the input order may look like this:
# use an intermediate dictionary for fast access to the latest seen document
# by its source as a key (to avoid searching the list on each iteration)
doc_by_source = {}
for doc_current in docs:
    source = doc_current["source"]
    if source in doc_by_source:
        doc_exist = doc_by_source[source]
        # we can compare timestamps as strings due to the current format
        if doc_current["timestamp"] > doc_exist["timestamp"]:
            # we found a newer document, so store it instead of the older one
            doc_by_source[source] = doc_current
    else:
        # we have never seen a document from that source before, so store it
        doc_by_source[source] = doc_current
# extract the list of the most recent documents
recent_docs = list(doc_by_source.values())
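If the timestamps might not always arrive in this fixed-width format, comparing parsed values is a safer option than comparing strings. Here is a minimal sketch of the same loop, assuming Python 3.7+ (for `datetime.fromisoformat`) and timestamps that it can parse; the function name `latest_per_source` is made up for illustration:

```python
from datetime import datetime


def latest_per_source(docs):
    """Return the newest document for each source (hypothetical helper)."""
    newest = {}
    for doc in docs:
        seen = newest.get(doc["source"])
        # Compare parsed datetimes instead of raw strings, so the result
        # does not depend on the strings being lexicographically sortable.
        if seen is None or (
            datetime.fromisoformat(doc["timestamp"])
            > datetime.fromisoformat(seen["timestamp"])
        ):
            newest[doc["source"]] = doc
    return list(newest.values())


docs = [
    {"timestamp": "2022-10-11T16:00:00.000000", "source": "foo"},
    {"timestamp": "2022-10-10T16:00:00.000000", "source": "bar"},
    {"timestamp": "2022-10-09T16:00:00.000000", "source": "foo"},
]
recent_docs = latest_per_source(docs)
```

Parsing costs a little extra per comparison, so for the format shown in the question the plain string comparison is still fine.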
Or it may be implemented as a "one-liner":
- First we use sorted() on the original list of documents to order them from oldest to newest. Again, keep in mind that sorting by the timestamp string without parsing it into a datetime object is correct only as long as the timestamp format goes from the most significant field on the left (year) to the least significant on the right.
- Then we build a dictionary of documents keyed by source. Every time the dictionary receives a source it already contains, it replaces the document stored under that key. The new document is always the newer one, because we have sorted the documents by timestamp.
- Finally we extract the most recent documents from the values of the intermediate dictionary.
recent_docs = list({
    doc["source"]: doc
    for doc in sorted(docs, key=lambda doc: doc["timestamp"])
}.values())
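One-liners are more compact, but they are harder to read, and they are not necessarily faster: this one sorts the whole list, which is O(n log n), while the explicit loop makes a single pass over it.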
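If the input really is sorted newest first, as the question states, the sort can be skipped by walking the list backwards so that newer documents overwrite older ones. This is only a sketch under that assumption; with unsorted input it would return wrong results:

```python
docs = [
    {"timestamp": "2022-10-11T16:00:00.000000", "source": "foo"},
    {"timestamp": "2022-10-10T16:00:00.000000", "source": "bar"},
    {"timestamp": "2022-10-09T16:00:00.000000", "source": "foo"},
]

# ASSUMPTION: docs is already ordered newest first.
# reversed() yields oldest to newest, so for each source the last
# assignment (the newest document) is the one that stays in the dict.
recent_docs = list({doc["source"]: doc for doc in reversed(docs)}.values())
```

For the sample input this yields the "foo" document from 2022-10-11 followed by the "bar" document from 2022-10-10, matching the expected output in the question.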