Aggregate data set where a specific field must be unique

Time:10-17

I want to aggregate the most recent documents from each source. The input is a list of documents sorted by most recent timestamp. Is there a more concise way of constructing the output?

Input:

docs = [
   {
       "timestamp": "2022-10-11T16:00:00.000000",
       "source": "foo"
   },
   {
       "timestamp": "2022-10-10T16:00:00.000000",
       "source": "bar"
   },
   {
       "timestamp": "2022-10-09T16:00:00.000000",
       "source": "foo"
   }
]

Output:

result = [
   {
       "timestamp": "2022-10-11T16:00:00.000000",
       "source": "foo"
   },
   {
       "timestamp": "2022-10-10T16:00:00.000000",
       "source": "bar"
   }
]

My attempt with iteration:

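def latest_per_source(docs):
    result = {}
    for doc in docs:
        # docs is sorted newest first, so the first document seen
        # for each source is the most recent one
        if doc["source"] not in result:
            result[doc["source"]] = doc
    return list(result.values())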

CodePudding user response:

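Your attempt does not sort anything itself. It keeps the first document it sees for each source, so it returns the most recent documents only if the input list is already sorted from newest to oldest. On unsorted input it returns whichever document is closest to the beginning of the list for each source.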
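To illustrate how the first-seen approach depends on input order, here is a minimal sketch. The helper name `first_seen_per_source` and the unsorted sample list are assumptions made up for this example; they are not part of the question.

```python
def first_seen_per_source(docs):
    """Keep the first document encountered for each source (hypothetical helper)."""
    result = {}
    for doc in docs:
        # setdefault stores doc only if this source is not already a key
        result.setdefault(doc["source"], doc)
    return list(result.values())


# Hypothetical unsorted input: the older "foo" document comes first.
unsorted_docs = [
    {"timestamp": "2022-10-09T16:00:00.000000", "source": "foo"},
    {"timestamp": "2022-10-11T16:00:00.000000", "source": "foo"},
]

picked = first_seen_per_source(unsorted_docs)
print(picked)  # the 2022-10-09 document is kept, not the newer one
```

With input that really is sorted newest first, as stated in the question, the same helper returns the most recent document per source.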

Since the current timestamp format sorts correctly as a plain string (without parsing it into a datetime object), a straightforward implementation may look like this:

# use intermediate dictionary for fast access to the latest seen document
# by its source as a key, (to avoid searching in array on each iteration)
doc_by_source = {}
for doc_current in docs:
    source = doc_current["source"]
    if source in doc_by_source:
        doc_exist = doc_by_source[source]
        # we can compare timestamp as strings due to current format
        if doc_current["timestamp"] > doc_exist["timestamp"]:
            # we found newer document, so store it instead of older one
            doc_by_source[source] = doc_current
    else:
        # we have never seen documents from that source before, so store it
        doc_by_source[source] = doc_current
# extract list of the most recent documents
recent_docs = list(doc_by_source.values())
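The string comparison used above can be sanity-checked against real datetime parsing. This is only a small sketch on the timestamps from the question; it assumes every timestamp uses exactly this format and carries no UTC offset, since mixed offsets would break plain string ordering.

```python
from datetime import datetime

timestamps = [
    "2022-10-11T16:00:00.000000",
    "2022-10-10T16:00:00.000000",
    "2022-10-09T16:00:00.000000",
]

# Sort once as plain strings and once as parsed datetime objects.
by_string = sorted(timestamps)
by_datetime = sorted(timestamps, key=datetime.fromisoformat)

# For this format both orderings should agree.
same_order = by_string == by_datetime
print(same_order)
```

If the two orderings ever differ for your data, parse the timestamps before comparing them.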

Or it may be implemented as a "one-liner":

  • First, sorted() returns a new list of the documents ordered from oldest to newest. Again, keep in mind that sorting by the timestamp string without parsing it into a datetime object is correct only as long as the timestamp format runs from the most significant field on the left (the year) to the least significant field on the right.
  • Then a dictionary of documents keyed by source is built. Whenever the dictionary receives a source it already contains, it replaces the document stored under that key. The replacement is always the newer document, because the documents were sorted by timestamp.
  • Finally, the most recent documents are extracted from the values of the intermediate dictionary.
recent_docs = list({
    doc["source"]: doc
    for doc in sorted(docs, key=lambda doc: doc["timestamp"])
}.values())
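As a usage sketch, the same expression can be wrapped in a function and run on the input from the question. The function name `most_recent_per_source` is an assumption introduced here for illustration.

```python
def most_recent_per_source(docs):
    """Return the newest document per source (hypothetical wrapper)."""
    return list({
        doc["source"]: doc
        # oldest first, so later (newer) documents overwrite earlier ones
        for doc in sorted(docs, key=lambda doc: doc["timestamp"])
    }.values())


docs = [
    {"timestamp": "2022-10-11T16:00:00.000000", "source": "foo"},
    {"timestamp": "2022-10-10T16:00:00.000000", "source": "bar"},
    {"timestamp": "2022-10-09T16:00:00.000000", "source": "foo"},
]

recent_docs = most_recent_per_source(docs)
for doc in recent_docs:
    print(doc["source"], doc["timestamp"])
```

Because a dict keeps keys in insertion order, the result lists sources in the order they first appear in the oldest-first sorting, not necessarily newest first. Sort the result again if a particular output order matters.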

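The one-liner is more compact, but it is harder to read. Note also that it sorts the whole list, which takes O(n log n) time, whereas the loop above makes a single O(n) pass.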
