Summing duplicates in a list of dictionaries by a compound key using itertools


I have a sorted list of dictionaries like so:

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

This is effectively a list of (id1, id2, value) tuples, but with duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs whose value is the sum of the duplicates.

That is, from above, the desired output is:

dat = [
     {'id1': 1, 'id2': 2, 'value': 3},
     {'id1': 2, 'id2': 2, 'value': 2},
     {'id1': 2, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 4, 'value': 4}
     ]

Assume the list contains millions of entries with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus, say, pandas)?

CodePudding user response:

You can start with collections.Counter and use the += operator. The convenient part of Counter is that += assumes zero for missing keys.
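
To see that missing-key behavior in isolation (a small aside of mine, not from the original answer):

from collections import Counter

c = Counter()
c['new key'] += 1      # no KeyError: a missing key reads as 0, so this stores 1
print(c['new key'])    # 1
print(c['never set'])  # 0 -- lookups of missing keys also return 0 (without storing them)

d = {}
# d['new key'] += 1    # a plain dict would raise KeyError here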

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

from collections import Counter

cnt = Counter()
for item in dat:
    cnt[item["id1"], item["id2"]] += item["value"]  # += works because missing keys count as 0

[{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in cnt.items()]

Giving

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]

CodePudding user response:

Just for fun, a purely itertools solution: it uses no collections, nor any other intermediate container that must be built and updated incrementally, as long as the list is already in key order. It does require a pre-sort if you can't guarantee the data is already sorted so that unique id pairs are grouped together:

# At top of file
from itertools import groupby

# Also at top of file; not strictly necessary, but I find it's nicer to make cheap getters
# with self-documenting names
from operator import itemgetter
get_ids = itemgetter('id1', 'id2')
get_value = itemgetter('value')

# On each use:
dat.sort(key=get_ids)  # Not needed if data guaranteed grouped by unique id1/id2 pairs as in example

dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(dat, key=get_ids)]

# If sorting is needed, you can optionally one-line it, though the result is rather dense (I don't recommend it):
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(sorted(dat, key=get_ids), key=get_ids)]

Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, since they get O(n) performance even with unsorted data (groupby itself is O(n), but a required pre-sort makes the whole thing O(n log n)). Basically the only time groupby even has a theoretical advantage is when the data is already sorted and you value a one-liner (excluding imports and the one-time setup cost of making the itemgetters). In practice, itertools.groupby carries enough overhead that it still typically loses to one or both of collections.Counter and collections.defaultdict(int), especially when Counter is used in its optimized mode for counting iterables of things to count (which doesn't apply here, since we're summing values rather than counting occurrences, but is worth knowing about; see the aside below).
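
As that aside (my addition for illustration, not part of the original answer): the optimized mode is passing an iterable of hashable items straight to Counter, which counts them in a C-accelerated loop. That counts occurrences of each (id1, id2) pair rather than summing the value field, so it doesn't solve this problem, but it's handy when plain counting is what you want:

from collections import Counter
from operator import itemgetter

get_ids = itemgetter('id1', 'id2')

# dat as defined in the question.
# Counts how many rows share each (id1, id2) pair; it does NOT sum 'value'.
dupe_counts = Counter(map(get_ids, dat))
# -> Counter({(3, 4): 4, (1, 2): 2, (2, 2): 1, (2, 3): 1, (3, 3): 1})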

CodePudding user response:

We could use collections.defaultdict as well:

from collections import defaultdict

tmp = defaultdict(int)
for d in dat:
    tmp[d['id1'], d['id2']] += d['value']
out = [{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in tmp.items()]

or (assuming the ids are sorted), itertools.groupby:

from itertools import groupby

out = [{'id1': k1, 'id2': k2, 'value': sum(d['value'] for d in g)}
       for (k1, k2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]

or a groupby / sum / to_dict chain in pandas:

import pandas as pd

out = (pd.DataFrame(dat)
         .groupby(['id1', 'id2'], as_index=False)['value']
         .sum()
         .to_dict('records'))

Output:

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]


A basic benchmark on the provided data says groupby using itemgetter (as suggested by @ShadowRanger) is the fastest (a sketch of a comparable harness follows the numbers):

  1. defaultdict: 6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  2. Counter (Bob): 9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  3. groupby itemgetter (ShadowRanger): 6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  4. groupby lambda: 9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  5. pandas: 3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now if we duplicate dat one million times, i.e. do

from operator import itemgetter

dat = dat * 1_000_000
dat.sort(key=itemgetter('id1', 'id2'))

and run the same benchmark again, groupby with itemgetter is the runaway winner:

  1. 3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  2. 5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  3. 1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  4. 3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  5. 15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

All timings were taken on Python 3.9.7 (64-bit).
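
For reference, a minimal sketch of the kind of harness behind numbers like these (my reconstruction, not the answer's actual benchmark code; the original output format suggests IPython's %timeit, while this uses plain timeit, and only two of the five candidates are shown):

import timeit
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

get_ids = itemgetter('id1', 'id2')

# The small dataset from the question.
dat = [
    {"id1": 1, "id2": 2, "value": 1}, {"id1": 1, "id2": 2, "value": 2},
    {"id1": 2, "id2": 2, "value": 2}, {"id1": 2, "id2": 3, "value": 1},
    {"id1": 3, "id2": 3, "value": 1}, {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1}, {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
]

def via_defaultdict(dat):
    tmp = defaultdict(int)
    for d in dat:
        tmp[d['id1'], d['id2']] += d['value']
    return [{'id1': i1, 'id2': i2, 'value': v} for (i1, i2), v in tmp.items()]

def via_groupby_itemgetter(dat):
    return [{'id1': i1, 'id2': i2, 'value': sum(d['value'] for d in g)}
            for (i1, i2), g in groupby(dat, key=get_ids)]

# The Counter and groupby-with-lambda variants would be wrapped the same way.
for fn in (via_defaultdict, via_groupby_itemgetter):
    per_loop = timeit.timeit(lambda: fn(dat), number=100_000) / 100_000
    print(f"{fn.__name__}: {per_loop * 1e6:.2f} µs per loop")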

This benchmark somewhat favors groupby, since duplicating an existing small list of dicts yields very few groups. If we instead randomize the group sizes, groupby with itemgetter still beats the rest, but the difference is not as stark; a sketch of generating such data follows.
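
For completeness, a minimal sketch of one way to generate data with randomized group sizes (my addition; the answer doesn't specify the exact distribution it used, so this is just one plausible choice):

import random

random.seed(0)  # reproducible
dat = [{'id1': i1, 'id2': i2, 'value': random.randint(1, 10)}
       for i1 in range(1_000)
       for i2 in range(1_000)
       for _ in range(random.randint(1, 10))]  # 1-10 duplicate rows per (id1, id2) pair
# Generated in (id1, id2) order, so groupby needs no pre-sort on this data.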
