I have a sorted list of dictionaries like so:
dat = [
{"id1": 1, "id2": 2, "value": 1},
{"id1": 1, "id2": 2, "value": 2},
{"id1": 2, "id2": 2, "value": 2},
{"id1": 2, "id2": 3, "value": 1},
{"id1": 3, "id2": 3, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
]
This is effectively (id1, id2, value) tuples, but where there are duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs where the new value is the sum of the dupes.
That is, from above, the desired output is:
dat = [
{'id1': 1, 'id2': 2, 'value': 3},
{'id1': 2, 'id2': 2, 'value': 2},
{'id1': 2, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 4, 'value': 4}
]
Assume the list has millions of entries with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus, say, using pandas)?
CodePudding user response:
You can start with collections.Counter and use the += operator; the convenient part of Counter is that += assumes a count of zero for missing keys.
dat = [
{"id1": 1, "id2": 2, "value": 1},
{"id1": 1, "id2": 2, "value": 2},
{"id1": 2, "id2": 2, "value": 2},
{"id1": 2, "id2": 3, "value": 1},
{"id1": 3, "id2": 3, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
]
from collections import Counter
cnt = Counter()
for item in dat:
cnt[item["id1"], item["id2"]] = item["value"]
[{'id1':id1, 'id2': id2, 'value':v}for (id1, id2), v in cnt.items()]
Giving
[{'id1': 1, 'id2': 2, 'value': 3},
{'id1': 2, 'id2': 2, 'value': 2},
{'id1': 2, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 4, 'value': 4}]
CodePudding user response:
Just for fun, a purely itertools solution: no collections, and no intermediate containers that must be built and updated incrementally, as long as the list is already in key order. It does require a pre-sort if you can't guarantee the data is already sorted so that identical (id1, id2) pairs are grouped together:
# At top of file
from itertools import groupby
# Also at top of file; not strictly necessary, but I find it's nicer to make cheap getters
# with self-documenting names
from operator import itemgetter
get_ids = itemgetter('id1', 'id2')
get_value = itemgetter('value')
# On each use:
dat.sort(key=get_ids) # Not needed if data guaranteed grouped by unique id1/id2 pairs as in example
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
for (id1, id2), group in groupby(dat, key=get_ids)]
# If sorting is needed, you can optionally collapse it into this rather overly dense one-liner (I don't recommend it):
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
for (id1, id2), group in groupby(sorted(dat, key=get_ids), key=get_ids)]
Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, since they get O(n) performance even with unsorted data (groupby itself is O(n), but if you need to sort first, the sort is O(n log n)). Basically the only time this approach even has a theoretical advantage is when the data is already sorted and you value a one-liner (excluding imports and the one-time setup cost of making the itemgetters); in practice, itertools.groupby has enough overhead that it still typically loses to one or both of collections.Counter and collections.defaultdict(int), especially when Counter is used in its optimized modes for counting iterables of things to count (which don't apply here, but are worth knowing about).
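For reference, a minimal sketch of that optimized counting mode (my own toy illustration): feeding an iterable of hashable keys straight to Counter() or Counter.update() counts occurrences in a C-accelerated helper on CPython, but it only counts rows rather than summing a value field, so it doesn't apply directly to this question.

from collections import Counter

# Counts how many times each (id1, id2) pair appears; reuses `dat` from the question.
pairs = [(d["id1"], d["id2"]) for d in dat]
print(Counter(pairs))
# Counter({(3, 4): 4, (1, 2): 2, (2, 2): 1, (2, 3): 1, (3, 3): 1})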
CodePudding user response:
We could use collections.defaultdict as well:
from collections import defaultdict
tmp = defaultdict(int)
for d in dat:
    tmp[d['id1'], d['id2']] += d['value']

out = [{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in tmp.items()]
or (assuming the ids are already sorted) itertools.groupby:
from itertools import groupby
out = [{'id1': k1, 'id2': k2, 'value': sum(d['value'] for d in g)}
       for (k1, k2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]
or groupby + sum + to_dict('records') in pandas:
import pandas as pd

out = pd.DataFrame(dat).groupby(['id1', 'id2'], as_index=False)['value'].sum().to_dict('records')
Output:
[{'id1': 1, 'id2': 2, 'value': 3},
{'id1': 2, 'id2': 2, 'value': 2},
{'id1': 2, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 4, 'value': 4}]
A basic benchmark on the provided data says groupby using itemgetter (as suggested by @ShadowRanger) is the fastest:
- defaultdict: 6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- Counter (Bob): 9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- groupby itemgetter (ShadowRanger): 6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- groupby lambda: 9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- pandas: 3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now if we duplicate dat a million times, i.e. do
dat = dat*1_000_000
dat.sort(key=itemgetter('id1', 'id2'))
and run the same benchmark again, groupby with itemgetter is the runaway winner:
- defaultdict: 3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- Counter (Bob): 5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- groupby itemgetter (ShadowRanger): 1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- groupby lambda: 3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- pandas: 15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run on Python 3.9.7 (64-bit).
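(The timings above come from IPython's %timeit; a rough stdlib-only way to reproduce them, with hypothetical function names standing in for the snippets above, could look like this:)

import timeit

def bench(label, fn, data, number=100):
    # fn is one of the approaches above wrapped in a function that takes the
    # (pre-sorted) list and returns the deduplicated result; the names in the
    # commented-out calls below are placeholders, not real functions.
    seconds = timeit.timeit(lambda: fn(data), number=number)
    print(f"{label}: {seconds / number * 1e6:.2f} µs per call")

# bench("defaultdict", dedupe_defaultdict, dat)
# bench("groupby itemgetter", dedupe_groupby_itemgetter, dat)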
This benchmark somewhat favors groupby, since there are very few distinct groups when we simply duplicate an existing small list of dicts. If we randomize the group sizes instead, groupby with itemgetter still beats the rest, but the difference is not as stark.
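If you want to check that yourself, here is a rough sketch for generating already-grouped test data with randomized group sizes (the key ranges and duplication counts are arbitrary choices of mine):

import random

def make_data(n_groups=100_000, max_dupes=20, seed=0):
    # Emits rows already grouped by (id1, id2), so no pre-sort is needed;
    # each (id1, id2) pair is repeated a random number of times.
    rng = random.Random(seed)
    rows = []
    for i in range(n_groups):
        id1, id2 = divmod(i, 300)  # arbitrary way to spread out the key space
        for _ in range(rng.randint(1, max_dupes)):
            rows.append({"id1": id1, "id2": id2, "value": rng.randint(1, 5)})
    return rows

dat = make_data()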