Home > Software design >  Find duplicate values in list of dictionaries
Find duplicate values in list of dictionaries

Time:12-21

I need to find dictionaries in a list that have the same key-value, and create a new list in which only the first dictionary is kept.

Example list:

lst_in = [{'First': 1, 'Second': 4}, {'First': 2, 'Second': 5}, {'First': 3, 'Second': 4}]

Duplicate key-value to iterate over should be 'Second'. So in the example the first and third dictionary are the same. I have tried looking at Find duplicates in python list of dictionaries and python list of dictionaries find duplicates based on value, but I can't find the exact answer. I am only looking at one key-value. The dictionaries will always have the same keys.

Expected output:

lst_out = [{'First': 1, 'Second': 4}, {'First': 2, 'Second': 5}]

CodePudding user response:

Some solutions and a benchmark.

Solutions

Fun with a dict, going forwards to get the order and backwards to get the first value.

lst_out = list({d['Second']: d
                for s in [1, -1]
                for d in lst_in[::s]}.values())

Or using setdefault to keep track of each value's first dict:

tmp = {}
for d in lst_in:
    tmp.setdefault(d['Second'], d)
lst_out = list(tmp.values())

Fun and potentially faster version:

add = {}.setdefault
for d in lst_in:
    add(d['Second'], d)
lst_out = list(add.__self__.values())

Benchmark

Times for a list of 1000 dicts with 100 different Second values (using Python 3.10.0):

 361 μs   362 μs   364 μs  dict_forward_backward
 295 μs   297 μs   297 μs  dict_setdefault
 231 μs   231 μs   232 μs  dict_setdefault_optimized
 196 μs   196 μs   197 μs  set_in_list_comprehension
 190 μs   190 μs   190 μs  set_in_list_comprehension_optimized
 191 μs   191 μs   191 μs  set_in_list_comprehension_optimized_2
 201 μs   201 μs   201 μs  set_with_loop
1747 μs  1751 μs  1774 μs  with_lists

Benchmark code:

from timeit import repeat, default_timer as timer
from random import choices

lst_in = [{'First': i, 'Second': v}
          for i, v in enumerate(choices(range(100), k=1000))]

def dict_forward_backward(lst_in):
    return list({d['Second']: d
                 for s in [1, -1]
                 for d in lst_in[::s]}.values())

def dict_setdefault(lst_in):
    tmp = {}
    for d in lst_in:
        tmp.setdefault(d['Second'], d)
    return list(tmp.values())

def dict_setdefault_optimized(lst_in):
    add = {}.setdefault
    for d in lst_in:
        add(d['Second'], d)
    return list(add.__self__.values())

def set_in_list_comprehension(lst_in):
    return [s.add(v) or d
            for s in [set()]
            for d in lst_in
            for v in [d['Second']]
            if v not in s]

def set_in_list_comprehension_optimized(lst_in):
    return [add(v) or d
            for s in [set()]
            for add in [s.add]
            for d in lst_in
            for v in [d['Second']]
            if v not in s]

def set_in_list_comprehension_optimized_2(lst_in):
    s = set()
    add = s.add
    return [add(v) or d
            for d in lst_in
            for v in [d['Second']]
            if v not in s]

def set_with_loop(lst_in):
    found = set()
    lst_out = []
    for dct in lst_in:
        if dct['Second'] not in found:
            lst_out.append(dct)
            found.add( dct['Second'] )
    return lst_out

def with_lists(lst_in):
    out = {'keep':[], 'counter':[]}
    for dct in lst_in:
        if dct['Second'] not in out['counter']:
            out['keep'].append(dct)
            out['counter'].append(dct['Second'])
    return out['keep']

funcs = [
    dict_forward_backward,
    dict_setdefault,
    dict_setdefault_optimized,
    set_in_list_comprehension,
    set_in_list_comprehension_optimized,
    set_in_list_comprehension_optimized_2,
    set_with_loop,
    with_lists,
]

# Correctness
expect = funcs[0](lst_in)
for func in funcs[1:]:
    result = func(lst_in)
    print(result == expect, func.__name__)
print()

# Speed
for _ in range(3):
    for func in funcs:
        ts = sorted(repeat(lambda: func(lst_in), 'gc.enable(); gc.collect()', number=1000))[:3]
        print(*('M μs ' % (t * 1e3) for t in ts), func.__name__)
    print()

CodePudding user response:

Sets are great for those "have I already seen this?" problems.

lst_in = [{'First': 1, 'Second': 4}, {'First': 2, 'Second': 5}, {'First': 3, 'Second': 4}]

found = set()
lst_out = []
for dct in lst_in:
    if dct['Second'] not in found:
        lst_out.append(dct)
        found.add( dct['Second'] )

CodePudding user response:

You can keep a tracker for the Second values and check for it as you add dictionaries to the final list:

lst_in = [{'First': 1, 'Second': 4}, {'First': 2, 'Second': 5}, {'First': 3, 'Second': 4}]
out = {'keep':[], 'counter':[]}
for dct in lst_in:
    if dct['Second'] not in out['counter']:
        out['keep'].append(dct)
        out['counter'].append(dct['Second'])
print(out['keep'])

Output:

[{'First': 1, 'Second': 4}, {'First': 2, 'Second': 5}]
  • Related