Dedupe a list of dicts where the match criterion is multiple key/value pairs being identical

For the given sample input list, I want to dedupe the dicts based on the values of the keys code, tc, signal, and in_force all matching.

sample input:

signals = [
    None,
    None,
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2, 'target': 1},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 3, 'target': 2},
    None,
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 4, 'target': 3},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 5, 'target': 4},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
    None,
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 7, 'target': 6},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 8, 'target': 7},
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 9, 'target': 8},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 0, 'target': 9},
]

expected/desired output:

[
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2, 'target': 1},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 3, 'target': 2},
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 4, 'target': 3},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 5, 'target': 4},
]

The order of the list does not need to be preserved, and whether it returns the 1st or nth matching dict in the list does not matter.

I could make a very verbose version of this reference code that creates each list of matching key/values, but I feel like there's got to be a better way.

new_list = []
for signal in signals:
    # Skip the None placeholders.
    if not isinstance(signal, dict):
        continue
    # Hard-coded check for one combination only; every other combination
    # would need its own copy of this block.
    if (
        signal["code"] == "sr"
        and signal["tc"] == 0
        and signal["signal"] == "2U-2D"
        and signal["in_force"] == True
    ):
        new_list.append(signal)

CodePudding user response：

I'd suggest something like this, with only Python's standard library:

result = []
seen = set()
for s in signals:
    # Skip the None placeholders.
    if not isinstance(s, dict):
        continue
    # The tuple of the four key values identifies a duplicate group.
    signature = (s['code'], s['tc'], s['signal'], s['in_force'])
    if signature in seen:
        continue
    seen.add(signature)
    result.append(s)
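
The same loop can be wrapped in a small reusable function. This is only a minimal sketch: the name dedupe_by_keys and its parameters are made up for illustration, and it assumes every dict contains all the listed keys and that their values are hashable.

```python
def dedupe_by_keys(items, keys):
    """Return the first dict seen for each distinct combination of `keys`.

    Hypothetical helper: non-dict entries (e.g. None) are skipped.
    """
    seen = set()
    result = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Tuple of the selected values; hashable as long as the values are.
        signature = tuple(item[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            result.append(item)
    return result


# Shortened, made-up sample in the shape of the question's input.
sample = [
    None,
    {'code': 'sr', 'tc': 0, 'trigger': 1},
    {'code': 'lr', 'tc': 0, 'trigger': 2},
    {'code': 'sr', 'tc': 0, 'trigger': 3},
]
unique = dedupe_by_keys(sample, ('code', 'tc'))
print([d['trigger'] for d in unique])  # [1, 2]
```

For the question's data the call would be dedupe_by_keys(signals, ('code', 'tc', 'signal', 'in_force')).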

CodePudding user response：

I don't know whether a third-party library is wanted, but pandas could come in quite handy here. If you have other tasks to do with the data, a DataFrame is also a convenient way to handle them.

import pandas as pd
# filter None to only have a list of dicts, then create a df with it
df = pd.DataFrame(filter(None,signals)) 

out = df.drop_duplicates(subset=['code', 'tc', 'signal', 'in_force'], keep='first')

out.to_dict('records')

Output:

[{'code': 'sr',
  'tc': 0,
  'signal': '2U-2D',
  'in_force': True,
  'trigger': 1,
  'target': 0},
 {'code': 'lr',
  'tc': 0,
  'signal': '2U-2D',
  'in_force': True,
  'trigger': 2,
  'target': 1},
 {'code': 'sr',
  'tc': 1,
  'signal': '2U-2D',
  'in_force': True,
  'trigger': 3,
  'target': 2},
 {'code': 'sr',
  'tc': 0,
  'signal': '1-2U-2D',
  'in_force': True,
  'trigger': 4,
  'target': 3},
 {'code': 'sr',
  'tc': 0,
  'signal': '2U-2D',
  'in_force': False,
  'trigger': 5,
  'target': 4}]

CodePudding user response：

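import pandas as pd

keys = ['code', 'tc', 'signal', 'in_force']
# Keep only the dicts, dropping the None placeholders.
new_list = pd.Series([s for s in signals if isinstance(s, dict)])
# Build a hashable tuple signature per dict (a set is unhashable and loses order).
is_dup = new_list.apply(lambda x: tuple(x[k] for k in keys)).duplicated()
# Keep the entries that are NOT repeats of an earlier signature.
new_list = new_list[~is_dup].tolist()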
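
When building the signature for comparison, a tuple is a safer choice than a set: a set is unhashable, so it cannot be checked for duplicates, and it also discards position and merges equal values (0 == False in Python). A small stdlib-only illustration, using a made-up row:

```python
keys = ['code', 'tc', 'signal', 'in_force']
row = {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False}

as_set = {row[k] for k in keys}
as_tuple = tuple(row[k] for k in keys)

print(len(as_set))    # 3, because 0 == False and a set keeps only one of them
print(len(as_tuple))  # 4, one slot per key, in key order

# A tuple of hashable values can go into a set or dict; a set cannot.
seen = {as_tuple}
try:
    seen.add(as_set)
    set_is_hashable = True
except TypeError:
    set_is_hashable = False
print(set_is_hashable)  # False
```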

CodePudding user response：

You could use pandas dataframe to drop duplicates using df.duplicated()

import pandas as pd

signals = [
    None,
    None,
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2, 'target': 1},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 3, 'target': 2},
    None,
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 4, 'target': 3},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 5, 'target': 4},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
    None,
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 7, 'target': 6},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 8, 'target': 7},
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 9, 'target': 8},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 0, 'target': 9},
]
signals = [x for x in signals if x is not None]
df = pd.DataFrame(signals)
# duplicated() flags the repeated rows, so negate the mask to keep one row per combination
df1 = df[~df.duplicated(['code', 'tc', 'signal', 'in_force'])]
print(df1)

  code  tc   signal  in_force  trigger  target
0   sr   0    2U-2D      True        1       0
1   lr   0    2U-2D      True        2       1
2   sr   1    2U-2D      True        3       2
3   sr   0  1-2U-2D      True        4       3
4   sr   0    2U-2D     False        5       4

And if you need the output to be a list of dictionaries, you could do

df1.to_dict('records')

[{'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0},
 {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2, 'target': 1},
 {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 3, 'target': 2},
 {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 4, 'target': 3},
 {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 5, 'target': 4}]

CodePudding user response：

I found a solution that fits into one line of code and does not use any external libraries.

To begin with, let's filter out all None values:

signals = [signal for signal in signals if signal is not None]

Now let's create a dict whose keys are the repr strings of the code, tc, signal, and in_force values of each input dict (this works as long as those values are simple types) and whose values are the complete dicts (with all their keys). Since a dict cannot contain duplicate keys, the duplicates disappear, with the last dict for each key winning:

filter_dict = {repr([signal[key] for key in ('code', 'tc', 'signal', 'in_force')]): signal for signal in signals}

Here's what I've got at this point:

{
    "['sr', 0, '2U-2D', True]": {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
    "['lr', 0, '2U-2D', True]": {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 7, 'target': 6},
    "['sr', 1, '2U-2D', True]": {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 8, 'target': 7},
    "['sr', 0, '1-2U-2D', True]": {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 9, 'target': 8},
    "['sr', 0, '2U-2D', False]": {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 0, 'target': 9}
}

Now just take the values of that dict, and it's all done:

result = list(filter_dict.values())

All these steps can be joined into one line of code:

result = list({repr([signal[key] for key in ('code', 'tc', 'signal', 'in_force')]): signal for signal in signals if signal is not None}.values())

Final result:

[
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 7, 'target': 6},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 8, 'target': 7},
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 9, 'target': 8},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 0, 'target': 9}
]

My solution may not be the fastest (because it builds strings), and it may not work with every class that could appear in the original dicts (because repr does not produce a stable, distinguishing string for every class). But at least it's very simple.
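
If the key values are hashable, a tuple can serve as the dict key directly, which sidesteps the repr concerns. A sketch under that assumption, on a shortened made-up input:

```python
signals = [
    None,
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2},
    None,
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6},
]
keys = ('code', 'tc', 'signal', 'in_force')

# Later dicts overwrite earlier ones with the same key tuple,
# but the key keeps its original insertion position.
result = list({tuple(s[k] for k in keys): s for s in signals if s is not None}.values())
print([s['trigger'] for s in result])  # [6, 2]
```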

CodePudding user response：

Use filter to skip the None entries and keep tuples of "seen" values in a set for efficient checking.

import operator

seen = set()
clean = []

# Function to get the values for the keys that we are interested in.
getter = operator.itemgetter('code', 'tc', 'signal', 'in_force')

for signal in filter(None, signals):
    if (vals := getter(signal)) in seen:
        # We have already got a dict with these values - skip.
        continue
    seen.add(vals)
    clean.append(signal)

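# `expected` is the desired output list from the question.
# Note: the := operator above requires Python 3.8+.
assert len(clean) == len(expected)
assert all(item in expected for item in clean)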
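
Two details of the building blocks used here may be worth seeing in isolation. The sketch below uses a single made-up row: operator.itemgetter with several keys returns a tuple (which is what makes it usable as a set member), and filter(None, ...) drops every falsy item, so empty dicts are skipped along with None.

```python
import operator

getter = operator.itemgetter('code', 'tc', 'signal', 'in_force')
row = {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0}

# With more than one key, itemgetter returns the values as a tuple.
print(getter(row))  # ('sr', 0, '2U-2D', True)

# filter(None, ...) keeps only truthy items: None and {} are both removed.
kept = list(filter(None, [None, {}, row]))
print(kept == [row])  # True
```

Note that getter raises KeyError if a dict is missing one of the keys, so this relies on every signal having all four.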