Dictionary of dictionaries - Removing duplications in only certain keys/values

Time:09-07

I have a dictionary of dictionaries, a sample is below:

my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

I want to remove duplicates in my_dictionary where the values of both "Name" AND "Age" are the same. It does not matter which one is removed (there could be many duplicates, but I want only one of them to remain).

So in our example above, the output would be:

{'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'},
 '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
 '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}

Nick, 39 counts as a duplicate even though the entries have different countries.

Is there an easy/efficient way of doing this? I have several million rows.

CodePudding user response:

Track seen records, for example:

my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

seen = set()
result = {}
for k, v in my_dictionary.items():
    if (v['Name'], v['Age']) not in seen:
        result[k] = v
        seen.add((v['Name'], v['Age']))

print(result)

Output:

{
    '0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'}, 
    '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}, 
    '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}
}

Edit note: Using set() (which is backed by a hash table) to track seen records gives an overall complexity of O(n) on average for n rows, since each membership check and insertion takes constant time on average.
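If the fields that define a duplicate may change, the same loop can be wrapped in a small function. This is only a sketch: the name dedupe_by_fields and its fields parameter are made up for illustration and are not part of the original answer.

```python
from operator import itemgetter


def dedupe_by_fields(records, fields=("Name", "Age")):
    """Keep the first record seen for each combination of the given fields.

    `dedupe_by_fields` and `fields` are hypothetical names for this sketch.
    """
    # itemgetter returns a tuple for several fields and a single value for
    # one field; both are hashable here, so both work as set members.
    get_key = itemgetter(*fields)
    seen = set()
    result = {}
    for k, v in records.items():
        key = get_key(v)
        if key not in seen:
            seen.add(key)
            result[k] = v
    return result


sample = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

by_name_age = dedupe_by_fields(sample)
by_country = dedupe_by_fields(sample, fields=("Country",))
```

Here by_name_age keeps keys "0", "1" and "2", while by_country keeps one record per country. The inner dictionaries are not copied, so for several million rows the extra memory is mostly the set of keys.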

CodePudding user response:

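Two dictionary comprehensions. This is shorter to write, but it will be slower than using a set. Note that it keeps the last occurrence of each duplicate rather than the first, and that _ (the previous result) is only available in the interactive interpreter.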

>>> {(v['Name'], v['Age']): k for k, v in my_dictionary.items()}
{('Nick', 39): '4', ('Steve', 19): '1', ('Dave', 23): '2'}
>>> {k: my_dictionary[k] for k in _.values()}
{'4': {'Name': 'Nick', 'Age': 39, 'Country': 'France'},
 '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
 '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}
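Outside the interactive interpreter there is no _ variable, so the intermediate mapping needs a name. The sketch below is one possible variant, not the answerer's code: it uses dict.setdefault so that the first key seen for each (Name, Age) pair is kept instead of the last. The name first_key is an assumption for illustration.

```python
my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

# Map each (Name, Age) pair to the first outer key that carries it.
# setdefault only stores a value when the pair is not present yet.
first_key = {}
for k, v in my_dictionary.items():
    first_key.setdefault((v["Name"], v["Age"]), k)

# Rebuild the result from the surviving outer keys.
result = {k: my_dictionary[k] for k in first_key.values()}
print(result)
```

With the sample data this keeps keys '0', '1' and '2', matching the output requested in the question.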