I have a dictionary of dictionaries; a sample is below:
my_dictionary = {
"0": {"Name": "Nick", "Age": 39, "Country": "UK"},
"1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
"2": {"Name": "Dave", "Age": 23, "Country": "UK"},
"3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
"4": {"Name": "Nick", "Age": 39, "Country": "France"},
}
I want to remove duplicates in my_dictionary if the values of "Name" AND "Age" are the same. It does not matter which one is removed (there could be many that are the same), but I only want one version to remain.
So in our example above, the output would be:
{'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'},
'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
'2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}
This is because Nick, 39 was duplicated despite having a different country.
Is there an easy/efficient way of doing this? I have several million rows.
CodePudding user response:
Track seen records, for example:
my_dictionary = {
"0": {"Name": "Nick", "Age": 39, "Country": "UK"},
"1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
"2": {"Name": "Dave", "Age": 23, "Country": "UK"},
"3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
"4": {"Name": "Nick", "Age": 39, "Country": "France"},
}
seen = set()
result = {}
for k, v in my_dictionary.items():
    key = (v['Name'], v['Age'])  # records are duplicates when Name and Age match
    if key not in seen:
        result[k] = v  # keep the first record seen for this key
        seen.add(key)
print(result)
Output:
{
'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'},
'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
'2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}
}
Edit note: Using set() (which uses a hash table, so membership checks take constant time on average) for tracking leads to an overall complexity of O(n) for n rows.
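If the fields that define a duplicate might change, the same idea can be wrapped in a small helper. This is only a sketch: the function name dedupe_by_fields and its fields parameter are made up for illustration and are not part of the original answer. It assumes every record contains the listed fields and that their values are hashable.

```python
def dedupe_by_fields(records, fields=("Name", "Age")):
    """Keep the first record seen for each combination of `fields`.

    `records` is a dict of dicts; `fields` is a hypothetical parameter
    naming the keys whose values decide whether two records are duplicates.
    """
    seen = set()
    result = {}
    for key, record in records.items():
        # Tuples are hashable, so they can be stored in the set.
        signature = tuple(record[field] for field in fields)
        if signature not in seen:
            seen.add(signature)
            result[key] = record
    return result


sample = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

deduped = dedupe_by_fields(sample)
by_country = dedupe_by_fields(sample, fields=("Country",))
```

Here deduped holds records "0", "1" and "2", while by_country keeps the first record for each country. It is still a single pass over the data, so it should scale to several million rows in the same way as the loop above.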
CodePudding user response:
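Use two dictionary comprehensions. This is easier to write, but it will be slower than using a set. Note that it keeps the last duplicate rather than the first, and that _ refers to the previous result only in the interactive interpreter.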
>>> {(v['Name'], v['Age']): k for k, v in my_dictionary.items()}
{('Nick', 39): '4', ('Steve', 19): '1', ('Dave', 23): '2'}
>>> {k: my_dictionary[k] for k in _.values()}
{'4': {'Name': 'Nick', 'Age': 39, 'Country': 'France'},
'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
'2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}
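Outside the interactive interpreter, _ is not set to the previous result, so the two steps need an explicit intermediate name. The sketch below is one possible way to write that as a script. It swaps the first comprehension for a loop with dict.setdefault so that the first duplicate is kept instead of the last; the function name dedupe_keep_first is invented for illustration.

```python
def dedupe_keep_first(records):
    """Map each (Name, Age) pair to the first key that has it, then rebuild."""
    first_key = {}
    for key, record in records.items():
        # setdefault only stores the key if the pair has not been seen yet.
        first_key.setdefault((record["Name"], record["Age"]), key)
    return {key: records[key] for key in first_key.values()}


my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

result = dedupe_keep_first(my_dictionary)
```

With the sample data, result keeps keys "0", "1" and "2", matching the output requested in the question.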