I have many dictionaries in one list. For example:
totalList = [
{'id': 1111, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'},
{'id': 1412, 'source': 'user_2', 'count_id': 5, 'description': 'bbbb'},
{'id': 5123, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'},
{'id': 1982, 'source': 'user_3', 'count_id': 7, 'description': 'bbbb'},
{'id': 3198, 'source': 'user_3', 'count_id': 7, 'description': 'bbbb'},
{'id': 1082, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'}
]
- The ids are always different.
- All dictionaries have the same keys.
I want to find the ids of the entries that share the same source, count_id and description values. In this example, I only need the ids. Expected output:
1111, 5123, 1082 same
1982, 3198 same
How can I achieve this?
Thanks.
CodePudding user response:
I'd reformat the data into a dictionary of items, where each key is a tuple of the three values you care about. Then you can iterate through the dictionary and efficiently find duplicates.
# Original data
totalList = [
{'id': 1111, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'},
{'id': 1412, 'source': 'user_2', 'count_id': 5, 'description': 'bbbb'},
{'id': 5123, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'},
{'id': 1982, 'source': 'user_3', 'count_id': 7, 'description': 'bbbb'},
{'id': 3198, 'source': 'user_3', 'count_id': 7, 'description': 'bbbb'},
{'id': 1082, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'}
]
# Detect duplicates
from collections import defaultdict
def get_key(item):
    # Group on the three values that must match
    return (item['source'], item['count_id'], item['description'])

ids_by_source_count_and_desc = defaultdict(list)
for item in totalList:
    ids_by_source_count_and_desc[get_key(item)].append(item['id'])

# Report only the groups that contain more than one id
for key in ids_by_source_count_and_desc:
    ids = ids_by_source_count_and_desc[key]
    if len(ids) > 1:
        print(key, "same", ids)
I also use defaultdict to avoid having to check whether the dictionary I'm inserting into already contains a list for that key.
Output:
('user_1', 10, 'aaaa') same [1111, 5123, 1082]
('user_3', 7, 'bbbb') same [1982, 3198]
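If you only want the ids printed in the format from the question, the same grouping can be wrapped in a small function. This is a minimal sketch rather than part of the code above: the function name find_duplicate_ids and its keys parameter are hypothetical names chosen for illustration, and it uses dict.setdefault instead of defaultdict to show the explicit "is there a list yet?" step that defaultdict handles for you.

```python
def find_duplicate_ids(items, keys=("source", "count_id", "description")):
    """Return one list of ids per group of items that agree on all `keys`."""
    groups = {}
    for item in items:
        # Build a hashable key from the values that must match
        group_key = tuple(item[k] for k in keys)
        # setdefault creates the empty list the first time a key is seen
        groups.setdefault(group_key, []).append(item["id"])
    # Keep only the groups that actually contain duplicates
    return [ids for ids in groups.values() if len(ids) > 1]


totalList = [
    {'id': 1111, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'},
    {'id': 1412, 'source': 'user_2', 'count_id': 5, 'description': 'bbbb'},
    {'id': 5123, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'},
    {'id': 1982, 'source': 'user_3', 'count_id': 7, 'description': 'bbbb'},
    {'id': 3198, 'source': 'user_3', 'count_id': 7, 'description': 'bbbb'},
    {'id': 1082, 'source': 'user_1', 'count_id': 10, 'description': 'aaaa'}
]

for ids in find_duplicate_ids(totalList):
    # The ids are integers, so convert them to strings before joining
    print(", ".join(map(str, ids)), "same")
```

Because regular dictionaries keep insertion order in Python 3.7+, the groups come out in the order their first item appears in the list.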
CodePudding user response:
Personally, I find that working with pandas often makes this kind of code faster to write and simpler. Here is what I came up with:
import pandas as pd
df = pd.DataFrame(totalList)
result = {}
groups = df.groupby(by=["source", "count_id", "description"])["id"]
for name, group in groups:
    tempList = group.tolist()
    # Keep only the groups with more than one id
    if len(tempList) > 1:
        result[name] = tempList
result
Output:
{('user_1', 10, 'aaaa'): [1111, 5123, 1082],
('user_3', 7, 'bbbb'): [1982, 3198]}
To get the same output as the one in your question, you just need to loop over the result dictionary and call join on each list:
for key, value in result.items():
    print(",".join(str(v) for v in value) + " same")
Final output:
1111,5123,1082 same
1982,3198 same
Note that we need str(v) for v in value inside the join call because the list does not contain strings; it contains integers, and join only accepts strings.
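The point about join can be checked with a short sketch. The variable names here (ids, error_message, line) are illustrative only and do not come from the code above.

```python
ids = [1111, 5123, 1082]

# Joining integers directly fails, because str.join expects strings
try:
    ",".join(ids)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Converting each element to a string first makes the join work
line = ",".join(str(v) for v in ids) + " same"
print(line)
```

Using map(str, ids) instead of the generator expression would work equally well; which one to use is a matter of taste.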