I wrote the following to compare two lists of dictionaries on a specific key and pull out the entries that are duplicated between them. The inner loop, for d2 in l2, is not slow by itself, but l2 has about 50,000 items, so its enclosing loop takes a very long time: the outer loop, for l in list, runs over about 300,000 items, and the whole thing takes roughly 50 minutes to finish. What are some ways to improve this?
l = [
{
"id": 1,
"name": "John"
},
{
"id": 2,
"name": "Tom"
}
]
l2 = [
{
"name": "John",
"gender": "male",
"country": "USA"
},
{
"name": "Alex",
"gender": "male",
"country": "Canada"
},
{
"name": "Sofía",
"gender": "female",
"country": "Mexico"
},
]
Desired result:
[
{
"name": "Alex",
"gender": "male",
"country": "Canada"
},
{
"name": "Sofía",
"gender": "female",
"country": "Mexico"
},
]
new_datas = []
for l in list:  # about 300,000 items
    l2 = [...]  # about 50,000 items
    s = {d["name"] for d in l}
    new_datas.append([d2 for d2 in l2 if d2["name"] in s])
CodePudding user response:
You could try this:
l = [dic["name"] for dic in l]
cleanList = [dic for dic in l2 if dic["name"] not in l]
Alternatively, you could split the second list, and approach this problem with parallel computing.
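A minimal sketch of that split-and-parallelize idea, using the sample data from the question. The names filter_chunk and parallel_filter are hypothetical, and ThreadPoolExecutor is used here only to keep the sketch self-contained; for CPU-bound pure-Python filtering, a process pool would be needed to get around the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

# Sample data standing in for the real 300k/50k lists.
l = [{"id": 1, "name": "John"}, {"id": 2, "name": "Tom"}]
l2 = [
    {"name": "John", "gender": "male", "country": "USA"},
    {"name": "Alex", "gender": "male", "country": "Canada"},
    {"name": "Sofía", "gender": "female", "country": "Mexico"},
]

names = {d["name"] for d in l}  # set gives O(1) average membership tests

def filter_chunk(chunk):
    # Keep entries whose name does not appear in the first list.
    return [d for d in chunk if d["name"] not in names]

def parallel_filter(data, workers=4):
    # Split the list into roughly equal chunks, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(filter_chunk, chunks)
    # Re-join the per-chunk results, preserving the original order.
    return [d for part in results for d in part]

print(parallel_filter(l2))  # keeps Alex and Sofía, drops John
```

Note that with the set-based lookup the filtering is already close to linear time, so parallelism mostly pays off when the per-item work is heavier than a single membership test.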
CodePudding user response:
I would start by creating an l_lookup via set() (I recommend you use a more expressive name, though). At that point, checking whether a name is in the set should be relatively fast.
l = [
{"id": 1, "name": "John"},
{"id": 2, "name": "Tom"}
]
l2 = [
{"name": "John", "gender": "male", "country": "USA"},
{"name": "Alex", "gender": "male", "country": "Canada"},
{"name": "Sofía", "gender": "female", "country": "Mexico"},
]
l_lookup = set(x["name"] for x in l)
results = [item for item in l2 if item["name"] not in l_lookup]
print(results)
Should give you:
[
{'name': 'Alex', 'gender': 'male', 'country': 'Canada'},
{'name': 'Sofía', 'gender': 'female', 'country': 'Mexico'}
]
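To see why the set matters: membership in a list is a linear scan, so testing 50,000 names against a 300,000-item list means billions of comparisons, while a set does each test in roughly constant time. A rough, machine-dependent illustration (the 50,000-name data here is synthetic):

```python
import timeit

# Synthetic data: 50,000 names as a list and as a set.
names_list = [f"user{i}" for i in range(50_000)]
names_set = set(names_list)

probe = "user49999"  # worst case for the list: scanned last

t_list = timeit.timeit(lambda: probe in names_list, number=1_000)
t_set = timeit.timeit(lambda: probe in names_set, number=1_000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

The exact numbers vary, but the set lookup should come out orders of magnitude faster, which is where the drop from ~50 minutes to seconds comes from.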