How to speed up the slow duplicate check extraction loop between dictionary lists?

Time:04-01

I wrote the code below to compare two lists of dictionaries on a specific key (name) and extract entries from the second list depending on whether that key also appears in the first. When l2 has around 50,000 entries, the enclosing loop takes a very long time.

The inner loop, for d2 in l2, does not take long on its own. But the outer loop, for l in list, runs over about 300,000 items, so the whole run takes about 50 minutes.

What are some ways to speed this up?

l = [
  {
    "id": 1,
    "name": "John"
  },
  {
    "id": 2,
    "name": "Tom"
  }
]

l2 = [
  {
    "name": "John",
    "gender": "male",
    "country": "USA"
  },
  {
    "name": "Alex",
    "gender": "male",
    "country": "Canada"
  },
  {
    "name": "Sofía",
    "gender": "female",
    "country": "Mexico"
  },
]

Results sought

[
  {
    "name": "Alex",
    "gender": "male",
    "country": "Canada"
  },
  {
    "name": "Sofía",
    "gender": "female",
    "country": "Mexico"
  },
]

Current code

new_datas = []
for l in list:  # about 300k items; each l is iterated as a list of dicts
    l2 = [...]  # about 50k dicts
    s = {d["name"] for d in l}  # names present in l

    new_datas.append([
        d2
        for d2 in l2
        if d2["name"] in s
    ])

CodePudding user response:

You could try this:

l = [dic["name"] for dic in l]
cleanList = [dic for dic in l2 if dic["name"] not in l]

Alternatively, you could split the second list, and approach this problem with parallel computing.
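
As a rough illustration of the splitting idea, here is a minimal sketch using the standard library's concurrent.futures.ProcessPoolExecutor. The helper names (split_chunks, filter_chunk, filter_parallel) and the n_workers default are made up for this example, and it assumes the goal is to keep the entries of l2 whose name does not appear in l. Each worker receives a copy of its chunk and of the name set, so for cheap lookups like this the process overhead may well outweigh any gain; it is worth measuring before adopting.

```python
from concurrent.futures import ProcessPoolExecutor


def split_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous slices (hypothetical helper)."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def filter_chunk(args):
    """Keep the dicts in one chunk whose "name" is not in the given set."""
    chunk, names = args
    return [d for d in chunk if d["name"] not in names]


def filter_parallel(l, l2, n_workers=4):
    """Filter l2 against the names in l, one chunk of l2 per worker process."""
    names = frozenset(d["name"] for d in l)
    chunks = split_chunks(l2, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves input order, so the result keeps l2's ordering.
        parts = list(pool.map(filter_chunk, [(c, names) for c in chunks]))
    return [d for part in parts for d in part]
```

On platforms that start worker processes with spawn (Windows, and macOS by default), call filter_parallel(l, l2) from under an if __name__ == "__main__": guard.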

CodePudding user response:

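I would start by building a lookup set of the names in l via set(); I call it l_lookup here, though I recommend a more expressive name. Membership tests on a set take roughly constant time on average, whereas testing against a list scans it element by element, so checking whether a name is in the set should be fast.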

l = [
    {"id": 1, "name": "John"},
    {"id": 2, "name": "Tom"}
]

l2 = [
    {"name": "John", "gender": "male", "country": "USA"},
    {"name": "Alex", "gender": "male", "country": "Canada"},
    {"name": "Sofía", "gender": "female", "country": "Mexico"},
]

l_lookup = set(x["name"] for x in l)
results = [item for item in l2 if item["name"] not in l_lookup]

print(results)

Should give you:

[
    {'name': 'Alex', 'gender': 'male', 'country': 'Canada'},
    {'name': 'Sofía', 'gender': 'female', 'country': 'Mexico'}
]
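
To get a feel for the difference between a list lookup and a set lookup, a small timing sketch like the following may help. The data is synthetic, and the sizes and the function names filter_with_list and filter_with_set are assumptions made for this example; actual timings depend on the machine and the data.

```python
import timeit

# Synthetic data; the sizes are arbitrary and kept small so the run is short.
l = [{"id": i, "name": f"user{i}"} for i in range(2_000)]
l2 = [{"name": f"user{i}", "gender": "n/a", "country": "n/a"}
      for i in range(1_000, 3_000)]


def filter_with_list(l, l2):
    names = [d["name"] for d in l]   # list: each `in` test scans linearly
    return [d for d in l2 if d["name"] not in names]


def filter_with_set(l, l2):
    names = {d["name"] for d in l}   # set: each `in` test is a hash lookup
    return [d for d in l2 if d["name"] not in names]


# Both versions must select the same entries.
assert filter_with_list(l, l2) == filter_with_set(l, l2)

list_time = timeit.timeit(lambda: filter_with_list(l, l2), number=3)
set_time = timeit.timeit(lambda: filter_with_set(l, l2), number=3)
print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s")
```

The gap grows with the size of l, because the list version does work proportional to len(l) for every entry of l2.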