I have a script that retrieves user data from a CSV (~2.5m rows) and record data from Salesforce via API (~2m records), and matches them based on a unique user_id.
For each user, I need the relevant record_id (if it exists). There is a one-to-one relationship between users and records, so a given user_id should only appear on one record.
To try and increase performance, both lists are sorted ascending by user_id, and I break the inner loop if record['user_id'] > user['user_id'], as that means there is no matching record.
It's working, but matching the two datasets is slow, taking ~1.5 hours. Is there a faster method of performing the matching to retrieve the relevant record_id?
Here is an example of the data, current function, and expected result:
users = [
    {"user_id": 11111, "name": "Customer A", "age": 34, "record_id": None},
    {"user_id": 22222, "name": "Customer B", "age": 18, "record_id": None},
    {"user_id": 33333, "name": "Customer C", "age": 66, "record_id": None}
]

records = [
    {"user_id": 11111, "record_id": "ABC123"},
    {"user_id": 33333, "record_id": "GHI789"}
]

upload = []
for user in users:
    for record in records:
        if user['user_id'] == record['user_id']:
            user['record_id'] = record['record_id']
            records.remove(record)
            break
        elif record['user_id'] > user['user_id']:
            break
    if user['record_id']:
        upload.append(user)

print(upload)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
CodePudding user response:
Create a dictionary that maps from a user's id to its corresponding dictionary. Then you can fill in the relevant record_id fields with a for loop over the records. Finally, you can drop the entries without an assigned record_id using a list comprehension.
This doesn't require any preprocessing (e.g. sorting) to obtain a speedup; the efficiency gain comes from the fact that lookups in a dictionary are O(1) on average, while searching a list is O(n):
user_id_mapping = {entry["user_id"]: entry for entry in users}

for record in records:
    if record["user_id"] in user_id_mapping:
        user_id_mapping[record["user_id"]]["record_id"] = record["record_id"]

result = [item for item in user_id_mapping.values() if item["record_id"] is not None]
print(result)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
With that said, if you have to execute this kind of matching repeatedly, I would recommend using a database rather than performing it in Python each time.
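As a rough sketch of what the database route could look like with Python's built-in sqlite3 module (reusing the sample users and records lists from the question; the schema here is purely for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead for a persistent database
cur = conn.cursor()

cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.execute("CREATE TABLE records (user_id INTEGER PRIMARY KEY, record_id TEXT)")

cur.executemany("INSERT INTO users VALUES (?, ?, ?)",
                [(u["user_id"], u["name"], u["age"]) for u in users])
cur.executemany("INSERT INTO records VALUES (?, ?)",
                [(r["user_id"], r["record_id"]) for r in records])

# An inner join returns only the users that have a matching record
rows = cur.execute(
    "SELECT u.user_id, u.name, u.age, r.record_id"
    " FROM users u JOIN records r ON u.user_id = r.user_id"
).fetchall()
print(rows)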
CodePudding user response:
Your approach isn't unreasonable. But removing a record after it's used has a cost, and sorting your two lists ahead of time also has a cost. These costs may add up more than you think they do.
One possible approach would be to NOT sort the lists, but instead build a dict of record_ids, e.g.:
rdict = {r['user_id']: r['record_id'] for r in records}

upload = []
for user in users:
    user_id = user['user_id']
    if user_id in rdict:
        user['record_id'] = rdict[user_id]
        upload.append(user)
This way you pay the price of building the dict once, and every lookup after that is very cheap.
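If you do keep both lists sorted ascending by user_id, another option is a single-pass two-pointer merge, which avoids both the repeated scanning of records and the cost of remove(). A minimal sketch of that idea:
# Assumes users and records are both sorted ascending by user_id,
# with at most one record per user_id
upload = []
i = 0
for user in users:
    # Advance the record pointer past any records with a smaller user_id
    while i < len(records) and records[i]['user_id'] < user['user_id']:
        i += 1
    if i < len(records) and records[i]['user_id'] == user['user_id']:
        user['record_id'] = records[i]['record_id']
        upload.append(user)
        i += 1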
CodePudding user response:
For scalability, you can use pandas DataFrames. Note that the sample users dicts already contain a placeholder record_id key, so that column is dropped before the merge to avoid record_id_x/record_id_y suffix columns:
import pandas as pd

result = pd.merge(pd.DataFrame(users).drop(columns='record_id'),
                  pd.DataFrame(records), on='user_id').to_dict('records')
If you want to keep the entries which do not have a record_id, you can add how="left" to the arguments of the merge function.
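For instance, with the sample data from the question, a left merge keeps Customer B, with its missing record_id filled in as NaN:
import pandas as pd

result = pd.merge(pd.DataFrame(users).drop(columns='record_id'),
                  pd.DataFrame(records), on='user_id', how='left').to_dict('records')
print(result)  # Customer B appears with record_id NaN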
CodePudding user response:
You could use pandas.read_csv() to read your CSV data into a dataframe, and then merge that with the records on the user_id value:
import pandas as pd

users = pd.read_csv('csv file')  # path to your CSV
records = pd.DataFrame('result of salesforce query')  # placeholder for your Salesforce data

result = users.drop('record_id', axis=1).merge(records, on='user_id')
If you want to keep the users which have no matching value in records, change the merge to:
merge(records, on='user_id', how='left')
Note: it may be possible to execute your Salesforce query directly into a dataframe; see for example this Q&A.
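For example, with the simple_salesforce library, a common pattern looks roughly like this; the credentials, object name (Record__c), and field name (User_Id__c) are hypothetical and depend on your Salesforce schema:
import pandas as pd
from simple_salesforce import Salesforce

# Hypothetical credentials -- replace with your own
sf = Salesforce(username='user@example.com', password='...', security_token='...')

# Hypothetical SOQL query; query_all returns a dict with a 'records' key
query_result = sf.query_all("SELECT User_Id__c, Id FROM Record__c")

# Each row carries an 'attributes' metadata entry, which we drop
records = pd.DataFrame(query_result['records']).drop(columns='attributes')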