I have a script that retrieves user data from a CSV (~2.5m rows) and record data from Salesforce via API (~2m records), and matches them based on a unique user_id.
For each user, I need the relevant record_id (if it exists). There is a one-to-one relationship between users and records, so a given user_id should only appear on one record.
To try and increase performance, both lists are sorted ascending by user_id, and I break the inner loop if record['user_id'] > user['user_id'], as that means there is no matching record.
It's working, but matching the two datasets is slow, taking ~1.5 hours. Is there a faster method of performing the matching to retrieve the relevant record_id?
Here is an example of the data, current function, and expected result:
users = [
    {"user_id": 11111, "name": "Customer A", "age": 34, "record_id": None},
    {"user_id": 22222, "name": "Customer B", "age": 18, "record_id": None},
    {"user_id": 33333, "name": "Customer C", "age": 66, "record_id": None}
]

records = [
    {"user_id": 11111, "record_id": "ABC123"},
    {"user_id": 33333, "record_id": "GHI789"}
]

upload = []
for user in users:
    for record in records:
        if user['user_id'] == record['user_id']:
            user['record_id'] = record['record_id']
            records.remove(record)
            break
        elif record['user_id'] > user['user_id']:
            break
    if user['record_id']:
        upload.append(user)

print(upload)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
CodePudding user response:
Create a dictionary that maps from a user's id to its corresponding dictionary. Then you can fill in the relevant record_id fields with a for loop over the records. Finally, you can drop the entries without an assigned record_id using a list comprehension.
This doesn't require any preprocessing (e.g. sorting) to obtain a speedup; the efficiency gain comes from the fact that lookups in a dictionary are O(1) on average, while searching a list is O(n):
user_id_mapping = {entry["user_id"]: entry for entry in users}

for record in records:
    if record["user_id"] in user_id_mapping:
        user_id_mapping[record["user_id"]]["record_id"] = record["record_id"]

result = [item for item in user_id_mapping.values() if item["record_id"] is not None]
print(result)
This outputs:
[
{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
{'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
With that said, if you have to execute this kind of matching repeatedly, I would recommend using a database rather than performing it in Python each time.
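As a rough sketch of what the database route could look like with Python's built-in sqlite3 module (reusing the sample users and records lists from the question; the schema here is purely for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead for a persistent database
cur = conn.cursor()

cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.execute("CREATE TABLE records (user_id INTEGER PRIMARY KEY, record_id TEXT)")

cur.executemany("INSERT INTO users VALUES (?, ?, ?)",
                [(u["user_id"], u["name"], u["age"]) for u in users])
cur.executemany("INSERT INTO records VALUES (?, ?)",
                [(r["user_id"], r["record_id"]) for r in records])

# An inner join returns only the users that have a matching record
rows = cur.execute(
    "SELECT u.user_id, u.name, u.age, r.record_id"
    " FROM users u JOIN records r ON u.user_id = r.user_id"
).fetchall()
print(rows)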
CodePudding user response:
Your approach isn't unreasonable. But removing a record after it's used has a cost, and sorting your two lists ahead of time also has a cost. These costs may add up more than you think they do.
One possible approach would be to NOT sort the lists, but instead build a dict of record_ids, e.g.:
rdict = {r['user_id']: r['record_id'] for r in records}

upload = []
for user in users:
    user_id = user['user_id']
    if user_id in rdict:
        user['record_id'] = rdict[user_id]
        upload.append(user)
This way you pay the price of building the dict once, and every lookup after that is very cheap.
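If you do keep both lists sorted ascending by user_id, another option is a single-pass two-pointer merge, which avoids both the repeated scanning of records and the cost of remove(). A minimal sketch of that idea:
# Assumes users and records are both sorted ascending by user_id,
# with at most one record per user_id
upload = []
i = 0
for user in users:
    # Advance the record pointer past any records with a smaller user_id
    while i < len(records) and records[i]['user_id'] < user['user_id']:
        i += 1
    if i < len(records) and records[i]['user_id'] == user['user_id']:
        user['record_id'] = records[i]['record_id']
        upload.append(user)
        i += 1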
CodePudding user response:
For scalability, you can use pandas DataFrames. Note that the sample users dicts already contain a placeholder record_id key, so that column is dropped before the merge to avoid record_id_x/record_id_y suffix columns:
import pandas as pd

result = pd.merge(pd.DataFrame(users).drop(columns='record_id'),
                  pd.DataFrame(records), on='user_id').to_dict('records')
If you want to keep the entries which do not have a record_id, you can add how="left" to the arguments of the merge function.
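For instance, with the sample data from the question, a left merge keeps Customer B, with its missing record_id filled in as NaN:
import pandas as pd

result = pd.merge(pd.DataFrame(users).drop(columns='record_id'),
                  pd.DataFrame(records), on='user_id', how='left').to_dict('records')
print(result)  # Customer B appears with record_id NaN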
CodePudding user response:
You could use pandas.read_csv() to read your CSV data into a dataframe, and then merge that with the records on the user_id value:
import pandas as pd

users = pd.read_csv('csv file')  # path to your CSV
records = pd.DataFrame('result of salesforce query')  # placeholder for your Salesforce data

result = users.drop('record_id', axis=1).merge(records, on='user_id')
If you want to keep the users which have no matching value in records, change the merge to:
merge(records, on='user_id', how='left')
Note: it may be possible to execute your Salesforce query directly into a dataframe; see for example this Q&A.
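For example, with the simple_salesforce library, a common pattern looks roughly like this; the credentials, object name (Record__c), and field name (User_Id__c) are hypothetical and depend on your Salesforce schema:
import pandas as pd
from simple_salesforce import Salesforce

# Hypothetical credentials -- replace with your own
sf = Salesforce(username='user@example.com', password='...', security_token='...')

# Hypothetical SOQL query; query_all returns a dict with a 'records' key
query_result = sf.query_all("SELECT User_Id__c, Id FROM Record__c")

# Each row carries an 'attributes' metadata entry, which we drop
records = pd.DataFrame(query_result['records']).drop(columns='attributes')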