I have a very large (~100k) list of dictionaries:
[{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]
Given a token ID (e.g. 1989), how can I find the corresponding score in an efficient way? I have to do this multiple times for each list (I have several of these large lists, and for each one I have several token IDs).
I'm currently iterating through each dictionary in the list and checking whether its token ID matches my input ID; if it does, I take the score. But it's quite slow.
CodePudding user response:
Since you have to search multiple times, you could build a single dictionary with the token as the key:
a = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]
my_dict = {i['token']: i for i in a}
Building the dict takes one O(n) pass over the list, but every lookup after that is O(1) on average.
This might seem wasteful in terms of memory, but Python does not copy the dictionaries that are already in the list. The new dict only holds references to them. You can confirm that using:
>>> a[0] is my_dict[3805]
True
So you can think of it as creating an alias for each element in the list.
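With the token-keyed dict in place, a lookup could look like the sketch below. The helper name score_for is only for illustration, and returning None for a token that is not in the list is an assumption about what you want in that case.

```python
# Sample data in the same shape as the list in the question (shortened).
a = [
    {'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'},
    {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'},
    {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'},
]

# Build the index once per list: token -> original dictionary (no copies are made).
my_dict = {i['token']: i for i in a}

def score_for(token_id, index):
    """Hypothetical helper: return the score for token_id, or None if it is absent."""
    entry = index.get(token_id)
    return None if entry is None else entry['score']

print(score_for(1989, my_dict))  # score of the 'week' entry
print(score_for(42, my_dict))    # token 42 is not in the list, so this prints None
```

If the same token can appear more than once in a list, the comprehension keeps only the last entry for it, so check that this matches what you need.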
CodePudding user response:
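Using pandas might be more convenient for large datasets, although a boolean-mask filter like the one below still compares the whole token column on every lookup.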
An example for finding the score with the token 3805:
import pandas as pd
source_list = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]
df = pd.DataFrame(source_list)
result = df[df.token == 3805]
print(result.score.values[0])
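If you go the pandas route and need many lookups per list, one option is to index by token once, so that each lookup goes through the index instead of a full-column comparison. This is a minimal sketch, assuming token values are unique within a list; the name scores is mine, not from the question.

```python
import pandas as pd

source_list = [
    {'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'},
    {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'},
    {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'},
]

# Build a Series of scores keyed by token; this is done once per list.
scores = pd.DataFrame(source_list).set_index('token')['score']

print(scores.loc[3805])  # label-based lookup; raises KeyError if the token is absent
print(scores.get(42))    # Series.get returns its default (None) for a missing token
```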
CodePudding user response:
If your list of dictionaries is:
l = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]
And the values of token you are interested in are, for example:
token_values = [1989, 30897, 98762]
Then build a dictionary as follows:
wanted = set(token_values)  # set membership tests are O(1) on average
d = {the_dict['token']: the_dict['score']
     for the_dict in l if the_dict['token'] in wanted}
This builds a minimal dictionary containing only the tokens you are interested in, each mapped to its score.
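Put together as a runnable sketch, it could look like this. The comprehension filter uses Python's if keyword, and converting the requested IDs to a set (the name wanted is mine) is an optional tweak that avoids scanning token_values for every element of the list.

```python
l = [
    {'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'},
    {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'},
    {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'},
]
token_values = [1989, 30897, 98762]

# A set makes each "in" check O(1) on average instead of O(len(token_values)).
wanted = set(token_values)

# One pass over the list, keeping only the requested tokens.
d = {the_dict['token']: the_dict['score']
     for the_dict in l if the_dict['token'] in wanted}

# Requested tokens that never occur in the list are simply missing from d,
# so use .get() when reading the results back.
for token in token_values:
    print(token, d.get(token))
```

Here only 1989 occurs in the sample list, so the other two requested IDs print None.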