How to efficiently find a dictionary value based on another value in a list of dictionaries

I have a very large (~100k) list of dictionaries:

[{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]

Given a token ID (e.g. 1989), how can I find the corresponding score efficiently? I have to do this multiple times for each list (I have several of these large lists, and for each one I have several token IDs).

I'm currently iterating through each dictionary in the list and checking if the ID matches my input ID, and if it does I'm getting the score. But it's quite slow.

CodePudding user response：

Since you have to search multiple times, you could build a single dictionary with the token as the key:

a = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]

my_dict = {i['token']: i for i in a}

Building the dict takes one O(n) pass over the list, but every search after that is O(1) on average.

This might seem wasteful in terms of memory, but Python does not copy the inner dictionaries here. The new dict only holds references to the dictionaries that already exist in the list. You can confirm that using:

>>> a[0] is my_dict[3805]
True

So you can think of it as creating an alias for each element in the list.
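
As a rough sketch of what the repeated lookups could then look like: the helper name `score_for` and its `default` argument are made up for illustration, and this assumes each token appears at most once per list (with duplicates, the last entry would win).

```python
a = [
    {'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'},
    {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'},
    {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'},
]

# Build the index once per list: a single O(n) pass.
my_dict = {i['token']: i for i in a}


def score_for(token, index, default=None):
    """Hypothetical helper: return the score for a token, or `default` if the token is absent."""
    entry = index.get(token)
    return default if entry is None else entry['score']


# Each lookup is now a hash lookup instead of a scan of the list.
print(score_for(1989, my_dict))
print(score_for(30897, my_dict))  # token not in the list -> None
```

Using `.get` avoids a `KeyError` for token IDs that are not present in a given list.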

CodePudding user response：

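Using pandas might be more convenient for large datasets. Note that a boolean mask like the one below still compares every row on each lookup, so it is not a constant-time search.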

An example for finding the score with the token 3805:

import pandas as pd

source_list = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]

df = pd.DataFrame(source_list)
result = df[df.token == 3805]

print(result.score.values[0])

CodePudding user response：

If your list of dictionaries is:

l = [{'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'}, {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'}, {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'}]

And the values of token you are interested in are, for example:

token_values = [1989, 30897, 98762]

Then build a dictionary as follows:

d = {the_dict['token']: the_dict['score']
     for the_dict in l if the_dict['token'] in token_values}

This will build a minimal dictionary containing just the tokens you are interested in (those that actually occur in the list), with their corresponding scores.
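
A runnable sketch of this approach, with one change that is my assumption rather than part of the answer above: `token_values` is stored as a set, so each membership test is O(1) on average instead of a scan of a list. Tokens that never occur in `l` (30897 and 98762 here) are simply absent from `d`, so `d.get` is used for the lookups.

```python
l = [
    {'sequence': 'read the rest of this note', 'score': 0.22612378001213074, 'token': 3805, 'token_str': 'note'},
    {'sequence': 'read the rest of this page', 'score': 0.11293990164995193, 'token': 3674, 'token_str': 'page'},
    {'sequence': 'read the rest of this week', 'score': 0.06504543870687485, 'token': 1989, 'token_str': 'week'},
]

# Assumption: a set instead of a list, for fast membership tests.
token_values = {1989, 30897, 98762}

# One pass over the list, keeping only the tokens of interest.
d = {the_dict['token']: the_dict['score']
     for the_dict in l if the_dict['token'] in token_values}

# Look up each requested token; missing tokens yield None.
for token in sorted(token_values):
    print(token, d.get(token))
```

This keeps the dictionary small, at the cost of rebuilding it whenever the set of tokens of interest changes.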