Updating dictionaries based on huge list of dicts


I have a huge list of dictionaries (around 350k elements):

lst = [
{'data': 'xxx', 'id': 1456},
{'data': 'yyy', 'id': 24234},
{'data': 'zzz', 'id': 3222},
{'data': 'foo', 'id': 1789},
]

On the other hand, I receive dictionaries (around 550k) one by one with a missing value (not every dict is missing this value), which I need to update from:

example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': None}

To:

example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': 'xxx'}

And I need to take each dict, search within the list for a matching 'id', and update the 'data'. Doing it this way takes ages to process:

if example_dict['data'] is None:
    for row in lst:
        if row['id'] == example_dict['id']:
            example_dict['data'] = row['data']

Is there a way to build structured, chunked data divided into e.g. 10k-value chunks and tell the incoming dict which chunk to search for the 'id' in? Or any other way to optimize that? Any help is appreciated, take care.

CodePudding user response:

Use a dict instead of searching linearly through the list.

The first important optimization is to remove that linear search through lst by building a dict indexed on id, pointing to the rows.

For example, this will be a lot faster than your code, if you have enough RAM to hold all the rows in memory:

row_dict = {row['id']: row for row in lst}  # build the id -> row index once

if example_dict['data'] is None:
    if example_dict['id'] in row_dict:
        example_dict['data'] = row_dict[example_dict['id']]['data']

This improvement will be relevant for you whether you process the rows in chunks of 10k or all at once, since dictionary lookup time is constant on average instead of linear in the size of lst.
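Applied to the whole stream, a minimal sketch might look like this; the incoming name is an assumption standing in for however you receive the 550k dicts:

row_dict = {row['id']: row for row in lst}  # built once, reused for every dict

for example_dict in incoming:
    if example_dict['data'] is None:
        row = row_dict.get(example_dict['id'])
        if row is not None:
            example_dict['data'] = row['data']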

Make your own chunking process

Next you ask "Is there a way to build a structured chunked data divided...". Yes, absolutely. If the data is too big to fit in memory, write a first-pass function that divides the input into several temporary files based on id. The buckets could be based on the last two digits of the id if order is irrelevant, or on ranges of ids if you prefer. Do that for both the list of rows and the dictionaries you receive, then process each list/dict file pair covering the same ids one at a time, with code like the above; see the sketch after the next paragraph.

If you have to preserve the order in which you receive the dictionaries, though, this approach will be more difficult to implement.
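A minimal sketch of that two-pass idea, assuming order doesn't matter; the JSON-lines format, file-name pattern, and bucket count are illustrative choices, not anything fixed by the approach:

import json
import os

NUM_BUCKETS = 100  # bucket by the last two digits of the id

def partition(records, prefix):
    # First pass: append each record to the temp file for its bucket.
    files = {}
    try:
        for rec in records:
            b = rec['id'] % NUM_BUCKETS
            if b not in files:
                files[b] = open(f'{prefix}_{b:02d}.jsonl', 'w')
            files[b].write(json.dumps(rec) + '\n')
    finally:
        for f in files.values():
            f.close()

def process_bucket(b):
    # Second pass: rows and incoming dicts can only share an id if they
    # landed in the same bucket, so each lookup dict stays small.
    rows_path = f'rows_{b:02d}.jsonl'
    incoming_path = f'incoming_{b:02d}.jsonl'
    if not os.path.exists(incoming_path):
        return
    row_dict = {}
    if os.path.exists(rows_path):
        with open(rows_path) as f:
            row_dict = {row['id']: row for row in map(json.loads, f)}
    with open(incoming_path) as f:
        for d in map(json.loads, f):
            if d['data'] is None and d['id'] in row_dict:
                d['data'] = row_dict[d['id']]['data']
            yield d

Usage would be partition(lst, 'rows') and partition(incoming_dicts, 'incoming') for a hypothetical incoming_dicts iterable, then consuming process_bucket(b) for each b in range(NUM_BUCKETS).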

CodePudding user response:

Some preprocessing of the lst list might help a lot, e.g. transforming that list of dicts into a dictionary where id is the key.

To be precise, transform lst into a structure like this (keeping the ids as ints so they match example_dict['id']):

lst = {
    1456: 'xxx',
    24234: 'yyy',
    3222: 'zzz',
    ...
}
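One way to build that mapping from the original list, assuming the ids are unique:

lst = {row['id']: row['data'] for row in lst}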

Then, when checking the data attribute in example_dict, look the id up in lst directly as follows:

if example_dict['data'] is None:
    example_dict['data'] = lst.get(example_dict['id'])

It should reduce the time complexity from roughly quadratic, O(n*m) for n rows and m incoming dicts, to linear, O(n + m).

CodePudding user response:

Try creating a hash table (in Python, a dict) from lst to speed up the lookup based on 'id':

lst = [
    {'data': 'xxx', 'id': 1456},
    {'data': 'yyy', 'id': 24234},
    {'data': 'zzz', 'id': 3222},
    {'data': 'foo', 'id': 1789},
]
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': None}

dct = {row['id']: row for row in lst}  # index rows by id for O(1) lookup
if example_dict['data'] is None:
    example_dict['data'] = dct[example_dict['id']]['data']
print(example_dict)

Sample output:

{'key': 'x', 'key2': 'y', 'id': 1456, 'data': 'xxx'}