Accelerate comparing dictionary keys and values to strings in a list in Python

Sorry if this is trivial, I'm still learning. I have a list of dictionaries that looks as follows:

[{'1102': ['00576', '00577', '00578', '00579', '00580', '00581']},
 {'1102': ['00582', '00583', '00584', '00585', '00586', '00587']},
 {'1102': ['00588', '00589', '00590', '00591', '00592', '00593']},
 {'1102': ['00594', '00595', '00596', '00597', '00598', '00599']},
 {'1102': ['00600', '00601', '00602', '00603', '00604', '00605']}
 ...]

It contains ~89000 dictionaries. I also have a list containing 4473208 paths, for example:

['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv',
'/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv',
 ...]

What I want to do is group the paths: for each dictionary, collect every path that is in the folder named by the key and whose file name starts with one of the values listed under that key.

I tried using for loops like this:

from tqdm import tqdm

grpd_cts = []
   
for elem in tqdm(dict_list):
    temp1 = []
    for file in ct_paths:
        for key, val in elem.items():
            if (file[16:20] == key) and (any(x in file[21:26] for x in val)):
                temp1.append(file)

    grpd_cts.append(temp1)

But this takes around 30 hours. Is there a way to make it more efficient? Any itertools function or something?

Thanks a lot!

CodePudding user response:

ct_paths is iterated in full once for every dictionary in dict_list, yet each test only looks at two short slices of each path (file[16:20] and file[21:26]). Pull those slices out once and use them as keys to index the paths in a dictionary.

What complicates the problem is that you want to end up with the original file names, so you need to construct a two-level dictionary whose values are lists of the original paths grouped under those two keys.

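# Index every path once, keyed first by folder (f[16:20]) and then by file prefix (f[21:26]).
ct_path_index = {}
for f in ct_paths:
    ct_path_index.setdefault(f[16:20], {}).setdefault(f[21:26], []).append(f)

grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for key, val in elem.items():
        d2 = ct_path_index.get(key)  # all prefixes seen in this folder, or None
        if d2:
            for v in val:
                v2 = d2.get(v)  # paths whose prefix equals v exactly, or None
                if v2:
                    temp1 += v2
    grpd_cts.append(temp1)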

ct_path_index looks like this, using your data:

{'1102': {'00575': ['/****/**/******_1102/00575***...**0CT.csv',
   '/****/**/******_1102/00575***...**1CT.csv',
   '/****/**/******_1102/00575***...**2CT.csv',
   '/****/**/******_1102/00575***...**3CT.csv',
   '/****/**/******_1102/00575***...**4CT.csv'],
  '00578': ['/****/**/******_1102/00578***...**1CT.csv',
   '/****/**/******_1102/00578***...**2CT.csv',
   '/****/**/******_1102/00578***...**3CT.csv']}}

The use of setdefault (which can be a little hard to understand the first time you see it) is important when building up collections of collections, and is very common in these kinds of cases: it makes sure that the sub-collections are created on demand and then re-used for a given key.
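To make the setdefault behaviour concrete, here is a small sketch that builds the same kind of two-level index twice: once with chained setdefault calls and once with collections.defaultdict, which is a common alternative. The records and file names below are made up for illustration and are not taken from the question's data.

```python
from collections import defaultdict

# Hypothetical (folder, prefix, file name) records, for illustration only.
records = [
    ("1102", "00575", "a.csv"),
    ("1102", "00575", "b.csv"),
    ("1102", "00578", "c.csv"),
]

# setdefault returns the value already stored under the key, or stores the
# given default and returns it, so each sub-collection is created only once.
index = {}
for folder, prefix, name in records:
    index.setdefault(folder, {}).setdefault(prefix, []).append(name)

# The same structure with defaultdict, which creates missing values on access.
index_dd = defaultdict(lambda: defaultdict(list))
for folder, prefix, name in records:
    index_dd[folder][prefix].append(name)

print(index)
```

One difference worth keeping in mind: reading a missing key from a defaultdict inserts it, whereas a plain dict with .get() leaves the index untouched, which is why the lookup code above uses .get().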

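Now ct_paths is scanned only once, to build the index. The grouping loops iterate only over dict_list and the handful of values in each dictionary, and the inner checks are dictionary lookups, which are O(1) on average.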
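As a rough end-to-end sketch of this approach, the function below wraps the indexing and lookup steps so they can be tried on a small sample. The function name group_paths and the sample paths are hypothetical; the paths are only laid out so that the slices [16:20] and [21:26] line up the way they do in the question. It assumes every value is exactly five characters long, so that an exact match on the slice is equivalent to the original substring test.

```python
def group_paths(dict_list, ct_paths):
    """Group paths by the (folder, prefix) pairs listed in each dictionary."""
    # Pass 1: index every path by its two slices (offsets taken from the question).
    index = {}
    for path in ct_paths:
        index.setdefault(path[16:20], {}).setdefault(path[21:26], []).append(path)

    # Pass 2: one dictionary lookup per (key, value) pair instead of a full
    # scan of ct_paths for every dictionary.
    grouped = []
    for elem in dict_list:
        group = []
        for key, values in elem.items():
            by_prefix = index.get(key, {})
            for value in values:
                group.extend(by_prefix.get(value, []))
        grouped.append(group)
    return grouped


# Hypothetical paths, made up so that the slice offsets match the question's.
sample_paths = [
    "/data/ab/series_1102/00576_0CT.csv",
    "/data/ab/series_1102/00576_1CT.csv",
    "/data/ab/series_1102/00583_0CT.csv",
    "/data/ab/series_1103/00576_0CT.csv",
]
sample_dicts = [
    {"1102": ["00576", "00577"]},
    {"1102": ["00582", "00583"]},
    {"1104": ["00576"]},
]

result = group_paths(sample_dicts, sample_paths)
print(result)
```

Note that within each group the paths come out ordered by value and then by their position in ct_paths, which can differ from the ordering the original triple loop produces.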

Other optimizations would include turning the lists in dict_list into sets, which would be worthwhile if you made more than one pass through dict_list.
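A minimal sketch of that set conversion, assuming you later need to ask which groups mention a given folder and prefix. The helper name groups_containing and the shortened sample data are hypothetical, not part of the original code.

```python
# Shortened sample in the same shape as dict_list from the question.
dict_list = [
    {"1102": ["00576", "00577", "00578"]},
    {"1102": ["00582", "00583", "00584"]},
]

# Convert each list of values to a set once, so later `in` checks are hash
# lookups rather than linear scans of a list.
dict_sets = [{key: set(vals) for key, vals in elem.items()} for elem in dict_list]


def groups_containing(folder, prefix):
    """Return the indices of the groups that list `prefix` under `folder`."""
    return [
        i for i, elem in enumerate(dict_sets)
        if prefix in elem.get(folder, ())
    ]


print(groups_containing("1102", "00578"))
```

With only six values per list the gain from sets is small for a single pass; it mainly pays off when the same membership tests are repeated many times.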