Home > Software design >  How to extact elements from dictionary of lists efficiently?
How to extact elements from dictionary of lists efficiently?

Time:09-23

Here is my initial dictionary:

dic = {'key1': [2,3],
       'key2': [5,1],
       'key3': [6,8]}

NB: I am using simple numbers 2,3, etc. for the purpose of the example (list of dfs on my side).

For each key, I would like to extract the first element and obtain the following result:

dic2 = {'key1': 2,
        'key2': 5,
        'key3': 6}

Is it possible to do it without going for a slow for loop ? The dictionary is quite large...

Many thanks in advance for your help.

CodePudding user response:

If you expect to access only a few keys of the dictionary after this conversion, you could write a wrapper; something like:

class ViewFirst:
  def __init__(self, original):
    self.original = original
  def __getitem__(self, key):
    return self.original[key][0]

Another option is to base it on defaultdict; that will let you do things like assigning new values into the dictionary (new or existing keys) while still retrieving other values from the original dictionary:

class DictFromFirsts(collections.defaultdict):
  def __init__(self, original):
    self.original = original
  def __missing__(self, key):
    return self.original[key][0]

CodePudding user response:

A dictionary comprehension is the way to go here:

{k: v[0] for k, v in dic.items()}

Or use operator.itemgetter:

>>> from operator import itemgetter
>>> dict(zip(dic, map(itemgetter(0), dic.values())))
{'key1': 2, 'key2': 5, 'key3': 6}
>>> 

CodePudding user response:

you can try this:

dict(zip(dic.keys(),list(list(zip(*dic.values()))[0])))

Output:

{'key1': 2, 'key2': 5, 'key3': 6}

CodePudding user response:

Let's see how "slow" a for loop really is here. My solution:

dic2 = {}
for k, (dic2[k], _) in dic.items():
    pass

Benchmark with your toy dict:

    600 ns      602 ns      603 ns  U12_Forward_1
   1019 ns     1025 ns     1027 ns  U12_Forward_2
   1347 ns     1350 ns     1355 ns  user1740577
    441 ns      442 ns      443 ns  dont_talk_just_code

Benchmark with a "large" dict of 12k items as you commented:

1412624 ns  1414927 ns  1418089 ns  U12_Forward_1
1687464 ns  1690134 ns  1696759 ns  U12_Forward_2
1961205 ns  1986729 ns  2005884 ns  user1740577
1248901 ns  1260306 ns  1261295 ns  dont_talk_just_code

The above were done on tio.run, which I use for its high speed and speed stability. Sadly it doesn't offer cytoolz needed by Marco's answer, so I couldn't include that. Then @user1740577 accused me of lying because I didn't include it, so here are results from replit.com, where I can (note that if you run it there and don't have a paid account, your times will be slower):

    475 ns      476 ns      484 ns  U12_Forward_1
    804 ns      805 ns      807 ns  U12_Forward_2
   1075 ns     1079 ns     1082 ns  user1740577
    442 ns      444 ns      448 ns  Marco_DG
    360 ns      360 ns      360 ns  dont_talk_just_code

1060461 ns  1061449 ns  1071588 ns  U12_Forward_1
1294079 ns  1330157 ns  1706065 ns  U12_Forward_2
1593082 ns  1594114 ns  1596703 ns  user1740577
1268663 ns  1274264 ns  1286715 ns  Marco_DG
 964445 ns   965971 ns   966333 ns  dont_talk_just_code

Full benchmark code (also at replit:

from timeit import repeat
from functools import partial
from operator import itemgetter
from cytoolz import valmap , first

def U12_Forward_1(dic):
    return {k: v[0] for k, v in dic.items()}

def U12_Forward_2(dic):
    return dict(zip(dic, map(itemgetter(0), dic.values())))

def user1740577(dic):
    return dict(zip(dic.keys(),list(list(zip(*dic.values()))[0])))

def Marco_DG(dic):
    return valmap(first, dic)

def dont_talk_just_code(dic):
    dic2 = {}
    for k, (dic2[k], _) in dic.items():
        pass
    return dic2

funcs = U12_Forward_1, U12_Forward_2, user1740577, Marco_DG, dont_talk_just_code

def bench(dic, number):
    expect = funcs[0](dic)
    for func in funcs:
        result = func(dic)
        print(result == expect, func.__name__)
    print()

    for _ in range(3):
        for func in funcs:
            ts = sorted(repeat(partial(func, dic), number=number))[:3]
            print(*('} ns ' % (t / number * 1e9) for t in ts), func.__name__)
        print()

bench({'key1': [2,3], 'key2': [5,1], 'key3': [6,8]}, 100000)
bench({f'key{i}': [i,42] for i in range(12000)}, 30)

CodePudding user response:

Personally I would go with a library that does that using Cython under the hood: cytoolz

pip3 install cytoolz
from cytoolz import valmap , first

dic = {'key1': [2,3],
       'key2': [5,1],
       'key3': [6,8]}


dic2 = valmap(first, dic)

I will extend @don'ttalkjustcode tests with my function benchmark, cleary his one is the fastest one, with the only caveat of being limited to 2-elements arrays. I'm still trying to figure out how to test the one by @Jiří Baum.

    422 ns      422 ns      422 ns  marco_dg
    462 ns      462 ns      462 ns  U12_Forward_1
    765 ns      766 ns      769 ns  U12_Forward_2
   1076 ns     1081 ns     1088 ns  user1740577
    341 ns      341 ns      341 ns  dont_talk_just_code

.

1206537 ns  1208705 ns  1211105 ns  marco_dg
1009374 ns  1011324 ns  1011989 ns  U12_Forward_1
1232356 ns  1232728 ns  1251990 ns  U12_Forward_2
1380953 ns  1382381 ns  1390140 ns  user1740577
 848863 ns   850010 ns   850450 ns  dont_talk_just_code
  • Related