Collapse a list of (dictionary of list) to a single (dictionary of list)


I have data arriving as dictionaries of lists. In fact, I read in a list of them...

data = [
    {
        'key1': [101, 102, 103],
        'key2': [201, 202, 203],
        'key3': [301, 302, 303],
    },
    {
        'key2': [204],
        'key3': [304, 305],
        'key4': [404, 405, 406],
    },
    {
        'key1': [107, 108],
        'key4': [407],
    },
]

Each dictionary can have different keys.

Each key associates to a list, of variable length.

What I'd like to do is to make a single dictionary, by concatenating the lists that share a key...

desired_result = {
    'key1': [101, 102, 103, 107, 108],
    'key2': [201, 202, 203, 204],
    'key3': [301, 302, 303, 304, 305],
    'key4': [404, 405, 406, 407],
}

Notes:

  • Order does not matter
  • There are hundreds of dictionaries
  • There are dozens of keys per dictionary
  • Totalling hundreds of keys in the result set
  • Each source list contains dozens of items

I can do this with comprehensions, but it feels very clunky, and it's actually very slow (looping through every possible key for every dictionary yields more 'misses' than 'hits')...

{
    key: [
        item
        for d in data
        if key in d
        for item in d[key]
    ]
    for key in set(
        key
        for d in data
        for key in d.keys()
    )
}

# timeit gives 3.2, for this small data set

A shorter, easier to read/maintain option is just to loop through everything. But performance still suffers (possibly due to the large number of calls to extend(), forcing frequent reallocation of memory as the over-provisioned lists fill up? See the sketch after the timing below)...

from collections import defaultdict
result = defaultdict(list)
for d in data:
  for key, val in d.items():
    result[key].extend(val)

# timeit gives 1.7, for this small data set
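One way to test the reallocation hypothesis (a sketch, untimed against the real data) is to collect the per-key list references first, then concatenate each group exactly once with itertools.chain:

from collections import defaultdict
from itertools import chain

# Gather list references per key; no items are copied yet
buckets = defaultdict(list)
for d in data:
  for key, val in d.items():
    buckets[key].append(val)

# One concatenation per key, so each result list is built in a single pass
result = {key: list(chain.from_iterable(lists)) for key, lists in buckets.items()}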

Is there a better way?

  • more 'pythonic'?
  • more concise?
  • more performant?

Alternatively, is there a more applicable data structure for this type of process?

  • I'm sort of making a hash map
  • Where each entry is guaranteed to have multiple 'collisions' and so is always a list

Edit: Timings for the small data set added

  • No timings for real-world data, as I don't have access to it from here (err, oops/sorry...)

CodePudding user response:

Try using a list comprehension with the setdefault and extend methods:

res = {}
[[res.setdefault(key, []).extend(entry[key]) for key in entry] for entry in data]
print(res)
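Note that the nested comprehension is used purely for its side effects (it builds and discards a list of None-filled lists); the equivalent explicit loop is arguably clearer:

res = {}
for entry in data:
    for key in entry:
        res.setdefault(key, []).extend(entry[key])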

To pretty-print the result:

import json
print(json.dumps(res, sort_keys=True, indent=4))

Result:

{
    "key1": [
        101,
        102,
        103,
        107,
        108
    ],
    "key2": [
        201,
        202,
        203,
        204
    ],
    "key3": [
        301,
        302,
        303,
        304,
        305
    ],
    "key4": [
        404,
        405,
        406,
        407
    ]
}

CodePudding user response:

Perhaps you could use a list comprehension and pandas.

I've no idea whether this is a valid answer, or how it performs, but it works, at least for the small set of example data.

import pandas as pd

data = [
    {
        "key1": [101, 102, 103],
        "key2": [201, 202, 203],
        "key3": [301, 302, 303],
    },
    {
        "key2": [204],
        "key3": [304, 305],
        "key4": [404, 405, 406],
    },
    {
        "key1": [107, 108],
        "key4": [407],
    },
]

# One DataFrame per dict: keys become the index, list items spread across columns
dics = [pd.DataFrame.from_dict(el, orient="index") for el in data]

# Stack the frames; missing cells become NaN and are then filled with a sentinel
df = pd.concat(dics).fillna("Empty")

df["key"] = df.index

# Collect each column into a list per key
df = df.groupby("key").agg(list)

# Flatten each row of lists, dropping the sentinel
combined = df.apply(
    lambda row: sorted(
        [item for sublist in row for item in sublist if item != "Empty"]
    ),
    axis=1,
)

print(combined.to_dict())

{'key1': [101, 102.0, 103.0, 107, 108.0], 
 'key2': [201, 202.0, 203.0, 204], 
 'key3': [301, 302.0, 303.0, 304, 305.0], 
 'key4': [404, 405.0, 406.0, 407]}
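Note the mixed int/float values in the output: pd.concat leaves NaN in the missing cells (which upcasts those integer columns to float) before fillna replaces them with the sentinel. If integer output matters, one option (a sketch, reusing the df from above) is to cast while flattening:

combined = df.apply(
    lambda row: sorted(
        int(item) for sublist in row for item in sublist if item != "Empty"
    ),
    axis=1,
)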