Collapse a list of (dictionary of list) to a single (dictionary of list)


I have data arriving as dictionaries of lists. In fact, I read in a list of them...

data = [
    {
        'key1': [101, 102, 103],
        'key2': [201, 202, 203],
        'key3': [301, 302, 303],
    },
    {
        'key2': [204],
        'key3': [304, 305],
        'key4': [404, 405, 406],
    },
    {
        'key1': [107, 108],
        'key4': [407],
    },
]

Each dictionary can have different keys.

Each key associates to a list, of variable length.

What I'd like to do is to make a single dictionary, by concatenating the lists that share a key...

desired_result = {
    'key1': [101, 102, 103, 107, 108],
    'key2': [201, 202, 203, 204],
    'key3': [301, 302, 303, 304, 305],
    'key4': [404, 405, 406, 407],
}

Notes:

  • Order does not matter
  • There are hundreds of dictionaries
  • There are dozens of keys per dictionary
  • Totalling hundreds of keys in the result set
  • Each source list contains dozens of items

I can do this with comprehensions, but it feels very clunky, and it's actually very slow (looping through every possible key for every dictionary yields more 'misses' than 'hits')...

{
    key: [
        item
        for d in data
        if key in d
        for item in d[key]
    ]
    for key in set(
        key
        for d in data
        for key in d.keys()
    )
}

# timeit gives 3.2, for this small data set

A shorter, easier to read/maintain option is just to loop through everything. But performance still suffers (possibly due to the large number of calls to extend(), forcing frequent reallocation of memory as the over-provisioned lists fill up? See the sketch after the timing below)...

from collections import defaultdict
result = defaultdict(list)
for d in data:
  for key, val in d.items():
    result[key].extend(val)

# timeit gives 1.7, for this small data set
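One way to test the reallocation hypothesis (a sketch, untimed against the real data) is to collect the per-key list references first, then concatenate each group exactly once with itertools.chain:

from collections import defaultdict
from itertools import chain

# Gather list references per key; no items are copied yet
buckets = defaultdict(list)
for d in data:
  for key, val in d.items():
    buckets[key].append(val)

# One concatenation per key, so each result list is built in a single pass
result = {key: list(chain.from_iterable(lists)) for key, lists in buckets.items()}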

Is there a better way?

  • more 'pythonic'?
  • more concise?
  • more performant?

Alternatively, is there a more applicable data structure for this type of process?

  • I'm sort of making a hash map
  • Where each entry is guaranteed to have multiple 'collisions' and so is always a list

Edit: Timings for the small data set added

  • No timings for real-world data, as I don't have access to it from here (err, oops/sorry...)

CodePudding user response:

Try using a list comprehension with the setdefault and extend methods:

res = {}
[[res.setdefault(key, []).extend(entry[key]) for key in entry] for entry in data]
print(res)
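Note that the nested comprehension is used purely for its side effects (it builds and discards a list of None-filled lists); the equivalent explicit loop is arguably clearer:

res = {}
for entry in data:
    for key in entry:
        res.setdefault(key, []).extend(entry[key])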

To pretty-print the result:

import json
print(json.dumps(res, sort_keys=True, indent=4))

Result:

{
    "key1": [
        101,
        102,
        103,
        107,
        108
    ],
    "key2": [
        201,
        202,
        203,
        204
    ],
    "key3": [
        301,
        302,
        303,
        304,
        305
    ],
    "key4": [
        404,
        405,
        406,
        407
    ]
}

CodePudding user response:

Perhaps you could use a list comprehension and pandas.

I've no idea whether this is a valid answer, or how it performs, but it works, at least for the small set of example data.

import pandas as pd

data = [
    {
        "key1": [101, 102, 103],
        "key2": [201, 202, 203],
        "key3": [301, 302, 303],
    },
    {
        "key2": [204],
        "key3": [304, 305],
        "key4": [404, 405, 406],
    },
    {
        "key1": [107, 108],
        "key4": [407],
    },
]

# One DataFrame per dict: keys become the index, list items spread across columns
dics = [pd.DataFrame.from_dict(el, orient="index") for el in data]

# Stack the frames; missing cells become NaN and are then filled with a sentinel
df = pd.concat(dics).fillna("Empty")

df["key"] = df.index

# Collect each column into a list per key
df = df.groupby("key").agg(list)

# Flatten each row of lists, dropping the sentinel
combined = df.apply(
    lambda row: sorted(
        [item for sublist in row for item in sublist if item != "Empty"]
    ),
    axis=1,
)

print(combined.to_dict())

{'key1': [101, 102.0, 103.0, 107, 108.0], 
 'key2': [201, 202.0, 203.0, 204], 
 'key3': [301, 302.0, 303.0, 304, 305.0], 
 'key4': [404, 405.0, 406.0, 407]}
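Note the mixed int/float values in the output: pd.concat leaves NaN in the missing cells (which upcasts those integer columns to float) before fillna replaces them with the sentinel. If integer output matters, one option (a sketch, reusing the df from above) is to cast while flattening:

combined = df.apply(
    lambda row: sorted(
        int(item) for sublist in row for item in sublist if item != "Empty"
    ),
    axis=1,
)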