I have data arriving as dictionaries of lists. In fact, I read in a list of them...
data = [
    {
        'key1': [101, 102, 103],
        'key2': [201, 202, 203],
        'key3': [301, 302, 303],
    },
    {
        'key2': [204],
        'key3': [304, 305],
        'key4': [404, 405, 406],
    },
    {
        'key1': [107, 108],
        'key4': [407],
    },
]
Each dictionary can have different keys.
Each key maps to a list of variable length.
What I'd like to do is to make a single dictionary, by concatenating the lists that share a key...
desired_result = {
    'key1': [101, 102, 103, 107, 108],
    'key2': [201, 202, 203, 204],
    'key3': [301, 302, 303, 304, 305],
    'key4': [404, 405, 406, 407],
}
Notes:
- Order does not matter
- There are hundreds of dictionaries
- There are dozens of keys per dictionary
- Totalling hundreds of keys in the result set
- Each source list contains dozens of items (a synthetic generator at roughly this scale is sketched just below)
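For timing beyond this toy sample, a data set at roughly the stated scale can be generated synthetically; make_data and its default sizes are placeholders of mine, not the real data:
import random

def make_data(n_dicts=300, n_keys=500, keys_per_dict=30, items_per_list=40):
    # Hypothetical generator matching the stated scale: hundreds of dicts,
    # dozens of keys each, drawn from a pool of hundreds of possible keys.
    keys = [f'key{i}' for i in range(n_keys)]
    return [
        {
            key: [random.randrange(1000) for _ in range(items_per_list)]
            for key in random.sample(keys, keys_per_dict)
        }
        for _ in range(n_dicts)
    ]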
I can do this with comprehensions, but it feels very clunky, and it's actually very slow (looping through all the possible keys for every dictionary yields more 'misses' than 'hits', and re-scans the whole data set once per key)...
{
    key: [
        item
        for d in data
        if key in d
        for item in d[key]
    ]
    for key in set(
        key
        for d in data
        for key in d.keys()
    )
}
# timeit gives 3.2, for this small data set
A shorter, easier-to-read/maintain option is just to loop through everything. But performance still sucks (possibly due to the large number of calls to extend(), forcing frequent reallocation of memory as the over-provisioned lists fill up?)...
from collections import defaultdict

result = defaultdict(list)
for d in data:
    for key, val in d.items():
        result[key].extend(val)
# timeit gives 1.7, for this small data set
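For reference, a minimal timeit harness along these lines is how such numbers can be gathered (the number value here is illustrative, not the one actually used):
import timeit
from collections import defaultdict

stmt = '''
result = defaultdict(list)
for d in data:
    for key, val in d.items():
        result[key].extend(val)
'''

# globals=globals() lets the timed statement see `data` and `defaultdict`.
print(timeit.timeit(stmt, globals=globals(), number=100_000))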
Is there a better way?
- more 'pythonic'?
- more concise?
- more performant?
Alternatively, is there a more applicable data structure for this type of process?
- I'm sort of making a hash map
- Where each entry is guaranteed to have multiple collisions and so always be a list
Edit: Timing for small data set added
- No timings for real-world data, as I don't have access to it from here (err, oops/sorry...)
CodePudding user response:
Try using a list comprehension with the setdefault and extend methods:
res = {}
[[res.setdefault(key, []).extend(entry[key]) for key in entry] for entry in data]
print(res)
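The nested comprehension is run purely for its side effects on res (it builds and discards a list of lists of None, since extend returns None). Written as plain loops, the same setdefault approach reads:
res = {}
for entry in data:
    for key in entry:
        # setdefault returns the existing list for key, inserting [] first if absent
        res.setdefault(key, []).extend(entry[key])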
Pretty-print:
import json
print(json.dumps(res, sort_keys=True, indent=4))
Result:
{
    "key1": [
        101,
        102,
        103,
        107,
        108
    ],
    "key2": [
        201,
        202,
        203,
        204
    ],
    "key3": [
        301,
        302,
        303,
        304,
        305
    ],
    "key4": [
        404,
        405,
        406,
        407
    ]
}
CodePudding user response:
Perhaps you could use a list comprehension and pandas.
No idea if this is a valid answer, or how it performs, but it works, for the small set of example data anyway.
import pandas as pd
data = [
    {
        "key1": [101, 102, 103],
        "key2": [201, 202, 203],
        "key3": [301, 302, 303],
    },
    {
        "key2": [204],
        "key3": [304, 305],
        "key4": [404, 405, 406],
    },
    {
        "key1": [107, 108],
        "key4": [407],
    },
]
dfs = [pd.DataFrame.from_dict(el, orient="index") for el in data]
# `combined_df` rather than `all`, to avoid shadowing the built-in all().
combined_df = pd.concat(dfs).fillna("Empty")
combined_df["key"] = combined_df.index
combined_df = combined_df.groupby("key").agg(list)
combined = combined_df.apply(
    lambda row: sorted(
        # Flatten the per-column lists and drop the "Empty" padding values.
        [item for sublist in row for item in sublist if item != "Empty"]
    ),
    axis=1,
)
print(combined.to_dict())
{'key1': [101, 102.0, 103.0, 107, 108.0],
'key2': [201, 202.0, 203.0, 204],
'key3': [301, 302.0, 303.0, 304, 305.0],
'key4': [404, 405.0, 406.0, 407]}
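The stray .0 values come from the NaN padding: fillna upcasts the ragged columns to float before the sentinel is substituted. A sketch of an alternative (my own, only checked against this small sample): stack() drops NaN cells by default, so no "Empty" sentinel or post-hoc filter is needed.
import pandas as pd

frames = [pd.DataFrame.from_dict(el, orient="index") for el in data]
# stack() silently discards the NaN cells padding the shorter rows;
# note the NaN-padded columns are still float, so the .0 artefacts remain.
combined = pd.concat(frames).stack().groupby(level=0).agg(list).to_dict()
print(combined)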