Groupby a list of dicts with multiple keys doesn't work even if I use sorted before

Time:06-09

I have a list of dictionaries like this:

a = [
    {'user_id':'111','clean_label':'VIR SEPA'},
    {'user_id':'112','clean_label':'VIR SEPA'},
    {'user_id':'111','clean_label':'VIR SEPA'},
]

and I want this:

a = [
    [
        {'user_id':'111','clean_label':'VIR SEPA'},
        {'user_id':'111','clean_label':'VIR SEPA'}
    ],
    [
        {'user_id':'112','clean_label':'VIR SEPA'}
    ]
]

I tried sorted together with groupby from itertools, like this:

sorted(a,key=lambda x: (x['user_id'],x['clean_label']))
[ [tr for tr in tr_per_user_id_clean_label] for key, tr_per_user_id_clean_label in itertools.groupby(a, key=lambda x: (x['user_id'], x['clean_label'])) ]

but I get this:

[[{'user_id': '111', 'clean_label': 'VIR SEPA'}],
 [{'user_id': '112', 'clean_label': 'VIR SEPA'}],
 [{'user_id': '111', 'clean_label': 'VIR SEPA'}]]

Can someone help me?

Edit: when I sort a, I get:

[{'user_id': '111', 'clean_label': 'VIR SEPA'},
 {'user_id': '111', 'clean_label': 'VIR SEPA'},
 {'user_id': '112', 'clean_label': 'VIR SEPA'}]

CodePudding user response:

sorted() returns a new list; it does not reorder the existing list in place, and your code discards its result. Either sort in place with a.sort(key=...) or pass the sorted list straight to groupby: groupby(sorted(a, key=...), key=...).
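As a minimal sketch of the second option, using the sample list from the question (the helper name `key` and the variable `grouped` are only illustrative):

```python
from itertools import groupby

a = [
    {'user_id': '111', 'clean_label': 'VIR SEPA'},
    {'user_id': '112', 'clean_label': 'VIR SEPA'},
    {'user_id': '111', 'clean_label': 'VIR SEPA'},
]

def key(d):
    # The same key must be used for sorting and for grouping.
    return (d['user_id'], d['clean_label'])

# groupby only merges adjacent items, so it has to see the sorted list,
# not the original one.
grouped = [list(group) for _, group in groupby(sorted(a, key=key), key=key)]
```

Because groupby only groups consecutive items with equal keys, feeding it the unsorted list splits the two '111' entries into separate groups, which is what the question observed.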

Although, why bother sorting at all? You could use a dict as an accumulator, like in mozway's answer.

CodePudding user response:

itertools.groupby is not really the ideal tool for this.

You can achieve your goal with O(n) complexity using a defaultdict (vs O(n log n) with groupby as you need to sort):

from collections import defaultdict

dd = defaultdict(list)

for d in a:
    dd[(d['user_id'], d['clean_label'])].append(d)
    
out = list(dd.values())

Alternative with setdefault:

dd = {}

for d in a:
    dd.setdefault((d['user_id'], d['clean_label']), []).append(d)
    
out = list(dd.values())

Output:

[[{'user_id': '111', 'clean_label': 'VIR SEPA'},
  {'user_id': '111', 'clean_label': 'VIR SEPA'}],
 [{'user_id': '112', 'clean_label': 'VIR SEPA'}]]

If the output needs to be sorted by user_id (numerically), then by clean_label:

out = sorted(dd.values(),
             key=lambda x: (int(x[0]['user_id']), x[0]['clean_label']))
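If the grouping keys vary, the accumulator approach could be wrapped in a small helper. This is only a sketch: the function name `group_by_keys` is hypothetical, and it assumes every dict contains all the requested keys and that their values are hashable.

```python
from collections import defaultdict
from operator import itemgetter

def group_by_keys(rows, *keys):
    """Group dicts by the values of `keys`, keeping first-seen group order."""
    # itemgetter returns a tuple for two or more keys, a single value for one.
    get_key = itemgetter(*keys)
    groups = defaultdict(list)
    for row in rows:
        groups[get_key(row)].append(row)
    return list(groups.values())

a = [
    {'user_id': '111', 'clean_label': 'VIR SEPA'},
    {'user_id': '112', 'clean_label': 'VIR SEPA'},
    {'user_id': '111', 'clean_label': 'VIR SEPA'},
]

out = group_by_keys(a, 'user_id', 'clean_label')
```

Since dicts preserve insertion order (Python 3.7+), the groups come back in the order each key was first seen, with no sorting required.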