I have an example df:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['Josh', 'Paul', 'Ivy', 'Mark'],
                   'orderId': [1, 2, 3, 4],
                   'purchases': [['sofa', 'sofa', 'chair'],
                                 ['chair', 'sofa'],
                                 ['sofa', 'chair'],
                                 ['sofa', 'chair', 'chair']]})
All 4 people bought the same items, a sofa and a chair, just in different quantities. In terms of distinct products there is only one combination; think of it as set(purchases).
I want to answer how many times each combination of purchases has been bought. Here we know the answer is 4, since all 4 people bought the same set of items.
So I thought of this as pseudocode where I group by the set of each purchases value:
df = df.groupby(set('purchases')).agg({'orderId':pd.Series.nunique})
But I get an error which is expected:
TypeError: 'set' object is not callable
I am wondering what the best way is to group by the set of a value rather than the actual value, which in this case is a list.
When I tried simply grouping by the purchases
df = df.groupby('purchases').agg({'orderId':pd.Series.nunique})
I get:
TypeError: unhashable type: 'list'
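(For reference, the unhashability comes from Python itself, not pandas: groupby keys, like dict keys, must be hashable. A minimal check, using made-up items, shows that tuples are hashable while lists are not:)

```python
# Tuples are immutable, hence hashable; lists are mutable, hence not.
t = hash(('sofa', 'chair'))   # works

try:
    hash(['sofa', 'chair'])   # raises TypeError
except TypeError as err:
    print(err)                # unhashable type: 'list'
```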
I tried changing the lists to tuples:
df = pd.DataFrame({'name':['Josh', 'Paul','Ivy','Mark'],
'orderId':[1,2,3,4],
'purchases':[('sofa','sofa','chair'),
('chair','sofa'),
('sofa','chair'),
('sofa','chair','chair')]})
And then
df = df.groupby('purchases').agg({'purchases':lambda x:{y for y in x}}) # or set(x)
But this gives
purchases
purchases
(chair, sofa) {(chair, sofa)}
(sofa, chair) {(sofa, chair)}
(sofa, chair, chair) {(sofa, chair, chair)}
(sofa, sofa, chair) {(sofa, sofa, chair)}
There is still a tuple inside the set, since the grouping matches identical tuples rather than looking inside each tuple?
And I tried:
df['purchases_unique'] = df['purchases'].apply(lambda x: set(x))
df['# of times bought'] = df.apply(lambda x: x.value_counts())
But I get :
TypeError: unhashable type: 'set'
while the Jupyter notebook still provides an answer with a log message:
Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
File "pandas\_libs\hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'set'
{sofa, chair} 4
So in summary, I am looking for an answer to how I could assign the value 4 to each of the rows, so that the result would look like this:
name orderId purchases # of times bought
Josh 1 (sofa, sofa, chair) 4
Paul 2 (chair, sofa) 4
Ivy 3 (sofa, chair) 4
Mark 4 (sofa, chair, chair) 4
And why doesn't pandas allow grouping by a set, if Python is capable of evaluating {'chair', 'sofa'} == {'sofa', 'chair'}?
CodePudding user response:
Use:
df["times_bought"] = df.groupby(df["purchases"].apply(frozenset))["purchases"].transform("count")
print(df)
Output
name orderId purchases times_bought
0 Josh 1 [sofa, sofa, chair] 4
1 Paul 2 [chair, sofa] 4
2 Ivy 3 [sofa, chair] 4
3 Mark 4 [sofa, chair, chair] 4
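As a side note, transform("count") is what keeps the result aligned with the original rows: unlike agg, which returns one value per group, transform broadcasts each group's count back to every member of the group. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])
key = pd.Series(['a', 'a', 'b', 'a'])

# agg-style count: one value per group
print(s.groupby(key).count().tolist())             # [3, 1]

# transform: the group's count broadcast back to each row
print(s.groupby(key).transform('count').tolist())  # [3, 3, 1, 3]
```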
The expression:
df["purchases"].apply(frozenset)
converts each list in purchases to a frozenset:
0 (chair, sofa)
1 (chair, sofa)
2 (chair, sofa)
3 (chair, sofa)
From the documentation (emphasis mine):
The frozenset type is immutable and hashable — its contents cannot be altered after it is created; it can therefore be used as a dictionary key or as an element of another set.
Given that the elements after the .apply are immutable and hashable, they can be used in a DataFrame.groupby.
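This also answers the question of why grouping by a plain set fails: equality between sets works, but set is mutable and therefore unhashable, while frozenset supports both. A quick check:

```python
a = frozenset(['sofa', 'sofa', 'chair'])
b = frozenset(['chair', 'sofa'])

# equal frozensets hash equally, so they land in the same group
assert a == b
assert hash(a) == hash(b)

# a plain set compares equal too, but cannot be hashed
assert {'sofa', 'chair'} == {'chair', 'sofa'}
try:
    hash({'sofa', 'chair'})
except TypeError as err:
    print(err)  # unhashable type: 'set'
```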
Alternatives
In the end, what you need is to map every equivalent purchases list to the same identifier, given the constraints of your problem. So you could directly replicate the hash function of frozenset, as below:
def _hash(lst):
    import sys
    uniques = set(lst)
    # https://stackoverflow.com/questions/20832279/python-frozenset-hashing-algorithm-implementation
    MAX = sys.maxsize
    MASK = 2 * MAX + 1
    n = len(uniques)
    h = 1927868237 * (n + 1)
    h &= MASK
    for x in uniques:
        hx = hash(x)
        h ^= (hx ^ (hx << 16) ^ 89869747) * 3644798167
        h &= MASK
    h = h * 69069 + 907133923
    h &= MASK
    if h > MAX:
        h -= MASK + 1
    if h == -1:
        h = 590923713
    return h
df["times_bought"] = df.groupby(df["purchases"].apply(_hash))["purchases"].transform("count")
print(df)
Output
name orderId purchases times_bought
0 Josh 1 [sofa, sofa, chair] 4
1 Paul 2 [chair, sofa] 4
2 Ivy 3 [sofa, chair] 4
3 Mark 4 [sofa, chair, chair] 4
A second alternative would be to use (as an argument for groupby):
df["purchases"].apply(lambda x: tuple(sorted(set(x))))
This finds the unique elements, sorts them, and finally converts them to a hashable representation (a tuple).
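Put together, a minimal sketch of this second alternative, using the question's DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Josh', 'Paul', 'Ivy', 'Mark'],
                   'orderId': [1, 2, 3, 4],
                   'purchases': [['sofa', 'sofa', 'chair'],
                                 ['chair', 'sofa'],
                                 ['sofa', 'chair'],
                                 ['sofa', 'chair', 'chair']]})

# sorted tuple of unique items: a hashable, order-insensitive key
key = df['purchases'].apply(lambda x: tuple(sorted(set(x))))
df['times_bought'] = df.groupby(key)['purchases'].transform('count')
print(df['times_bought'].tolist())  # [4, 4, 4, 4]
```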