I have an example df:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['Josh', 'Paul', 'Ivy', 'Mark'],
                   'orderId': [1, 2, 3, 4],
                   'purchases': [['sofa', 'sofa', 'chair'],
                                 ['chair', 'sofa'],
                                 ['sofa', 'chair'],
                                 ['sofa', 'chair', 'chair']]})
All 4 people bought the same items, a sofa and a chair, just in different quantities. In terms of distinct products there is only one combination; think of it as set(purchases).
I want to answer how many times each combination of purchases has been bought. Here we know the answer is 4, since all 4 people bought the same set of items.
So I thought of this as pseudocode where I group by the set of each purchases value:
df = df.groupby(set('purchases')).agg({'orderId':pd.Series.nunique})
But I get an error which is expected:
TypeError: 'set' object is not callable
I am wondering what the best way is to group by the set of a value rather than the actual value, which in this case is a list.
When I tried simply grouping by the purchases
df = df.groupby('purchases').agg({'orderId':pd.Series.nunique})
I get:
TypeError: unhashable type: 'list'
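(For reference, the unhashability comes from Python itself, not pandas: groupby keys, like dict keys, must be hashable. A minimal check, using made-up items, shows that tuples are hashable while lists are not:)

```python
# Tuples are immutable, hence hashable; lists are mutable, hence not.
t = hash(('sofa', 'chair'))   # works

try:
    hash(['sofa', 'chair'])   # raises TypeError
except TypeError as err:
    print(err)                # unhashable type: 'list'
```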
I tried changing the lists to tuples:
df = pd.DataFrame({'name':['Josh', 'Paul','Ivy','Mark'],
'orderId':[1,2,3,4],
'purchases':[('sofa','sofa','chair'),
('chair','sofa'),
('sofa','chair'),
('sofa','chair','chair')]})
And then
df = df.groupby('purchases').agg({'purchases':lambda x:{y for y in x}}) # or set(x)
But this gives
purchases
purchases
(chair, sofa) {(chair, sofa)}
(sofa, chair) {(sofa, chair)}
(sofa, chair, chair) {(sofa, chair, chair)}
(sofa, sofa, chair) {(sofa, sofa, chair)}
There is still a tuple inside the set, since the grouping matches identical tuples rather than looking inside each tuple?
And I tried:
df['purchases_unique'] = df['purchases'].apply(lambda x: set(x))
df['# of times bought'] = df.apply(lambda x: x.value_counts())
But I get :
TypeError: unhashable type: 'set'
while the Jupyter notebook still provides an answer with a log message:
Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
File "pandas\_libs\hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'set'
{sofa, chair} 4
So in summary, I am looking for an answer to how I could assign the value 4 to each of the rows, so that the result would look like this:
name orderId purchases # of times bought
Josh 1 (sofa, sofa, chair) 4
Paul 2 (chair, sofa) 4
Ivy 3 (sofa, chair) 4
Mark 4 (sofa, chair, chair) 4
And why doesn't pandas allow grouping by a set, if Python is capable of evaluating {'chair', 'sofa'} == {'sofa', 'chair'}?
CodePudding user response:
Use:
df["times_bought"] = df.groupby(df["purchases"].apply(frozenset))["purchases"].transform("count")
print(df)
Output
name orderId purchases times_bought
0 Josh 1 [sofa, sofa, chair] 4
1 Paul 2 [chair, sofa] 4
2 Ivy 3 [sofa, chair] 4
3 Mark 4 [sofa, chair, chair] 4
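As a side note, transform("count") is what keeps the result aligned with the original rows: unlike agg, which returns one value per group, transform broadcasts each group's count back to every member of the group. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])
key = pd.Series(['a', 'a', 'b', 'a'])

# agg-style count: one value per group
print(s.groupby(key).count().tolist())             # [3, 1]

# transform: the group's count broadcast back to each row
print(s.groupby(key).transform('count').tolist())  # [3, 3, 1, 3]
```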
The expression:
df["purchases"].apply(frozenset)
converts each list in purchases to a frozenset:
0 (chair, sofa)
1 (chair, sofa)
2 (chair, sofa)
3 (chair, sofa)
From the documentation (emphasis mine):
The frozenset type is immutable and hashable — its contents cannot be altered after it is created; it can therefore be used as a dictionary key or as an element of another set.
Given that the elements after the .apply are immutable and hashable, they can be used in a DataFrame.groupby.
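This also answers the question of why grouping by a plain set fails: equality between sets works, but set is mutable and therefore unhashable, while frozenset supports both. A quick check:

```python
a = frozenset(['sofa', 'sofa', 'chair'])
b = frozenset(['chair', 'sofa'])

# equal frozensets hash equally, so they land in the same group
assert a == b
assert hash(a) == hash(b)

# a plain set compares equal too, but cannot be hashed
assert {'sofa', 'chair'} == {'chair', 'sofa'}
try:
    hash({'sofa', 'chair'})
except TypeError as err:
    print(err)  # unhashable type: 'set'
```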
Alternatives
In the end, what you need is to map every equivalent purchases list to the same identifier, given the constraints of your problem. So you could directly replicate the hash function of frozenset, as below:
def _hash(lst):
    import sys
    uniques = set(lst)
    # https://stackoverflow.com/questions/20832279/python-frozenset-hashing-algorithm-implementation
    MAX = sys.maxsize
    MASK = 2 * MAX + 1
    n = len(uniques)
    h = 1927868237 * (n + 1)
    h &= MASK
    for x in uniques:
        hx = hash(x)
        h ^= (hx ^ (hx << 16) ^ 89869747) * 3644798167
        h &= MASK
    h = h * 69069 + 907133923
    h &= MASK
    if h > MAX:
        h -= MASK + 1
    if h == -1:
        h = 590923713
    return h
df["times_bought"] = df.groupby(df["purchases"].apply(_hash))["purchases"].transform("count")
print(df)
Output
name orderId purchases times_bought
0 Josh 1 [sofa, sofa, chair] 4
1 Paul 2 [chair, sofa] 4
2 Ivy 3 [sofa, chair] 4
3 Mark 4 [sofa, chair, chair] 4
A second alternative would be to use (as an argument for groupby):
df["purchases"].apply(lambda x: tuple(sorted(set(x))))
This finds the unique elements, sorts them, and finally converts them to a hashable representation (a tuple).
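Put together, a minimal sketch of this second alternative, using the question's DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Josh', 'Paul', 'Ivy', 'Mark'],
                   'orderId': [1, 2, 3, 4],
                   'purchases': [['sofa', 'sofa', 'chair'],
                                 ['chair', 'sofa'],
                                 ['sofa', 'chair'],
                                 ['sofa', 'chair', 'chair']]})

# sorted tuple of unique items: a hashable, order-insensitive key
key = df['purchases'].apply(lambda x: tuple(sorted(set(x))))
df['times_bought'] = df.groupby(key)['purchases'].transform('count')
print(df['times_bought'].tolist())  # [4, 4, 4, 4]
```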