Thank you so much for answering this :) I have a pandas data frame where we have one hot encoded information about whether a customer has a certain product or not
customer revenue P1 P2 P3
Customer1 $1 1 0 0
Customer2 $2 1 1 0
Customer3 $3 0 1 1
Customer4 $4 1 1 1
Customer5 $5 1 0 0
for customer1 revenue is $1 and it has products P1 only, similarly, customer4 has all products
I want to transform this data to show all possible combinations of the products and the count of customers that have those combinations, like so
combinations Count of Customers Sum of revenue
P1 only 2 $6 ---> c1 c5
P2 only
P3 only
P1 P2 P3 1 $4 ---> c4
P1 P2 1 $2 ---> c2
P1 P3
P2 P3 1 $3 ---> c3
What's the best way to achieve this?
I believe The combinations of products can be checked like this
import itertools
li=df.columns.to_list()
del(li[0:2])
list(itertools.combinations(li, 2)) list(itertools.combinations(li, 1)) list(itertools.combinations(li, 3))
CodePudding user response:
The key here is to revert the operation of get_dummies
(or OneHotEncoder
) to get a unique combination of products per customer to be able to group and aggregate:
# Generate labels P1, P2, P3, P1 P2, ...
cols = ['P1', 'P2', 'P3']
labels = [' '.join(c) for i in range(len(cols)) for c in combinations(cols, i 1)]
df1 = df.assign(R=df['revenue'].str.strip('$').astype(int),
P=df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)) \
.groupby('P').agg(**{'Sum of revenue': ('R', 'sum'),
'Count of customers': ('customer', 'nunique')}) \
.assign(Product=lambda x: x.index.map(dict(zip(range(1, len(labels) 1), labels)))) \
.set_index('Product').reindex(labels, fill_value=0).reset_index()
print(df1)
# Output
Product Sum of revenue Count of customers
0 P1 6 2
1 P2 0 0
2 P3 2 1
3 P1 P2 0 0
4 P1 P3 0 0
5 P2 P3 3 1
6 P1 P2 P3 4 1
For each unique combination, you assign a label:
# P1=1, P2=2, P3=4 (power of 2) so P1 P3=1 4=5
>>> dict(zip(range(1, len(labels) 1), labels))
{1: 'P1',
2: 'P2',
3: 'P3',
4: 'P1 P2',
5: 'P1 P3',
6: 'P2 P3',
7: 'P1 P2 P3'}
Detail about reverting the get_dummies
operation:
>>> df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)
0 1 # P1 only
1 3 # P3 only
2 6 # P2 P3
3 7 # P1 P2 P3
4 1 # P1 only
dtype: int64
CodePudding user response:
Since you are already using one-hot encoding, let's try to solve this problem using binary masks:
# Extract the product columns
cols = df.columns[2:]
# Make a bitmask for each column.
# 1 = b001 -> P1
# 2 = b010 -> P2
# 4 = b100 -> P3
bitmask = np.power(2, np.arange(len(cols)))
# Generate the label for each possible combination of products
# 0 = b000 -> Nothing
# 1 = b001 -> P1
# 2 = b010 -> P2
# 3 = b011 -> P1 P2
# ...
# 7 = b111 -> P1 P2 P3
labels = np.array([
" ".join(cols[np.bitwise_and(bitmask, i) != 0])
for i in range(2 ** len(cols))
])
labels[0] = "Nothing"
# Assign each row in the df to a group based on the
# product combination (P1, P1 P2, etc.)
# Using numpy broadcasting to multiply the product list
# with the bitmask:
# df[cols] * bitmask[None, :] -> groups
# P1 P2 P3
# 1 0 0 1 2 4 1*1 0*2 0*4 = 1
# 1 1 0 1 2 4 1*1 1*2 0*4 = 3
# 0 1 1 1 2 4 0*1 1*2 1*4 = 6
# 1 1 1 1 2 4 ... = 7
# 1 0 0 1 2 4 ... = 1
groups = (df[cols].to_numpy() * bitmask[None, :]).sum(axis=1)
# The result
df.groupby(labels[groups]).agg(**{
"Count of Customers": ("customer", "count"),
"Sum of Revenue": ("revenue", "sum")
}).reindex(labels, fill_value=0)
CodePudding user response:
You can do something like this:
In [2692]: from itertools import product
In [2673]: x = df.pivot(['P1', 'P2', 'P3'], 'customer', 'revenue')
In [2661]: ix = pd.MultiIndex.from_tuples(list(product(set(range(2)),repeat = 3)), names=['P1', 'P2', 'P3'])
In [2688]: s = pd.DataFrame(index=ix)
In [2681]: output = s.join(x).sum(1).to_frame('Sum of revenue')
In [2686]: output['Count of Customers'] = s.join(x).count(1)
In [2687]: output
Out[2687]:
Sum of revenue Count of Customers
P1 P2 P3
0 0 0 0.0 0
1 0.0 0
1 0 0.0 0
1 3.0 1
1 0 0 6.0 2
1 0.0 0
1 0 2.0 1
1 4.0 1
CodePudding user response:
my_df = df.groupby(df.columns.tolist()[1:]).agg({'P1':'size', 'revenue':"sum"})
my_df.columns = ['Count of Customers ', 'Sum of revenue ']