pandas count of rows for every combination of columns for one hot encoded data-CodePudding

Thank you so much for answering this :) I have a pandas data frame where we have one hot encoded information about whether a customer has a certain product or not

customer revenue  P1    P2  P3
Customer1   $1     1    0   0
Customer2   $2     1    1   0
Customer3   $3     0    1   1
Customer4   $4     1    1   1
Customer5   $5     1    0   0

for customer1 revenue is $1 and it has products P1 only, similarly, customer4 has all products

I want to transform this data to show all possible combinations of the products and the count of customers that have those combinations, like so

combinations    Count of Customers  Sum of revenue  
P1 only          2                       $6         ---> c1 c5
P2 only         
P3 only         
P1 P2 P3         1                       $4         ---> c4
P1 P2            1                       $2         ---> c2
P1 P3           
P2 P3            1                       $3         ---> c3

What's the best way to achieve this?

I believe The combinations of products can be checked like this

import itertools
li=df.columns.to_list()
del(li[0:2])
list(itertools.combinations(li, 2)) list(itertools.combinations(li, 1)) list(itertools.combinations(li, 3))

CodePudding user response：

The key here is to revert the operation of get_dummies (or OneHotEncoder) to get a unique combination of products per customer to be able to group and aggregate:

# Generate labels P1, P2, P3, P1 P2, ...
cols = ['P1', 'P2', 'P3']
labels = [' '.join(c) for i in range(len(cols)) for c in combinations(cols, i 1)]

df1 = df.assign(R=df['revenue'].str.strip('$').astype(int),
                P=df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)) \
        .groupby('P').agg(**{'Sum of revenue': ('R', 'sum'),
                             'Count of customers': ('customer', 'nunique')}) \
        .assign(Product=lambda x: x.index.map(dict(zip(range(1, len(labels) 1), labels)))) \
        .set_index('Product').reindex(labels, fill_value=0).reset_index()
print(df1)

# Output
    Product  Sum of revenue  Count of customers
0        P1               6                   2
1        P2               0                   0
2        P3               2                   1
3     P1 P2               0                   0
4     P1 P3               0                   0
5     P2 P3               3                   1
6  P1 P2 P3               4                   1

For each unique combination, you assign a label:

# P1=1, P2=2, P3=4 (power of 2) so P1 P3=1 4=5
>>> dict(zip(range(1, len(labels) 1), labels))
{1: 'P1',
 2: 'P2',
 3: 'P3',
 4: 'P1 P2',
 5: 'P1 P3',
 6: 'P2 P3',
 7: 'P1 P2 P3'}

Detail about reverting the get_dummies operation:

>>> df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)
0    1  # P1 only
1    3  # P3 only
2    6  # P2 P3
3    7  # P1 P2 P3
4    1  # P1 only
dtype: int64

CodePudding user response：

Since you are already using one-hot encoding, let's try to solve this problem using binary masks:

# Extract the product columns
cols = df.columns[2:]

# Make a bitmask for each column.
#  1 = b001 -> P1
#  2 = b010 -> P2
#  4 = b100 -> P3
bitmask = np.power(2, np.arange(len(cols)))

# Generate the label for each possible combination of products
# 0 = b000 -> Nothing
# 1 = b001 -> P1
# 2 = b010 -> P2
# 3 = b011 -> P1   P2
# ...
# 7 = b111 -> P1   P2   P3
labels = np.array([
    "   ".join(cols[np.bitwise_and(bitmask, i) != 0])
    for i in range(2 ** len(cols))
])
labels[0] = "Nothing"

# Assign each row in the df to a group based on the
# product combination (P1, P1   P2, etc.)
# Using numpy broadcasting to multiply the product list
# with the bitmask:
#   df[cols]    *  bitmask[None, :]  -> groups
#   P1  P2  P3     
#   1   0   0      1   2   4            1*1   0*2   0*4 = 1
#   1   1   0      1   2   4            1*1   1*2   0*4 = 3
#   0   1   1      1   2   4            0*1   1*2   1*4 = 6
#   1   1   1      1   2   4            ...             = 7
#   1   0   0      1   2   4            ...             = 1
groups = (df[cols].to_numpy() * bitmask[None, :]).sum(axis=1)

# The result
df.groupby(labels[groups]).agg(**{
    "Count of Customers": ("customer", "count"),
    "Sum of Revenue": ("revenue", "sum")
}).reindex(labels, fill_value=0)

CodePudding user response：

You can do something like this:

In [2692]: from itertools import product

In [2673]: x = df.pivot(['P1', 'P2', 'P3'], 'customer', 'revenue')
In [2661]: ix = pd.MultiIndex.from_tuples(list(product(set(range(2)),repeat = 3)), names=['P1', 'P2', 'P3'])

In [2688]: s = pd.DataFrame(index=ix)

In [2681]: output = s.join(x).sum(1).to_frame('Sum of revenue')
In [2686]: output['Count of Customers'] = s.join(x).count(1)

In [2687]: output
Out[2687]: 
          Sum of revenue  Count of Customers
P1 P2 P3                                    
0  0  0              0.0                   0
      1              0.0                   0
   1  0              0.0                   0
      1              3.0                   1
1  0  0              6.0                   2
      1              0.0                   0
   1  0              2.0                   1
      1              4.0                   1

CodePudding user response：

my_df = df.groupby(df.columns.tolist()[1:]).agg({'P1':'size', 'revenue':"sum"})
my_df.columns = ['Count of Customers ', 'Sum of revenue  ']