Pandas unique count of number of records in a column where mutliple values are present in another co-CodePudding

I'm trying to count the unique number of Customer_Key where the column Broad_Category has both the values A and B grouped by values in column Month. The sample dataframe is as follows

Customer_Key	Category	Month
ck123	A	2
ck234	A	2
ck234	B	2
ck680	A	3
ck123	B	3
ck123	A	3
ck356	B	3
ck345	A	4

The expected outcome is

Month	Unique Customers
2	1
3	1
4	0

I'm not able to think of something here. Any lead/help will be appreciated. Thanks in advance.emphasized text

CodePudding user response：

Here is one way to accomplish it

first its grouping by Month and Customer that get us the customer within a month with the count of the categories. The result is further grouped by Month and we choose the max count.

Decrementing the count give us the required count of unique customer belonging to both categories

hope it helps

df2=df.groupby(['Month','Customer_Key']).count().reset_index().groupby(['Month'])['Category'].max().reset_index() 
df2['Category'] = df2['Category'] -1
df2.rename(columns={'Category': 'Unique Cusomter'}, inplace=True)
df2

    Month   Unique Cusomter
0   2   1
1   3   1
2   4   0

CodePudding user response：

Try something like that:

df.groupby(['Customer_Key', 'Month']) \
  .sum() \
  .query("Category in ('AB','BA')") \
  .groupby('Month') \
  .count() \
  .rename(columns={'Category': 'Unique Customers'})

Edit...

The issue with this solution is that it does not count months with 0. I have prepared a fix:

import pandas as pd
import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

data = StringIO("""ck123    A   2
ck234   A   2
ck234   B   2
ck680   A   3
ck123   B   3
ck123   A   3
ck356   B   3
ck345   A   4""")

df1 = df.groupby(['Customer_Key', 'Month']) \
        .sum() \
        .reset_index()

def map_categories(column):
  if column['Category'] in ('AB', 'BA'):
    return 1
  else:
    return 0

df1['Unique Customers'] = df1.apply(lambda row: map_categories(row), axis=1)

df1 = df1.groupby('Month')['Unique Customers'].sum().reset_index()