Python group by and count distinct values in a column and create delimited list-CodePudding

So I have found/figured out individual answers for counting the group by results, as well as returning the unique delimited list....but not for counting the unique values in group by.

I want to create a delimited field with only the unique values based on the first column, and then create a field that counts those unique values.

import pandas as pd

companies = ['Microsoft', 'Google', 'Amazon', 'Microsoft', 'Facebook', 'Google', 'Microsoft', 'Google']
products = ['OS', 'Search', 'E-comm', 'X-box', 'Social Media', 'Android', 'OS', 'Search']
df = pd.DataFrame({'company' : companies, 'product':products })

#this creates the non-unique delimited list and counts total occurances
df2 = df.groupby('company').product.agg([('count', 'count'), ('product', ', '.join)])

#this creates the unique delimited list, but doesn't count occurances
df3 = (df.groupby('company')['product']
       .apply(lambda x: list(x.unique()))
       .reset_index())


#This is my desired output
     company            product  count
0     Amazon           [E-comm]      1
1   Facebook     [Social Media]      1
2     Google  [Search, Android]      2
3  Microsoft        [OS, X-box]      2

I haven't been able to successfully blend these two, but I know it should be relatively simple and straightforward?

CodePudding user response：

You can drop_duplicates first and then agg:

print (df.drop_duplicates().groupby("company", as_index=False)["product"].agg([list, "count"]))

                        list  count
company                            
Amazon              [E-comm]      1
Facebook      [Social Media]      1
Google     [Search, Android]      2
Microsoft        [OS, X-box]      2

Or using unique and nunique:

print (df.groupby("company")["product"].agg(["unique", "nunique"]))

                      unique  nunique
company                              
Amazon              [E-comm]        1
Facebook      [Social Media]        1
Google     [Search, Android]        2
Microsoft        [OS, X-box]        2

CodePudding user response：

You can use str.len in your code:

df3 = (df.groupby('company')['product']
         .apply(lambda x: list(x.unique()))
         .reset_index()
         .assign(count=lambda d: d['product'].str.len())  ## added line
      )

output:

     company            product  count
0     Amazon           [E-comm]      1
1   Facebook     [Social Media]      1
2     Google  [Search, Android]      2
3  Microsoft        [OS, X-box]      2