Group by column and have a column with a value_counts dictionary PYSPARK



I have a sales table with customer purchase history. I want to build a new dataframe grouped by customer, with a column containing a value_counts dictionary of all the products that customer has purchased and how many of each.

THIS IS WHAT MY DATAFRAME LOOKS LIKE NOW

              Description                                     Counts
CustomerID
3004000304    MAJOR APPLIANCES,HOME OFFICE, OTHER STUFF          3
3004000304    HOME OFFICE, MAJOR APPLIANCES                      2
3004000304    ACCESSORIES, OTHER STUFF                           2
3004002756    MAJOR APPLIANCES, ACCESSORIES                      2
3004002946    HOME OFFICE, HOME OFFICE                           2
3004002946    ACCESSORIES, MAJOR APPLIANCES                      2
3004002946    MAJOR APPLIANCES, OTHER STUFF, ACCESSORIES         3 

I WANT THIS

              Counts
CustomerID
3004000304    {'MAJOR APPLIANCES': 2, 'HOME OFFICE': 2, 'ACCESSORIES': 1, 'OTHER STUFF': 2}
3004002756    {'MAJOR APPLIANCES': 1, 'ACCESSORIES': 1}
3004002946    {'HOME OFFICE': 2, 'ACCESSORIES': 2, 'MAJOR APPLIANCES': 1, 'OTHER STUFF': 1}

MY APPROACH: I converted the PySpark DF to a Pandas DF, as I don't have much experience with PySpark and am new to it. Direct PySpark help would be much appreciated, but even a pandas implementation would be great.

f = lambda x: dict(zip(x['Description'], x['Counts']))
df = categories.groupby(level=0).apply(f).to_frame('Counts')
print (df)

THIS IS NOT GIVING ME THE EXPECTED RESULT

CodePudding user response:

For the pandas version, you can do:

from collections import Counter

# Split the comma-separated Description string into a list of trimmed product names
df['Desc'] = df['Description'].apply(lambda x: x.split(","))
df['Desc'] = df['Desc'].apply(lambda x: [e.strip() for e in x])

# Concatenate each customer's lists (agg(sum) adds the lists together),
# then count occurrences of each product with a Counter
df2 = df.groupby("CustomerID")['Desc'].agg(sum).to_frame()
df2['Counts'] = df2['Desc'].apply(Counter)
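
Applied to the question's sample data this yields one Counter per customer; Counter is a dict subclass, so it already behaves like the requested value_counts dictionary. Below is a minimal self-contained sketch of the same idea (sample rows rebuilt from the post; the explicit lambda-sum and the dict() conversion are my additions, not part of the answer above):

import pandas as pd
from collections import Counter

# Sample rows reconstructed from the question
df = pd.DataFrame({
    "CustomerID": [3004000304, 3004000304, 3004000304, 3004002756,
                   3004002946, 3004002946, 3004002946],
    "Description": [
        "MAJOR APPLIANCES,HOME OFFICE, OTHER STUFF",
        "HOME OFFICE, MAJOR APPLIANCES",
        "ACCESSORIES, OTHER STUFF",
        "MAJOR APPLIANCES, ACCESSORIES",
        "HOME OFFICE, HOME OFFICE",
        "ACCESSORIES, MAJOR APPLIANCES",
        "MAJOR APPLIANCES, OTHER STUFF, ACCESSORIES",
    ],
})

# Split each Description into a list of trimmed product names
df["Desc"] = df["Description"].apply(lambda x: [e.strip() for e in x.split(",")])

# Concatenate each customer's lists, count with Counter,
# and convert to a plain dict to match the expected output
counts = (df.groupby("CustomerID")["Desc"]
            .agg(lambda s: sum(s, []))
            .apply(lambda lst: dict(Counter(lst))))
print(counts.to_frame("Counts"))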

CodePudding user response:

For a PySpark solution, you can refer to this SO answer.

The code will be as follows:

from pyspark.sql import functions as F

# Collect each customer's (Description, Counts) pairs into a single map column
df.groupBy("CustomerID").agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("Description", "Counts"))).alias("Description_Counts")
).show(truncate=False)

Note that map_from_entries is only available in Spark 2.4.0 and later.
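
One caveat: the snippet above maps each whole Description string to its Counts value, while the sample data stores several comma-separated products per row. A hedged sketch of a fuller pipeline (my addition, not part of the answer above) would first split and explode the products, count per (customer, product), and only then build the map. This assumes each listed product counts once per row, which matches the sample where Counts equals the number of items in Description:

from pyspark.sql import functions as F

# Split the comma-separated Description and explode to one product per row
exploded = (df
    .select("CustomerID",
            F.explode(F.split("Description", ",")).alias("Product"))
    .withColumn("Product", F.trim("Product")))

# Count purchases per (customer, product), then collect into a map column
result = (exploded
    .groupBy("CustomerID", "Product").count()
    .groupBy("CustomerID")
    .agg(F.map_from_entries(
        F.collect_list(F.struct("Product", "count"))).alias("Counts")))

result.show(truncate=False)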
