I have a sales table with customer purchase history. I want to make a new dataframe grouped by customer. The dataframe should also include a column with a value_counts-style dictionary of all the products that customer has purchased and the count of how many of each product.
THIS IS WHAT MY DATAFRAME LOOKS LIKE NOW
Description Counts
CustomerID
3004000304 MAJOR APPLIANCES,HOME OFFICE, OTHER STUFF 3
3004000304 HOME OFFICE, MAJOR APPLIANCES 2
3004000304 ACCESSORIES, OTHER STUFF 2
3004002756 MAJOR APPLIANCES, ACCESSORIES 2
3004002946 HOME OFFICE, HOME OFFICE 2
3004002946 ACCESSORIES, MAJOR APPLIANCES 2
3004002946 MAJOR APPLIANCES, OTHER STUFF, ACCESSORIES 3
I WANT THIS
Counts
CustomerID
3004000304 {'MAJOR APPLIANCES': 2, 'HOME OFFICE': 2, 'ACCESSORIES': 1, 'OTHER STUFF': 2}
3004002756 {'MAJOR APPLIANCES': 1, 'ACCESSORIES': 1}
3004002946 {'HOME OFFICE': 2, 'ACCESSORIES': 2, 'MAJOR APPLIANCES': 1, 'OTHER STUFF': 1}
MY APPROACH: I converted the PySpark DF to a pandas DF, as I don't have much experience with PySpark and am new to it. Direct PySpark help would be much appreciated, but even a pandas implementation would be great.
f = lambda x: dict(zip(x['Description'], x['Counts']))
df = categories.groupby(level=0).apply(f).to_frame('Counts')
print (df)
THIS IS NOT GIVING ME THE EXPECTED RESULT
CodePudding user response:
For a pandas version, you may do:
from collections import Counter

# Split the comma-separated Description string into a list of product names
df['Desc'] = df['Description'].apply(lambda x: [e.strip() for e in x.split(",")])

# Concatenate each customer's lists, then count how many times each product appears
df2 = df.groupby("CustomerID")['Desc'].agg(sum).to_frame()
df2['Counts'] = df2['Desc'].apply(Counter)
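A minimal, self-contained sketch of how this runs on the sample data (the DataFrame construction below is an assumption reconstructed from the question; only CustomerID and Description are needed for this approach):
import pandas as pd
from collections import Counter

# Sample data reconstructed from the question (assumption, for illustration only)
df = pd.DataFrame({
    "CustomerID": [3004000304, 3004000304, 3004000304, 3004002756,
                   3004002946, 3004002946, 3004002946],
    "Description": [
        "MAJOR APPLIANCES,HOME OFFICE, OTHER STUFF",
        "HOME OFFICE, MAJOR APPLIANCES",
        "ACCESSORIES, OTHER STUFF",
        "MAJOR APPLIANCES, ACCESSORIES",
        "HOME OFFICE, HOME OFFICE",
        "ACCESSORIES, MAJOR APPLIANCES",
        "MAJOR APPLIANCES, OTHER STUFF, ACCESSORIES",
    ],
})

df['Desc'] = df['Description'].apply(lambda x: [e.strip() for e in x.split(",")])
df2 = df.groupby("CustomerID")['Desc'].agg(sum).to_frame()
df2['Counts'] = df2['Desc'].apply(Counter)

# df2['Counts'] now holds one Counter (a dict subclass) per customer, e.g. for 3004000304:
# Counter({'MAJOR APPLIANCES': 2, 'HOME OFFICE': 2, 'OTHER STUFF': 2, 'ACCESSORIES': 1})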
CodePudding user response:
For a PySpark solution you can refer to this SO answer.
The code will be as follows:
df.groupBy("CustomerID").agg(
F.map_from_entries(
F.collect_list(
F.struct("Description", "Counts"))).alias("Description_Counts")
).show(truncate=False)
Note that map_from_entries is only available from Spark version 2.4.0 onwards.
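Note also that this builds a map keyed by the full Description strings. If you want the per-product counts shown in the expected output, one possible sketch (an assumption, not the referenced answer; the Product column name is introduced here purely for illustration) is to split and explode Description first, count, and then build the map:
from pyspark.sql import functions as F

# Split the comma-separated Description into one row per product
# ("Product" is a new, illustrative column name, not from the original data)
exploded = (
    df.withColumn("Product", F.explode(F.split("Description", ",")))
      .withColumn("Product", F.trim(F.col("Product")))
)

# Count how many times each customer bought each product
per_product = exploded.groupBy("CustomerID", "Product").count()

# Fold the per-product counts back into a single map column per customer
result = per_product.groupBy("CustomerID").agg(
    F.map_from_entries(
        F.collect_list(F.struct("Product", "count"))
    ).alias("Counts")
)
result.show(truncate=False)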