This is a recommender system and I have a Dataframe
which contains about 10 recommended item for each user (recommendation_df
) and I have another Dataframe
which consist of the recent purchases of each user (recent_df
).
I am trying to code out this task but I can't seem to get along the syntax, and the manipulation
I am implementing a hit/miss ratio, basically for every new_party_id
in recent_df
, if any of the merch_store_code
matches the merch_store_code
for the same party_id
in the recommendation_df
, count =1
(Hit)
Then calculating the hit/miss ratio by using count/total user count
(However in recent_df, each user might have multiple recent purchases, but if any of the purchases is in the list of recommendations_list for the same user, take it as a hit (count =1)
recommendation_df
-------------- ---------------- ----------- ----------
|party_id_index|merch_store_code| rating| party_id|
-------------- ---------------- ----------- ----------
| 148| 900000166| 0.4021678|G18B00332C|
| 148| 168339566| 0.27687865|G18B00332C|
| 148| 168993309| 0.15999989|G18B00332C|
| 148| 168350313| 0.1431974|G18B00332C|
| 148| 168329726| 0.13634883|G18B00332C|
| 148| 168351967|0.120235085|G18B00332C|
| 148| 168993312| 0.11800903|G18B00332C|
| 148| 168337234|0.116267696|G18B00332C|
| 148| 168993256| 0.10836013|G18B00332C|
| 148| 168339482| 0.10341005|G18B00332C|
| 463| 168350313| 0.93455887|K18M926299|
| 463| 900000072| 0.8275664|K18M926299|
| 463| 700012303| 0.70220494|K18M926299|
| 463| 700012180| 0.23209469|K18M926299|
| 463| 900000157| 0.1727839|K18M926299|
| 463| 700013689| 0.13854747|K18M926299|
| 463| 900000166| 0.12866624|K18M926299|
| 463| 168993284|0.107065596|K18M926299|
| 463| 168993269| 0.10272527|K18M926299|
| 463| 168339566| 0.10256036|K18M926299|
-------------- ---------------- ----------- ----------
recent_df
------------ --------------- ----------------
|new_party_id|recent_purchase|merch_store_code|
------------ --------------- ----------------
| A11275842R| 2022-05-21| 168289403|
| A131584211| 2022-06-01| 168993311|
| A131584211| 2022-06-01| 168349493|
| A131584211| 2022-06-01| 168350192|
| A182P3539K| 2022-03-26| 168341707|
| A182V2883F| 2022-05-26| 168350824|
| A183B5482P| 2022-05-10| 168993464|
| A183C6900K| 2022-05-14| 168338795|
| A183D56093| 2022-05-20| 700012303|
| A183J5388G| 2022-03-18| 700013650|
| A183U8880P| 2022-04-01| 900000072|
| A183U8880P| 2022-04-01| 168991904|
| A18409762L| 2022-05-10| 168319352|
| A18431276J| 2022-05-14| 168163905|
| A18433684M| 2022-03-21| 168993324|
| A18433978F| 2022-05-20| 168341876|
| A184410389| 2022-05-04| 900000166|
| A184716280| 2022-04-06| 700013653|
| A18473797O| 2022-05-24| 168330339|
| A18473797O| 2022-05-24| 168350592|
------------ --------------- ----------------
Here is my current coding logic:
count = 0
def hitratio(recommendation_df, recent_df):
for i in recent_df['new_party_id']:
for j in recommendation_df['party_id']:
if (i = j) & i.merch_store_code == j.merch_store_code:
count = 1
return (count/recent_df.count())
CodePudding user response:
assumption : i am taking all the count rows of recent df as denominator for calculating the hit/miss ratio you can change the formula.
from pyspark.sql import functions as F
matching_cond = ((recent_df["merch_store_code"]==recommender_df["merch_store_code"]) &(recommendation_df["party_id"].isNotNull()))
df_recent_fnl= df_recent.join(recommendation_df,df_recent["new_party_id"]==recommendation_df["party_id"],"left")\
.select(df_recent["*"],recommender_df["merch_store_code"],recommendation_df["party_id"])\
.withColumn("hit",F.when(matching_cond,F.lit(True)).otherwise(F.lit(False)))\
.withColumn("hit/miss",df_recent_fnl.filter(F.col("hit")).count()/df_recent.count())
do let me know if you have any questions around this .
If you like my solution , you can upvote
CodePudding user response:
In Spark, refrain from loops on rows. Spark does not work like that, you need to think of the whole column, not about row-by-row scenario.
You need to join both tables and select users, but they need to be without duplicates (distinct)
from pyspark.sql import functions as F
df_distinct_matches = (
recent_df
.join(recommendation_df, F.col('new_party_id') == F.col('party_id'))
.select('party_id').distinct()
)
hit = df_distinct_matches.count()