In a database, I have data that looks like this (just a snippet):
itemId userId action likeorDislike timestamp
i1 u1 rate 0 2021-06-09 10:43:57.827 UTC
i1 u1 rate 1 2021-06-10 10:43:57.827 UTC
i1 u2 rate 1 2021-06-09 11:43:57.827 UTC
i1 u3 rate 1 2021-06-09 12:43:57.827 UTC
i2 u6 rate 1 2021-06-09 10:43:57.827 UTC
i2 u6 rate 0 2021-08-09 10:43:57.827 UTC
i2 u1 rate 0 2021-06-12 10:43:57.827 UTC
i4 u1 rate 1 2021-06-09 10:45:57.827 UTC
i4 u1 rate 1 2021-06-09 10:48:57.827 UTC
i4 u3 rate 1 2021-06-09 10:45:58.827 UTC
i1 u5 select 2021-06-09 10:45:58.827 UTC
I want to compute a score for each itemId
, which is the number of 1
's over the sum of likeorDislike
for that particular item, but only take into consideration the first attempt of a user to rate that particular item.
itemId score
i1 0.66
i2 0.5
i4 1
What I did: Only kept the data that interests me (ordered by timestamp, cleaned etc.), but I don't know how to compute the above scores now:
itemId likeOrDislike
i1 0
i1 1
i2 1
i1 1
i2 0
i4 1
i4 1
I also wrote a SQL query that does compute these scores but doesn't take into consideration the timestamp.
%%bigquery counting_answers_new
SELECT trueAnswers.itemId, IFNULL(falseAnswers.answerCount, 0) as falseAnswerCount, trueAnswers.answerCount as trueAnswerCount, IFNULL((trueAnswers.answerCount/(falseAnswers.answerCount trueAnswers.answerCount)), 1) as score
FROM
( SELECT itemId, likeorDislike, action, COUNT(*) as answerCount
FROM `mydata`
GROUP BY itemId, likeorDislike, action
HAVING action='rate' AND likeorDislike=false
ORDER BY itemId) as falseAnswers
RIGHT JOIN
( SELECT itemId, likeorDislike, acton, COUNT(*) as answerCount
FROM `mydata`
GROUP BY itemId, likeorDislike, action
HAVING action='rate' AND likeorDislike=true
ORDER BY itemId) as trueAnswers
ON falseAnswers.itemId = trueAnswers.itemId
I would like to do it in pandas
and proceed from the dataframe with two columns that I have above because I already cleaned it/filtered it. I know how to count occurrences of 0's and 1's, but how do I do that pro item, and how do I compute these scores?
CodePudding user response:
It seems like you just want the mean of the itemId's likeOrDislike. In pandas you could do this:
df[["itemId", "likeOrDislike"]].groupby("itemId").mean()