Home > OS >  Counting occurrences based on another column in a dataframe
Counting occurrences based on another column in a dataframe

Time:11-26

In a database, I have data that looks like this (just a snippet):

itemId   userId     action       likeorDislike     timestamp
i1       u1         rate         0                 2021-06-09 10:43:57.827 UTC
i1       u1         rate         1                 2021-06-10 10:43:57.827 UTC
i1       u2         rate         1                 2021-06-09 11:43:57.827 UTC
i1       u3         rate         1                 2021-06-09 12:43:57.827 UTC
i2       u6         rate         1                 2021-06-09 10:43:57.827 UTC
i2       u6         rate         0                 2021-08-09 10:43:57.827 UTC
i2       u1         rate         0                 2021-06-12 10:43:57.827 UTC
i4       u1         rate         1                 2021-06-09 10:45:57.827 UTC
i4       u1         rate         1                 2021-06-09 10:48:57.827 UTC
i4       u3         rate         1                 2021-06-09 10:45:58.827 UTC
i1       u5         select                         2021-06-09 10:45:58.827 UTC

I want to compute a score for each itemId, which is the number of 1's over the sum of likeorDislike for that particular item, but only take into consideration the first attempt of a user to rate that particular item.

itemId     score
i1         0.66                 
i2         0.5
i4         1

What I did: Only kept the data that interests me (ordered by timestamp, cleaned etc.), but I don't know how to compute the above scores now:

itemId     likeOrDislike     
i1         0      
i1         1 
i2         1
i1         1
i2         0
i4         1
i4         1

I also wrote a SQL query that does compute these scores but doesn't take into consideration the timestamp.

%%bigquery counting_answers_new
SELECT trueAnswers.itemId, IFNULL(falseAnswers.answerCount, 0) as falseAnswerCount, trueAnswers.answerCount as trueAnswerCount, IFNULL((trueAnswers.answerCount/(falseAnswers.answerCount trueAnswers.answerCount)), 1) as score
FROM 
 ( SELECT itemId, likeorDislike, action, COUNT(*) as answerCount
 FROM `mydata` 
 GROUP BY itemId, likeorDislike, action
 HAVING action='rate' AND likeorDislike=false
 ORDER BY itemId) as falseAnswers
 RIGHT JOIN
 ( SELECT itemId, likeorDislike, acton, COUNT(*) as answerCount
 FROM `mydata` 
 GROUP BY itemId, likeorDislike, action
 HAVING action='rate' AND likeorDislike=true
 ORDER BY itemId) as trueAnswers
 ON falseAnswers.itemId = trueAnswers.itemId

I would like to do it in pandas and proceed from the dataframe with two columns that I have above because I already cleaned it/filtered it. I know how to count occurrences of 0's and 1's, but how do I do that pro item, and how do I compute these scores?

CodePudding user response:

It seems like you just want the mean of the itemId's likeOrDislike. In pandas you could do this:

df[["itemId", "likeOrDislike"]].groupby("itemId").mean()
  • Related