Computing averages from two dataframes


One dataframe looks like this (these are just small samples):

user    item     rating
u1      i1       0
u1      i2       1
u1      i3       1
u2      i2       1
u2      i3       0
u2      i4       0
u2      i5       0
u3      i6       1
u3      i7       1
u4      i4       1
u4      i7       0

And another one like this:

item     property_1    ...   property_n    grade
i1       0.60                1             Bad
i2       0.90                1             Good
i3       0.90                0             Good
i4       0.75                1             Bad

There might be items that are in the first dataframe and not in the second one, and vice versa.

What I want to find out is whether items that are bad are really badly rated, and whether items that are good are really well rated. I know I can get the average item rating like this:

df.groupby(['item'])['rating'].mean()

And if I group by the user instead, I get the average rating given by each user.
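
For concreteness, here is a minimal, self-contained sketch of those two calls; the frame below is just a reconstruction of the first sample table, and the variable name df is an assumption.

import pandas as pd

# reconstruction of the first sample table (names and layout assumed)
df = pd.DataFrame({
    'user':   ['u1', 'u1', 'u1', 'u2', 'u2', 'u2', 'u2', 'u3', 'u3', 'u4', 'u4'],
    'item':   ['i1', 'i2', 'i3', 'i2', 'i3', 'i4', 'i5', 'i6', 'i7', 'i4', 'i7'],
    'rating': [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
})

df.groupby('item')['rating'].mean()   # average rating each item received
df.groupby('user')['rating'].mean()   # average rating each user gave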

But how can I build a dataframe like the following?

item      average_rating      grade
i1                    0         Bad
i2                    1        Good
i3                  0.5        Good
i4                  0.5         Bad

A bit of an extra question: how could I incorporate the information about the users? Maybe one user gave a lot of 0 ratings, but you can't blame them because the items they rated were considered bad anyway, while another user gave a lot of bad ratings to items that were considered good. A plain per-user average doesn't capture this, since not all items are in the second dataframe and the average alone doesn't tell us whether a user's reviews were consistent with the grades (they might have rated 5 bad items well and 5 good items badly). So I would instead convert the item grades to 0 and 1 and build something like this:

user       reviews_given           item_grades       consistency
u1               [0,1,1]               [0,1,1]              1.00
u80            [0,1,0,1]             [1,0,1,0]              0.00

Would such lists work? And how could I build a dataframe like that?

CodePudding user response:

To the first question: You can use df.merge() to do a merge (like a SQL join) on the two tables.

df_group = df.groupby('item')['rating'].mean().reset_index() # reset_index turns the result back into a dataframe with an 'item' column, which is what we need for the merge
df_group.merge(df2[['item','grade']],on='item') # i just pull from df2 the two columns that I need
    item    rating  grade
0   i1      0.0     Bad
1   i2      1.0     Good
2   i3      0.5     Good
3   i4      0.5     Bad
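
The question notes that some items appear in only one of the two tables. The default merge is an inner join, so those items simply drop out of the result. If you would rather keep every item that has ratings (its grade comes out as NaN), a left merge is one option; this is just a sketch of the same call with a different how:

df_group.merge(df2[['item','grade']],on='item',how='left') # keeps items that have ratings but no grade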

CodePudding user response:

To the second question: you can also use the merge, plus a few extra steps.

Below I consider df to be the first table you posted and df2 the second one you posted.

df3 = df.merge(df2[['item','grade']],on='item') # similar to above, but we haven't grouped yet so we still have the user column
df3['grade'] = df3['grade'].map({'Bad':0,'Good':1}) # convert grades to numbers
df3['match?'] = (df3['rating'] == df3['grade']).astype(int) # new column that is 1 if the rating and the grade match, 0 if they don't
df4 = df3.groupby('user').agg({'rating':list,'grade':list,'match?':'mean'}) # group by user, collecting ratings and grades into lists and taking the mean of the matches
df4
         rating      grade    match?
user
u1    [0, 1, 1]  [0, 1, 1]  1.000000
u2    [1, 0, 0]  [1, 1, 0]  0.666667
u4          [1]        [0]  0.000000
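
One caveat: u3 does not appear in the result, because none of their items (i6, i7) are in df2 and the inner merge drops those rows. If you want every user listed anyway, a possible variant (a sketch, assuming df and df2 are the two tables from the question) is a left merge that simply leaves ungraded items out of the consistency average:

import numpy as np

df3 = df.merge(df2[['item','grade']],on='item',how='left')  # keep reviews of ungraded items too
df3['grade'] = df3['grade'].map({'Bad':0,'Good':1})         # stays NaN where the item has no grade
df3['match?'] = np.where(df3['grade'].isna(), np.nan,
                         (df3['rating'] == df3['grade']).astype(float))  # NaN matches are skipped by the mean
df3.groupby('user').agg({'rating':list,'grade':list,'match?':'mean'})
# u3 now shows up, with NaN consistency because none of their items were graded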

