Relating value_counts() of a variable with mean() of another variable

I have a DataFrame listing consumer complaints, and I am analyzing two of its columns: 'Company', which has repeating values, and 'Grade', an integer from 1 to 5 given by the customer.

It looks like this:

    | Company   |  Grade  |
-------------------------------
0   | Company1  |    5    |
1   | Company64 |    2    |
2   | Company1  |    1    |
3   | Company6  |    3    |
...

I want to analyze the relation between the number of complaints for each company and the mean() of the grades given by customers for each company.

I'm not sure about the best way to do this. So far I have created another df (df2) to hold those values:

df2 = pd.DataFrame(columns= ('Complaints','Grade'))
df2['Complaints'] = df['Company'].value_counts()
df2

          |  Complaints   |   Grade   |
--------------------------------------
Company1  |     5549      |    NaN
Company23 |     5403      |    NaN
Company8  |     3883      |    NaN
Company30 |     2493      |    NaN

I'm not sure how to insert the mean of df.Grade for each company into df2.

Something is telling me that I can use multi-indexing for that, by grouping the companies on the first df, but I'm unsure how to proceed with that.

Afterwards I will plot df2 with seaborn in a way that clearly shows the relation between those two variables, so if there is a graph that would shortcut this, please let me know.

CodePudding user response：

I would try the update function, which aligns on the index and on matching column names and fills df2 in place:

df2.update(df1.groupby(['Company']).mean())
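
As a minimal sketch of how that alignment plays out, the snippet below rebuilds a small df2 and fills its Grade column with update(). The sample rows are made up for illustration, and df2 is constructed directly from value_counts() rather than exactly as in the question, so treat it as one possible setup and not the asker's real data.

```python
import pandas as pd

# Hypothetical sample data standing in for the complaints frame.
df1 = pd.DataFrame({
    "Company": ["Company1", "Company64", "Company1", "Company6"],
    "Grade": [5, 2, 1, 3],
})

# One row per company (the company name is the index); Grade is still empty.
df2 = pd.DataFrame({"Complaints": df1["Company"].value_counts()})
df2["Grade"] = float("nan")

# update() matches rows by index (company name) and columns by name (Grade),
# then overwrites df2 in place with the non-NA values from the other frame.
df2.update(df1.groupby("Company")[["Grade"]].mean())

print(df2)
```

Selecting [["Grade"]] before mean() keeps the right-hand side to the one column that should be copied, which also avoids trouble if the original frame has non-numeric columns.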

CodePudding user response：

Since each company has multiple grades, you can group by "Company" and take the mean grade for each company.

grades = df1.groupby('Company')['Grade'].mean()

Then count the complaints per company. Note that df2 has no 'Company' column (the companies are its index), so count the rows of df1 instead:

complaints = df1.groupby('Company').size().rename('Complaints')

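Now combine the two. A Series has no join() method, so use pd.concat, which aligns both on the company index:

data = pd.concat([grades, complaints], axis=1)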

Finally, since you want to see the relationship between mean grade and number of complaints, you could use a scatter plot:

data.plot(kind='scatter', x='Complaints', y='Grade')
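
The whole pipeline can also be sketched in one step with named aggregation, plus a correlation coefficient as a quick numeric check of the relation. The sample rows and the names summary, correlation, and plot_relation are made up for illustration, and the plotting helper assumes seaborn and matplotlib are installed.

```python
import pandas as pd

# Hypothetical sample data standing in for the complaints frame.
df1 = pd.DataFrame({
    "Company": ["Company1", "Company64", "Company1",
                "Company6", "Company64", "Company1"],
    "Grade": [5, 2, 1, 5, 4, 2],
})

# Named aggregation: one row per company, with the number of (non-null)
# grades as the complaint count and the mean grade alongside it.
summary = df1.groupby("Company")["Grade"].agg(Complaints="count", Grade="mean")

# Pearson correlation between complaint count and mean grade.
correlation = summary["Complaints"].corr(summary["Grade"])

def plot_relation(data):
    """Scatter plot with a fitted regression line (needs seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.regplot(data=data, x="Complaints", y="Grade")
    plt.show()

print(summary)
print(correlation)
```

Calling plot_relation(summary) draws the scatter plot with a trend line, which may be the seaborn shortcut the question asks about; sns.scatterplot with the same arguments would give the points alone.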