Home > Software engineering >  Pandas; Need to combine duplicate columns, and find the mean of another column
Pandas; Need to combine duplicate columns, and find the mean of another column

Time:11-24

I have this data frame with about 200 rows, and I need to combine the duplicate writers columns, and then find the mean value of their viewership. How can I accomplish this? Below is a sample of the data frame.

     Viewership               Writers
0         11.20        Ricky Gervais 
1         11.20     Stephen Merchant 
2         11.20          Greg Daniels
3          8.70          Greg Daniels
4         10.30          Mindy Kaling
..          ...                   ...
192        3.25     Halsted Sullivan 
193        3.25    Warren Lieberstein
194        3.51  Niki Schwartz-Wright
195        4.56       Brent Forrester
196        5.69          Greg Daniels

[197 rows x 2 columns]

My solution was:

mean = df2.groupby(['Writers']).mean()
print(mean)

However, it still lists all the writers with duplicates, and does not combine their viewership into a mean value. The result I get is:

                      Viewership
Writers                         
 Brent Forrester        7.560000
 Gabe Miller            4.165000
 Gene Stupnitsky        8.618333
 Gene Stupnitsky       10.200000
 Greg Daniels           9.200000
 Halsted Sullivan       7.503333
 Justin Spitzer         7.670000
 Lee Eisenberg          7.867143
 Lee Eisenberg         10.120000
 Michael Schur          9.040000
 Mindy Kaling           9.420000
 Paul Lieberstein       7.395000
 Stephen Merchant       9.070000
 Stephen Merchant      11.200000
 Warren Lieberstein     5.280000
Aaron Shure             7.173333
Allison Silverman       4.746667
Amelie Gillette         5.655000
Anthony Q. Farrell      8.315000
B. J. Novak             7.718182
Brent Forrester         7.348889
Brent Forrester         7.670000
Caroline Williams       8.840000
Steve Carell            7.945000
Steve Hely              6.073333
Tim McAuliffe           3.440000
Warren Lieberstein      7.503333

I did my best to preserve the indentations I get in my results. As you can see, some writers have a whitespace at the beginning of their name. I'm sure this may be causing my issue?

CodePudding user response:

Try:

df2['Writers'] = df2['Writers'].str.strip()
mean = df2.groupby(['Writers']).mean()
print(mean)

This should remove any whitespace issues before grouping

  • Related