Home > database >  Finding and most frequent string using groupby in pandas
Finding and most frequent string using groupby in pandas

Time:07-20

I'm trying to find the name of the person who submitted the most applications in any given year over a series of years.

Each application is its own row in the dataframe. It comes with the year it was submitted, and the applicant's name.

I tried using groupby to organize the data by year and name, then a variety of methods such as value_counts(), count(), max(), etc...

This is the closest I've gotten:

df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)

It produces the following output:

app_year_start        name               total_apps
2015                  John Smith         622
2013                  John Smith         614
2014                  Jane Doe           611
2016                  Jon Snow           549

My desired output:

app_year_start        name                  total_apps
2015                  top_applicant         max_num
2014                  top_applicant         max_num
2013                  top_applicant         max_num
2012                  top_applicant         max_num

Some lines of dummy data:

app_year_start        name
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  Jane Doe
2013                  Jane Doe
2012                  John Snow
2015                  John Snow
2014                  John Smith
2015                  John Snow
2012                  John Snow
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  Jane Doe
2013                  Jane Doe
2012                  John Snow
2015                  John Snow
2014                  John Smith
2015                  John Snow
2012                  John Snow
2012                  John Smith

I've consulted the follow SO posts:

result

CodePudding user response:

You can use mode per group:

df.groupby('app_year_start')['name'].agg(lambda x: x.mode().iloc[0])

Or, if you want all values joined as a single string in case of a tie:

df.groupby('app_year_start')['name'].agg(lambda x: ', '.join(x.mode()))

Output:

app_year_start
2012    John Smith
2013      Jane Doe
2014    John Smith
2015     John Snow
Name: name, dtype: object

Variant of your initial code:

(df
 .groupby(['app_year_start', 'name'])['name']
 .agg(total_apps='count')
 .sort_values(by='total_apps', ascending=False)
 .reset_index()
 .groupby('app_year_start', as_index=False)
 .first()
 )

Output:

   app_year_start        name  total_apps
0            2012  John Smith           8
1            2013    Jane Doe           2
2            2014  John Smith           2
3            2015   John Snow           4
  • Related