How to do a simple group by without any statistics like mean or count in Pandas?

Time:12-09

I have a dataframe with many columns.

I want to do something like the summary tool in Alteryx, i.e. a simple group by. I am not interested in statistics like mean, count, or size.

I don't want a hierarchical output.

How can I do it?

df.head()

      Unnamed: 0    home_odds    draw_odds    away_odds  country    league            datetime             home_team        away_team      home_score    away_score
--  ------------  -----------  -----------  -----------  ---------  ----------------  -------------------  ---------------  -----------  ------------  ------------
 0             0         1.36         4.31         7.66  Brazil     Copa do Nordeste  2020-02-07 00:00:00  Sport Recife     Imperatriz              2             2
 1             1         2.62         3.3          2.48  Brazil     Copa do Nordeste  2020-02-02 22:00:00  ABC              America RN              2             1
 2             2         5.19         3.58         1.62  Brazil     Copa do Nordeste  2020-02-02 00:00:00  Frei Paulistano  Nautico                 0             2
 3             3         2.06         3.16         3.5   Brazil     Copa do Nordeste  2020-02-02 22:00:00  Botafogo PB      Confianca               1             1
 4             4         2.19         2.98         3.38  Brazil     Copa do Nordeste  2020-02-02 22:00:00  Fortaleza        Ceara                   1             1

I want a simple group by on two columns, country and league,

e.g.

    country    league
--  ---------  ----------------
 0  Brazil     Copa do Nordeste
 1  World      World Cup
 2  Australia  FFA Cup
 3  Australia  A League
 4  Brazil     Serie A

What is the correct code for it?

CodePudding user response:

If you want a dataframe with two columns that shows which country has which league, select the relevant columns and call the .drop_duplicates method:

out = df[['country', 'league']].drop_duplicates()

But if you want to create a dataframe that keeps data from the other columns, you might want to do:

out = df.groupby(['country', 'league']).agg(list)
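
Both approaches can be tried on a small example. This is only a sketch: the sample rows below are made up for illustration (they are not the asker's data), and the reset_index calls are an addition aimed at the "no hierarchical output" requirement rather than part of the answer above.

```python
import pandas as pd

# Hypothetical sample shaped like the question's dataframe (values are illustrative).
df = pd.DataFrame({
    "country": ["Brazil", "Brazil", "World", "Australia", "Australia", "Brazil"],
    "league": ["Copa do Nordeste", "Copa do Nordeste", "World Cup",
               "FFA Cup", "A League", "Serie A"],
    "home_score": [2, 2, 1, 0, 3, 1],
})

# Approach 1: unique (country, league) pairs in order of first appearance.
# reset_index(drop=True) renumbers the rows 0..n-1.
pairs = df[["country", "league"]].drop_duplicates().reset_index(drop=True)

# Approach 2: keep the other columns, collected into one list per group.
# sort=False keeps first-appearance order; reset_index() turns the
# (country, league) MultiIndex back into ordinary columns.
lists = df.groupby(["country", "league"], sort=False).agg(list).reset_index()

print(pairs)
print(lists)
```

In this sketch, pairs has one row per distinct country/league combination, and lists has the same rows plus a home_score column holding a Python list for each group.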

CodePudding user response:

You can try the drop_duplicates() function. For example, in the screenshot below, the first dataframe has 13 records and the second only 6. That reduction comes from df = df.drop_duplicates()

[image: dataframe before and after drop_duplicates()]
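
One point worth checking when calling drop_duplicates() on the whole dataframe: it only removes rows that are identical in every column. The sketch below uses made-up rows (an assumption, not the data from the screenshot) to contrast that with the subset parameter, which deduplicates on the chosen columns only.

```python
import pandas as pd

# Hypothetical rows: the first two are identical, the third differs only in home_score.
df = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Brazil", "World"],
    "league": ["Serie A", "Serie A", "Serie A", "World Cup"],
    "home_score": [1, 1, 2, 0],
})

# Whole-row comparison: only the exact duplicate (second row) is dropped.
whole_rows = df.drop_duplicates()

# Comparison on country and league only: the first row of each pair is kept.
by_pair = df.drop_duplicates(subset=["country", "league"])

print(len(df), len(whole_rows), len(by_pair))  # 4 3 2
```

With many columns, as in the question, df.drop_duplicates() alone may leave several rows per country/league pair, so selecting the two columns first or passing subset is likely closer to what was asked.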
