I have a dataframe with many columns.
I want to do something like the summary tool in Alteryx. i.e. a simple group by. I am not interested in the statistics like mean or count or size.
I dont want a hierarchial output
How can I do it?
df.head()
Unnamed: 0 home_odds draw_odds away_odds country league datetime home_team away_team home_score away_score
-- ------------ ----------- ----------- ----------- --------- ---------------- ------------------- --------------- ----------- ------------ ------------
0 0 1.36 4.31 7.66 Brazil Copa do Nordeste 2020-02-07 00:00:00 Sport Recife Imperatriz 2 2
1 1 2.62 3.3 2.48 Brazil Copa do Nordeste 2020-02-02 22:00:00 ABC America RN 2 1
2 2 5.19 3.58 1.62 Brazil Copa do Nordeste 2020-02-02 00:00:00 Frei Paulistano Nautico 0 2
3 3 2.06 3.16 3.5 Brazil Copa do Nordeste 2020-02-02 22:00:00 Botafogo PB Confianca 1 1
4 4 2.19 2.98 3.38 Brazil Copa do Nordeste 2020-02-02 22:00:00 Fortaleza Ceara 1 1
I want a simple grouped by these two columns
e.g.
country league
-- --------- ----------------
0 Brazil Copa do Nordeste
1 World World Cup
2 Australia FFA Cup
3 Australia A League
4 Brazil Serie A
What is the correct code for it?
CodePudding user response:
If you want to create a dataframe with two columns which shows which country has which league, then select the relevant columns and .drop_duplicates
method:
out = df[['country', 'league']].drop_duplicates()
But if you want to create a dataframe that keeps data from the other columns, you might want to do:
out = df.groupby(['country', 'league']).agg(list)
CodePudding user response:
You can try using drop_duplicates() function. For example in first dataframe below, you can see 13 records and in second only 6. It was thanks to df = df.drop_duplicates()