I have number of matches, but only for home matches for given team. How can I sum values for duplica-CodePudding

My dataframe contains number of matches for given fixtures, but only for home matches for given team (i.e. number of matches for Argentina-Uruguay matches is 97, but for Uruguay-Argentina this number is 80). In short I want to sum both numbers of home matches for given teams, so that I have the total number of matches between the teams concerned. The dataframe's top 30 rows looks like this:

most_often = mc.groupby(["home_team", "away_team"]).size().reset_index(name="how_many").sort_values(by=['how_many'], ascending = False)
most_often = most_often.reset_index(drop=True)

most_often.head(30)

   home_team   away_team   how_many
0   Argentina   Uruguay     97
1   Uruguay     Argentina   80
2   Austria     Hungary     69
3   Hungary     Austria     68
4   Kenya       Uganda      65
5   Argentina   Paraguay    64
6   Belgium     Netherlands 63
7   Netherlands Belgium     62
8   England     Scotland    59
9   Argentina   Brazil      58
10  Brazil      Paraguay    58
11  Scotland    England     58
12  Norway      Sweden      56
13  England     Wales       54
14  Sweden      Denmark     54
15  Wales       Scotland    54
16  Denmark     Sweden      53
17  Argentina   Chile       53
18  Scotland    Wales       52
19  Scotland    Northern Ireland    52
20  Sweden      Norway      51
21  Wales       England     50
22  England     Northern Ireland    50
23  Wales       Northern Ireland    50
24  Chile       Uruguay     49
25  Northern Ireland    England 49
26  Brazil      Argentina   48
27  Brazil      Chile       48
28  Brazil      Uruguay     47
29  Chile       Peru        46

In turn, I mean something like this

0   Argentina   Uruguay     177
1   Uruguay     Argentina   177
2   Austria     Hungary     137
3   Hungary     Austria     137
4   Kenya       Uganda      107
5   Uganda      Kenya       107
6   Belgium     Netherlands 105
7   Netherlands Belgium     105

But this is only an example, I want to apply it for every team, which I have on dataframe.

What should I do?

CodePudding user response：

Ok, you can follow steps below.

Here is the initial df.

    home_team   away_team   how_many
0   Argentina   Uruguay 97
1   Uruguay Argentina   80
2   Austria Hungary 69
3   Hungary Austria 68
4   Kenya   Uganda  65

Here you need to create a siorted list that will be the key foraggregations.


df1['sorted_list_team'] = list(zip(df1['home_team'],df1['away_team']))
df1['sorted_list_team'] = df1['sorted_list_team'].apply(lambda x: np.sort(np.unique(x)))



home_team   away_team   how_many    sorted_list_team
0   Argentina   Uruguay 97  [Argentina, Uruguay]
1   Uruguay Argentina   80  [Argentina, Uruguay]
2   Austria Hungary 69  [Austria, Hungary]
3   Hungary Austria 68  [Austria, Hungary]
4   Kenya   Uganda  65  [Kenya, Uganda]

Here you will covert this list to tuple and turn it able to be aggregations.

def converter(list):
    return (*list, )

df1['sorted_list_team'] = df1['sorted_list_team'].apply(converter)

df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()

    sorted_list_team    how_many
0   (Argentina, Brazil) 106
1   (Argentina, Chile)  53
2   (Argentina, Paraguay)   64
3   (Argentina, Uruguay)    177
4   (Austria, Hungary)  137

Do the aggregation to make a sum of 'how_many' values in another dataframe that i call 'df_sum'.

df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()

sorted_list_team    how_many
0   (Argentina, Brazil) 106
1   (Argentina, Chile)  53
2   (Argentina, Paraguay)   64
3   (Argentina, Uruguay)    177
4   (Austria, Hungary)  137

And merge with 'df1' to get the result of a sum, the colum 'how_many' are in both dfs, for this reason pandas rename the column of df_sum as 'how_many_y'

df1 = pd.merge(df1,df_sum[['sorted_list_team','how_many']], on='sorted_list_team',how='left').drop_duplicates()

And final step you need select only columns that you need from result df.

df1 = df1[['home_team','away_team','how_many_y']]
df1 = df1.drop_duplicates()
df1.head()


home_team   away_team   how_many_y
0   Argentina   Uruguay 177
1   Uruguay Argentina   177
2   Austria Hungary 137
3   Hungary Austria 137
4   Kenya   Uganda  65

CodePudding user response：

I found a relatively straightforward thing that hopefully does what you want, but is slightly different than your desired output. Your output has what looks like repetitive information where we aren't caring anymore about home-vs-away team but just want the game counts, and so let's get rid of that distinction (if we can...).

If we make a new column that combines the values from home_team and away_team in the same order each time, we can just do a sum on the how_many where that new column matches

df['teams'] = pd.Series(map('-'.join,np.sort(df[['home_team','away_team']],axis=1)))
# this creates values like 'Argentina-Brazil' and 'Chile-Peru'
df[['how_many','teams']].groupby('teams').sum()

This code gave me the following:

                           how_many
teams
Argentina-Brazil                106
Argentina-Chile                  53
Argentina-Paraguay               64
Argentina-Uruguay               177
Austria-Hungary                 137
Belgium-Netherlands             125
Brazil-Chile                     48
Brazil-Paraguay                  58
Brazil-Uruguay                   47
Chile-Peru                       46
Chile-Uruguay                    49
Denmark-Sweden                  107
England-Northern Ireland         99
England-Scotland                117
England-Wales                   104
Kenya-Uganda                     65
Northern Ireland-Scotland        52
Northern Ireland-Wales           50
Norway-Sweden                   107
Scotland-Wales                  106