Pandas: Simple way of summing values of columns of same category-CodePudding

I have a data frame of say the share of sales for 3 products (millions of lines in real case) by country and I want the share of sales by continent for each product.

Data are like this one:

df = pd.DataFrame({"fr": [.2, .1, .3], "ge": [.2, .2, .2], "uk": [.2, .3, .2], "us": [.2, .2, .1], "br": [.2, .2, .2]})

And I have a table associating each country to a continent like this one:

country = pd.DataFrame({"continent": ["Europe", "Europe", "Europe", "Europe", "Europe", "Americas", "Americas", "Americas", "Asia", "Asia"], "country": ["fr", "ge", "uk", "es", "pt", "us", "br", "ca", "cn", "jp"]})

How can I do this simply?

I thought about transposing, merging, summing and transposing back but I guess there are simpler ways…

Thank you

CodePudding user response：

I tried my first guess (using transpose) and works well too:

df_tmp = df.transpose().merge(country, how="left", left_index=True, right_on="country")

Output:

    0       1       2       continent   country
0   0.20    0.10    0.30    Europe      fr
1   0.20    0.20    0.20    Europe      ge
2   0.20    0.30    0.20    Europe      uk
5   0.20    0.20    0.10    Americas    us
6   0.20    0.20    0.20    Americas    br

Then:

df_tmp.groupby("continent").sum().transpose()

CodePudding user response：

First, transform the first Dataframe:

df = df.unstack().reset_index()

Output:

   level_0  level_1    0
0       fr        0  0.2
1       fr        1  0.1
2       fr        2  0.3
3       ge        0  0.2
4       ge        1  0.2
5       ge        2  0.2
6       uk        0  0.2
7       uk        1  0.3
8       uk        2  0.2
9       us        0  0.2
10      us        1  0.2
11      us        2  0.1
12      br        0  0.2
13      br        1  0.2
14      br        2  0.2

Rename the columns:

df = df.rename(
  {"level_0": "country", "level_1": "product", 0: "sales"},
  axis=1
)

Output:

   country  product  sales
0       fr        0    0.2
1       fr        1    0.1
2       fr        2    0.3
3       ge        0    0.2
4       ge        1    0.2
5       ge        2    0.2
6       uk        0    0.2
7       uk        1    0.3
8       uk        2    0.2
9       us        0    0.2
10      us        1    0.2
11      us        2    0.1
12      br        0    0.2
13      br        1    0.2
14      br        2    0.2

Then merge with the continent data:

merged = df.merge(country, on="country")

Output:

   country  product  sales continent
0       fr        0    0.2    Europe
1       fr        1    0.1    Europe
2       fr        2    0.3    Europe
3       ge        0    0.2    Europe
4       ge        1    0.2    Europe
5       ge        2    0.2    Europe
6       uk        0    0.2    Europe
7       uk        1    0.3    Europe
8       uk        2    0.2    Europe
9       us        0    0.2  Americas
10      us        1    0.2  Americas
11      us        2    0.1  Americas
12      br        0    0.2  Americas
13      br        1    0.2  Americas
14      br        2    0.2  Americas

Then group by continent and other column and sum:

continent_sums = merged.groupby(["continent", "product"]).sum()

Output:

                   sales
continent product
Americas  0        0.4
          1        0.4
          2        0.3
Europe    0        0.6
          1        0.6
          2        0.7

To get the share, divide by the total sum:

shares = continent_sums / merged["sales"].sum()

Output:

                   sales
continent product
Americas  0        0.133333
          1        0.133333
          2        0.100000
Europe    0        0.200000
          1        0.200000
          2        0.233333