I have a data frame of say the share of sales for 3 products (millions of lines in real case) by country and I want the share of sales by continent for each product.
Data are like this one:
df = pd.DataFrame({"fr": [.2, .1, .3], "ge": [.2, .2, .2], "uk": [.2, .3, .2], "us": [.2, .2, .1], "br": [.2, .2, .2]})
And I have a table associating each country to a continent like this one:
country = pd.DataFrame({"continent": ["Europe", "Europe", "Europe", "Europe", "Europe", "Americas", "Americas", "Americas", "Asia", "Asia"], "country": ["fr", "ge", "uk", "es", "pt", "us", "br", "ca", "cn", "jp"]})
How can I do this simply?
I thought about transposing, merging, summing and transposing back but I guess there are simpler ways…
Thank you
CodePudding user response:
I tried my first guess (using transpose) and works well too:
df_tmp = df.transpose().merge(country, how="left", left_index=True, right_on="country")
Output:
0 1 2 continent country
0 0.20 0.10 0.30 Europe fr
1 0.20 0.20 0.20 Europe ge
2 0.20 0.30 0.20 Europe uk
5 0.20 0.20 0.10 Americas us
6 0.20 0.20 0.20 Americas br
Then:
df_tmp.groupby("continent").sum().transpose()
CodePudding user response:
First, transform the first Dataframe:
df = df.unstack().reset_index()
Output:
level_0 level_1 0
0 fr 0 0.2
1 fr 1 0.1
2 fr 2 0.3
3 ge 0 0.2
4 ge 1 0.2
5 ge 2 0.2
6 uk 0 0.2
7 uk 1 0.3
8 uk 2 0.2
9 us 0 0.2
10 us 1 0.2
11 us 2 0.1
12 br 0 0.2
13 br 1 0.2
14 br 2 0.2
Rename the columns:
df = df.rename(
{"level_0": "country", "level_1": "product", 0: "sales"},
axis=1
)
Output:
country product sales
0 fr 0 0.2
1 fr 1 0.1
2 fr 2 0.3
3 ge 0 0.2
4 ge 1 0.2
5 ge 2 0.2
6 uk 0 0.2
7 uk 1 0.3
8 uk 2 0.2
9 us 0 0.2
10 us 1 0.2
11 us 2 0.1
12 br 0 0.2
13 br 1 0.2
14 br 2 0.2
Then merge with the continent data:
merged = df.merge(country, on="country")
Output:
country product sales continent
0 fr 0 0.2 Europe
1 fr 1 0.1 Europe
2 fr 2 0.3 Europe
3 ge 0 0.2 Europe
4 ge 1 0.2 Europe
5 ge 2 0.2 Europe
6 uk 0 0.2 Europe
7 uk 1 0.3 Europe
8 uk 2 0.2 Europe
9 us 0 0.2 Americas
10 us 1 0.2 Americas
11 us 2 0.1 Americas
12 br 0 0.2 Americas
13 br 1 0.2 Americas
14 br 2 0.2 Americas
Then group by continent and other column and sum:
continent_sums = merged.groupby(["continent", "product"]).sum()
Output:
sales
continent product
Americas 0 0.4
1 0.4
2 0.3
Europe 0 0.6
1 0.6
2 0.7
To get the share, divide by the total sum:
shares = continent_sums / merged["sales"].sum()
Output:
sales
continent product
Americas 0 0.133333
1 0.133333
2 0.100000
Europe 0 0.200000
1 0.200000
2 0.233333