Why is sum() used when the percentage is computed?-CodePudding

df[percent] = (df['column_name'] / df['column_name'].sum()) * 100

My dataset is:

name_d	year	regime_origin	FLOW	GDP_d
ITALY	1990	0	120	200
ITALY	1991	0	239	500
SPAIN	1990	1	123	399
ITALY	1993	1	120	200

I would like to have the percentage of the FLOW.

I would have:

df[percent] = (df['FLOW'] / df['GDP_d']) * 100

But some source says I have to use .sum(). Why is sum() necessary?

CodePudding user response：

If you're trying to compute the percentage that each value from the column FLOW has relative to the total flow, then you really need to divide each value by the sum of all values:

df[percent] = (df['column_name']/df['column_name'].sum()) * 100

When you have something like:

df[percent] = (df['FLOW'] / df['GDP_d']) * 100

What you're actually calculating is the value that the FLOW of each row has, relative to that same row's GDP_d. So for example, on the first line, a flow of 120 represents 60% of the GDP_d (200). The second line equals 47.80%, and so on. If FLOW and GDP_d columns are relatable in a sense that GDP_d represents the maximum that every element from FLOW could have then it's completely fine to use (df['FLOW'] / df['GDP_d']) * 100.

However if in reality you wanted to find the average percentage of each FLOW observation (observation = row), in respect to its total population (all values of FLOW combined), then you really need to divide each value by the sum of all values.

It's important to note that the sum of all values you get when you divide FLOW by GDP_d won't add up to 100%. In your example, the sum would be equal to 198.62%, and the average of df[percent] would be 49.65%. This means that, on average, the value of each row from FLOW is about 50% of size of GDP_d.

On the other hand, when you calculate the percentage FLOW in respect to the total FLOW population the sum of all values is always equal to 100%.

Here's an comparison:

	name_d	year	regime_origin	FLOW	GDP_d	FLOW/GDP_d	EQUATION(FLOW/GDP_d)	FLOW/SUM(FLOW)	EQUATION(FLOW/SUM(FLOW))	FLOW/SUM(GDP_d)	EQUATION(FLOW/SUM(GDP_d))
0	ITALY	1990	0	120	200	60	120/200	19.93	120/602	9.24	120/1299
1	ITALY	1991	0	239	500	47.8	239/500	39.7	239/602	18.4	239/1299
2	SPAIN	1990	1	123	399	30.83	123/399	20.43	123/602	9.47	123/1299
3	ITALY	1993	1	120	200	60	120/200	19.93	120/602	9.24	120/1299

import pandas as pd

df = pd.DataFrame(
    {
        "name_d": ["ITALY", "ITALY", "SPAIN", "ITALY"],
        "year": [1990, 1991, 1990, 1993],
        "regime_origin": [0, 0, 1, 1],
        "FLOW": [120, 239, 123, 120],
        "GDP_d": [200, 500, 399, 200],
    }
)


df['FLOW/GDP_d'] = ((df['FLOW'] / df['GDP_d']) * 100).round(2)
df['EQUATION(FLOW/GDP_d)'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{row["GDP_d"]:0d}', axis=1)
df['FLOW/SUM(FLOW)'] = ((df['FLOW'] / df['FLOW'].sum()) * 100).round(2)
df['EQUATION(FLOW/SUM(FLOW))'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{df["FLOW"].sum():0d}', axis=1)
df['FLOW/SUM(GDP_d)'] = ((df['FLOW'] / df['GDP_d'].sum()) * 100).round(2)
df['EQUATION(FLOW/SUM(GDP_d))'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{df["GDP_d"].sum():0d}', axis=1)