Home > Software design >  Why is sum() used when the percentage is computed?
Why is sum() used when the percentage is computed?

Time:09-27

df[percent] = (df['column_name'] / df['column_name'].sum()) * 100

My dataset is:

name_d year regime_origin FLOW GDP_d
ITALY 1990 0 120 200
ITALY 1991 0 239 500
SPAIN 1990 1 123 399
ITALY 1993 1 120 200

I would like to have the percentage of the FLOW.

I would have:

df[percent] = (df['FLOW'] / df['GDP_d']) * 100

But some source says I have to use .sum(). Why is sum() necessary?

CodePudding user response:

If you're trying to compute the percentage that each value from the column FLOW has relative to the total flow, then you really need to divide each value by the sum of all values:

df[percent] = (df['column_name']/df['column_name'].sum()) * 100

When you have something like:

df[percent] = (df['FLOW'] / df['GDP_d']) * 100

What you're actually calculating is the value that the FLOW of each row has, relative to that same row's GDP_d. So for example, on the first line, a flow of 120 represents 60% of the GDP_d (200). The second line equals 47.80%, and so on. If FLOW and GDP_d columns are relatable in a sense that GDP_d represents the maximum that every element from FLOW could have then it's completely fine to use (df['FLOW'] / df['GDP_d']) * 100.

However if in reality you wanted to find the average percentage of each FLOW observation (observation = row), in respect to its total population (all values of FLOW combined), then you really need to divide each value by the sum of all values.

It's important to note that the sum of all values you get when you divide FLOW by GDP_d won't add up to 100%. In your example, the sum would be equal to 198.62%, and the average of df[percent] would be 49.65%. This means that, on average, the value of each row from FLOW is about 50% of size of GDP_d.

On the other hand, when you calculate the percentage FLOW in respect to the total FLOW population the sum of all values is always equal to 100%.

Here's an comparison:

name_d year regime_origin FLOW GDP_d FLOW/GDP_d EQUATION(FLOW/GDP_d) FLOW/SUM(FLOW) EQUATION(FLOW/SUM(FLOW)) FLOW/SUM(GDP_d) EQUATION(FLOW/SUM(GDP_d))
0 ITALY 1990 0 120 200 60 120/200 19.93 120/602 9.24 120/1299
1 ITALY 1991 0 239 500 47.8 239/500 39.7 239/602 18.4 239/1299
2 SPAIN 1990 1 123 399 30.83 123/399 20.43 123/602 9.47 123/1299
3 ITALY 1993 1 120 200 60 120/200 19.93 120/602 9.24 120/1299
import pandas as pd

df = pd.DataFrame(
    {
        "name_d": ["ITALY", "ITALY", "SPAIN", "ITALY"],
        "year": [1990, 1991, 1990, 1993],
        "regime_origin": [0, 0, 1, 1],
        "FLOW": [120, 239, 123, 120],
        "GDP_d": [200, 500, 399, 200],
    }
)


df['FLOW/GDP_d'] = ((df['FLOW'] / df['GDP_d']) * 100).round(2)
df['EQUATION(FLOW/GDP_d)'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{row["GDP_d"]:0d}', axis=1)
df['FLOW/SUM(FLOW)'] = ((df['FLOW'] / df['FLOW'].sum()) * 100).round(2)
df['EQUATION(FLOW/SUM(FLOW))'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{df["FLOW"].sum():0d}', axis=1)
df['FLOW/SUM(GDP_d)'] = ((df['FLOW'] / df['GDP_d'].sum()) * 100).round(2)
df['EQUATION(FLOW/SUM(GDP_d))'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{df["GDP_d"].sum():0d}', axis=1)

  • Related