df[percent] = (df['column_name'] / df['column_name'].sum()) * 100
My dataset is:
name_d | year | regime_origin | FLOW | GDP_d |
---|---|---|---|---|
ITALY | 1990 | 0 | 120 | 200 |
ITALY | 1991 | 0 | 239 | 500 |
SPAIN | 1990 | 1 | 123 | 399 |
ITALY | 1993 | 1 | 120 | 200 |
I would like to have the percentage of the FLOW.
I would have:
df[percent] = (df['FLOW'] / df['GDP_d']) * 100
But some source says I have to use .sum(). Why is sum() necessary?
CodePudding user response:
If you're trying to compute the percentage that each value from the column FLOW
has relative to the total flow, then you really need to divide each value by the sum of all values:
df[percent] = (df['column_name']/df['column_name'].sum()) * 100
When you have something like:
df[percent] = (df['FLOW'] / df['GDP_d']) * 100
What you're actually calculating is the value that the FLOW
of each row has, relative to that same row's GDP_d
. So for example, on the first line, a flow of 120 represents 60% of the GDP_d
(200). The second line equals 47.80%, and so on. If FLOW
and GDP_d
columns are relatable in a sense that GDP_d
represents the maximum that every element from FLOW
could have then it's completely fine to use (df['FLOW'] / df['GDP_d']) * 100
.
However if in reality you wanted to find the average percentage of each FLOW
observation (observation = row), in respect to its total population (all values of FLOW
combined), then you really need to divide each value by the sum of all values.
It's important to note that the sum of all values you get when you divide FLOW
by GDP_d
won't add up to 100%. In your example, the sum would be equal to 198.62%, and the average of df[percent]
would be 49.65%. This means that, on average, the value of each row from FLOW
is about 50% of size of GDP_d
.
On the other hand, when you calculate the percentage FLOW
in respect to the total FLOW
population the sum of all values is always equal to 100%.
Here's an comparison:
name_d | year | regime_origin | FLOW | GDP_d | FLOW/GDP_d | EQUATION(FLOW/GDP_d) | FLOW/SUM(FLOW) | EQUATION(FLOW/SUM(FLOW)) | FLOW/SUM(GDP_d) | EQUATION(FLOW/SUM(GDP_d)) | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ITALY | 1990 | 0 | 120 | 200 | 60 | 120/200 | 19.93 | 120/602 | 9.24 | 120/1299 |
1 | ITALY | 1991 | 0 | 239 | 500 | 47.8 | 239/500 | 39.7 | 239/602 | 18.4 | 239/1299 |
2 | SPAIN | 1990 | 1 | 123 | 399 | 30.83 | 123/399 | 20.43 | 123/602 | 9.47 | 123/1299 |
3 | ITALY | 1993 | 1 | 120 | 200 | 60 | 120/200 | 19.93 | 120/602 | 9.24 | 120/1299 |
import pandas as pd
df = pd.DataFrame(
{
"name_d": ["ITALY", "ITALY", "SPAIN", "ITALY"],
"year": [1990, 1991, 1990, 1993],
"regime_origin": [0, 0, 1, 1],
"FLOW": [120, 239, 123, 120],
"GDP_d": [200, 500, 399, 200],
}
)
df['FLOW/GDP_d'] = ((df['FLOW'] / df['GDP_d']) * 100).round(2)
df['EQUATION(FLOW/GDP_d)'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{row["GDP_d"]:0d}', axis=1)
df['FLOW/SUM(FLOW)'] = ((df['FLOW'] / df['FLOW'].sum()) * 100).round(2)
df['EQUATION(FLOW/SUM(FLOW))'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{df["FLOW"].sum():0d}', axis=1)
df['FLOW/SUM(GDP_d)'] = ((df['FLOW'] / df['GDP_d'].sum()) * 100).round(2)
df['EQUATION(FLOW/SUM(GDP_d))'] = df.apply(lambda row: f'{row["FLOW"]:0d}/{df["GDP_d"].sum():0d}', axis=1)