I have a dataframe with some price indices across 5 years, from 2017 to 2021. It looks like this:
Country | Industry | Year | Index |
---|---|---|---|
US | Agriculture | 2017 | 83 |
US | Agriculture | 2018 | 97.2 |
US | Agriculture | 2019 | 100 |
US | Agriculture | 2020 | 112 |
US | Agriculture | 2021 | 108 |
Japan | Mining | 2017 | 88 |
Japan | Mining | 2018 | 93 |
Japan | Mining | 2019 | 100 |
Japan | Mining | 2020 | 104 |
Japan | Mining | 2021 | 112 |
My base year is 2019, hence the Index for every row tagged with 2019 is 100. Everything else moves up or down. I want to generate another column called Percentage_Change
showing the year on year change starting from 2019 as the base year.
I tried using the pd.series.pct_change
function, however, that calculates the year on year percentage change starting with 2017 and it generates an NaN
value for all rows where the year is 2017, instead of 2019 which should be the base year.
I want the output to look like this:
Country | Industry | Year | Index | Percentage_change |
---|---|---|---|---|
Japan | Mining | 2017 | 88 | -5.37% |
Japan | Mining | 2018 | 93 | -7% |
Japan | Mining | 2019 | 100 | 0 |
Japan | Mining | 2020 | 104 | 4% |
Japan | Mining | 2021 | 112 | 7.69% |
The percentage_change
for Japan between 2021 and 2020 is (112-104)/104 = 7.69%, the difference between 2020 and 2019 is (104-100)/100 = 4%, the difference between 2018 and 2019 is (93-100)/100 = -7%, the difference between 2017 and 2018 is (88-93)/93 = -5.37%
Is there any other way of calculating % change in pandas?
CodePudding user response:
pct_change
is computing a change relative to the previous value (which is why 2017 is NaN), and this doesn't seem to be what you want. If you want to compute a percentage change relative to 2019, as 2019 is already normalized to 100, simply subtract 100:
df['Percentage_Change'] = df['Index'].sub(100)
output:
Country Industry Year Index Percentage_Change
0 US Agriculture 2017 83.0 -17.0
1 US Agriculture 2018 97.2 -2.8
2 US Agriculture 2019 100.0 0.0
3 US Agriculture 2020 112.0 12.0
4 US Agriculture 2021 108.0 8.0
5 Japan Mining 2017 88.0 -12.0
6 Japan Mining 2018 93.0 -7.0
7 Japan Mining 2019 100.0 0.0
8 Japan Mining 2020 104.0 4.0
9 Japan Mining 2021 112.0 12.0
bidirectional pct_change
If you want a bidirectional pct_change
"centered" on 100, you can use masks to compute the pct_change
both ways:
df['Percentage_Change'] = (df
.assign(ref=df['Year'].eq(2019))
.groupby(['Country', 'Industry'], group_keys=False)
.apply(lambda g: g['Index'].where(g['ref'].cummax()).pct_change()
.fillna(g['Index'][::-1].pct_change().mask(g['ref'].cummax(), 0))
)
)
output:
Country Industry Year Index Percentage_Change
0 US Agriculture 2017 83.0 -0.146091
1 US Agriculture 2018 97.2 -0.028000
2 US Agriculture 2019 100.0 0.000000
3 US Agriculture 2020 112.0 0.120000
4 US Agriculture 2021 108.0 -0.035714
5 Japan Mining 2017 88.0 -0.053763
6 Japan Mining 2018 93.0 -0.070000
7 Japan Mining 2019 100.0 0.000000
8 Japan Mining 2020 104.0 0.040000
9 Japan Mining 2021 112.0 0.076923