I have a dataframe that looks like this:
import pandas as pd
df_dict = {'country': ['Japan','Japan','Japan','Japan','Japan','Japan','Japan', 'Greece','Greece','Greece','Greece','Greece','Greece','Greece'],
           'year': [1970, 1982, 1999, 2014, 2017, 2018, 2021, 1981, 1987, 2002, 2015, 2018, 2019, 2021],
           'value': [320, 416, 172, 652, 390, 570, 803, 144, 273, 129, 477, 831, 664, 117]}
df = pd.DataFrame(df_dict)
country year value
0 Japan 1970 320
1 Japan 1982 416
2 Japan 1999 172
3 Japan 2014 652
4 Japan 2017 390
5 Japan 2018 570
6 Japan 2021 803
7 Greece 1981 144
8 Greece 1987 273
9 Greece 2002 129
10 Greece 2015 477
11 Greece 2018 831
12 Greece 2019 664
13 Greece 2021 117
I am trying to group the data by year from 2014 onwards, but I can't seem to get it right using groupby(['country','year'])['value'].
In practice, I want a running total of the values for each country, for each year greater than or equal to 2014. So my expected output should look something like this:
country year value
0 Japan 2014 1560
1 Japan 2015 1560
2 Japan 2016 1560
3 Japan 2017 1950
4 Japan 2018 2520
5 Japan 2019 2520
6 Japan 2020 2520
7 Japan 2021 3323
8 Greece 2014 546
9 Greece 2015 1023
10 Greece 2016 1023
11 Greece 2017 1023
12 Greece 2018 1854
13 Greece 2019 2518
14 Greece 2020 2518
15 Greece 2021 2635
Here the value for Japan in 2014 is the sum of all values where year <= 2014, the value for Japan in 2015 is the sum of all values where year <= 2015, and so on. The last year I would like to sum is 2021, for all countries in the dataframe.
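As a quick sanity check on the numbers above, here is a minimal sketch of the rule being described. The helper name `running_total` is only illustrative (it is not a pandas function), and the sample data is the same as at the top of the question.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Japan'] * 7 + ['Greece'] * 7,
    'year': [1970, 1982, 1999, 2014, 2017, 2018, 2021,
             1981, 1987, 2002, 2015, 2018, 2019, 2021],
    'value': [320, 416, 172, 652, 390, 570, 803,
              144, 273, 129, 477, 831, 664, 117],
})

def running_total(country, year):
    # Hypothetical helper: sum of every value for `country` with year <= `year`.
    mask = (df['country'] == country) & (df['year'] <= year)
    return int(df.loc[mask, 'value'].sum())

print(running_total('Japan', 2014))   # 320 + 416 + 172 + 652 = 1560
print(running_total('Japan', 2015))   # no 2015 row for Japan, so still 1560
print(running_total('Greece', 2015))  # 144 + 273 + 129 + 477 = 1023
```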
CodePudding user response:
First create a MultiIndex of every country and year with MultiIndex.from_product, then raise all years below 2014 to 2014 with Series.clip and aggregate with sum, add the missing years with Series.reindex, and finally take the cumulative sum per country with GroupBy.cumsum:
mux = pd.MultiIndex.from_product([df['country'].unique(), range(2014, df['year'].max() + 1)],
                                 names=['country','year'])
df = (df.groupby(['country', df['year'].clip(lower=2014)])['value']
        .sum()
        .reindex(mux, fill_value=0)
        .groupby(level=0)
        .cumsum()
        .reset_index())
print(df)
country year value
0 Japan 2014 1560
1 Japan 2015 1560
2 Japan 2016 1560
3 Japan 2017 1950
4 Japan 2018 2520
5 Japan 2019 2520
6 Japan 2020 2520
7 Japan 2021 3323
8 Greece 2014 546
9 Greece 2015 1023
10 Greece 2016 1023
11 Greece 2017 1023
12 Greece 2018 1854
13 Greece 2019 2518
14 Greece 2020 2518
15 Greece 2021 2635
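To make the chain easier to follow, the same steps can be split into intermediate results. This is only a sketch of the approach above; the variable names `clipped`, `totals`, `filled` and `out` are my own, and the sample data is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Japan'] * 7 + ['Greece'] * 7,
    'year': [1970, 1982, 1999, 2014, 2017, 2018, 2021,
             1981, 1987, 2002, 2015, 2018, 2019, 2021],
    'value': [320, 416, 172, 652, 390, 570, 803,
              144, 273, 129, 477, 831, 664, 117],
})

# Step 1: every year before 2014 is treated as 2014, so older values
# land in the 2014 bucket.
clipped = df['year'].clip(lower=2014)

# Step 2: one total per (country, clipped year) pair.
totals = df.groupby(['country', clipped])['value'].sum()

# Step 3: build the full country x year grid and fill years without data with 0.
mux = pd.MultiIndex.from_product(
    [df['country'].unique(), range(2014, df['year'].max() + 1)],
    names=['country', 'year'])
filled = totals.reindex(mux, fill_value=0)

# Step 4: running total within each country.
out = filled.groupby(level=0).cumsum().reset_index()
print(out)
```

Filling the gaps with 0 before the cumulative sum is what makes years such as 2015 and 2016 for Japan repeat the previous total.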
CodePudding user response:
If you don't mind creating a new dataframe, you can consider my code below as an alternative.
Iterate over the list of countries and years, and for each combination, calculate the cumulative sum of the value column up to and including that year. You can do this by filtering the dataframe to include only rows with the current country and year <= the current year, applying the cumsum() method and taking its last element. Lastly, append the resulting row to the empty dataframe.
years = list(range(2014, 2022))
countries = df['country'].unique()
result_df = pd.DataFrame(columns=['country', 'year', 'value'])
for country in countries:
    for year in years:
        df_filtered = df[(df['country'] == country) & (df['year'] <= year)]
        cumulative_sum = df_filtered['value'].cumsum().iloc[-1]
        result_df = pd.concat([result_df,
                               pd.DataFrame({'country': country, 'year': year, 'value': cumulative_sum}, index=[0])],
                              ignore_index=True)
output:
> result_df
country year value
0 Japan 2014 1560
1 Japan 2015 1560
2 Japan 2016 1560
3 Japan 2017 1950
4 Japan 2018 2520
5 Japan 2019 2520
6 Japan 2020 2520
7 Japan 2021 3323
8 Greece 2014 546
9 Greece 2015 1023
10 Greece 2016 1023
11 Greece 2017 1023
12 Greece 2018 1854
13 Greece 2019 2518
14 Greece 2020 2518
15 Greece 2021 2635
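One caveat with the loop: `.cumsum().iloc[-1]` raises an IndexError when the filter is empty, for example if a country had no rows at or before 2014. A possible variant, sketched below under the assumption that an empty filter should count as 0, uses `.sum()` instead and collects plain dicts so the dataframe is built once rather than concatenated on every iteration. The names `rows` and `mask` are my own.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Japan'] * 7 + ['Greece'] * 7,
    'year': [1970, 1982, 1999, 2014, 2017, 2018, 2021,
             1981, 1987, 2002, 2015, 2018, 2019, 2021],
    'value': [320, 416, 172, 652, 390, 570, 803,
              144, 273, 129, 477, 831, 664, 117],
})

# Derive the last year from the data instead of hard-coding 2022.
years = range(2014, df['year'].max() + 1)

rows = []
for country in df['country'].unique():
    for year in years:
        mask = (df['country'] == country) & (df['year'] <= year)
        # .sum() of an empty selection is 0, so no IndexError here.
        rows.append({'country': country,
                     'year': year,
                     'value': int(df.loc[mask, 'value'].sum())})

# Build the result in one go from the collected rows.
result_df = pd.DataFrame(rows, columns=['country', 'year', 'value'])
print(result_df)
```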