I want to turn a dataframe from this
to this:
It took me a while to figure out the melt and transpose function to get to this
But I did not get to manage to apply the years from 1990 to 2019 in a repeating manner into for every of the 189 countries.
I tried:
year_list = []
for year in range(1990, 2020,1):
year_list.append(year)
years = pd.Series(year_list)
years
and then
df['year'] = years.repeat(30)
(I need to repeat it 30 times, because the frame consists of 5670 rows = 189 countries * 29 years)
I got this error message:
ValueError: cannot reindex on an axis with duplicate labels
Googling this error does not help.
CodePudding user response:
One approach could be as follows:
Sample data
import pandas as pd
import numpy as np
data = {'country': ['Afghanistan','Angola']}
data.update({k: np.random.rand() for k in range(1990,1993)})
df = pd.DataFrame(data)
print(df)
country 1990 1991 1992
0 Afghanistan 0.103589 0.950523 0.323925
1 Angola 0.103589 0.950523 0.323925
Code
res = (df.set_index('country')
.unstack()
.sort_index(level=1)
.reset_index(drop=False)
.rename(columns={'country': 'geo',
'level_0': 'time',
0: 'hdi_human_development_index'})
)
print(res)
time geo hdi_human_development_index
0 1990 Afghanistan 0.103589
1 1991 Afghanistan 0.950523
2 1992 Afghanistan 0.323925
3 1990 Angola 0.103589
4 1991 Angola 0.950523
5 1992 Angola 0.323925
Explanation
- Use
df.set_index
on columncountry
and applydf.unstack
to add the years from the column names to the index. - Now, we use
df.sort_index
onlevel=1
to get the countries in alphabetical order. - Finally, we use
df.reset_index
withdrop
parameter set toFalse
to get the index back as columns, and we chaindf.rename
to customize the column names.