I am trying to create a panel data frame in Python, e.g. for 5 countries (A, B, C, D, E) each with 3 years of data (2000, 2001, 2002).
import numpy as np
import pandas as pd
df = {'id': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
      'country': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
      'year': [2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002]
      }
df = pd.DataFrame(df)
df
To extend this to bigger datasets, I am trying to build the same result with a loop, using the code below, but it does not give me the desired data frame.
n_country = 5 # number of countries
n_year = 3 # number of years of data for each country
columns = ("id", "country", "year")
n_rows = n_country*n_year
data = pd.DataFrame(np.empty(shape = (n_rows, 3)), columns = columns)
# set country numbers which will identify each country, create country id ranging from 1 to 5
country_id = range(1, 1 + n_country)
list(country_id)
# create year from 2000 to 2002
year = range(2000, 2000 + n_year)
list(year)
# create dictionary that maps from country id to country name
country_name = dict(zip(country_id, ['A', 'B', 'C', 'D', 'E']))
country_name
# loop starts here
i = 0
for id in country_id:
    for country in ["A", "B", "C", "D", "E"]:
        for year in [2000, 2001, 2002]:
            data.loc[i, "id"] = id
            data.loc[i, "year"] = year
            data.loc[i, "country"] = country_name[id]
            i =+ 1
The resulting data frame is not what is intended.
I would very much appreciate it if any user could point out the mistake in the loop above.
Thank you!
CodePudding user response:
I would use itertools.product to build every year/country combination, then use cat.codes to give each country a numeric id.
from itertools import product
import pandas as pd
start_year = 2000
end_year = 2003
countries = ['A','B','C','D','E']
df = pd.DataFrame(list(product(range(start_year, end_year + 1), countries)), columns=['year', 'country'])
df['id'] = df.country.astype('category').cat.codes + 1
print(df)
Output:
    year country  id
0   2000       A   1
1   2000       B   2
2   2000       C   3
3   2000       D   4
4   2000       E   5
5   2001       A   1
6   2001       B   2
7   2001       C   3
8   2001       D   4
9   2001       E   5
10  2002       A   1
11  2002       B   2
12  2002       C   3
13  2002       D   4
14  2002       E   5
15  2003       A   1
16  2003       B   2
17  2003       C   3
18  2003       D   4
19  2003       E   5
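If the goal is exactly the layout from the question (three years, 2000 to 2002, with the rows grouped by country), the same idea could be wrapped in a small helper. This is only a sketch: `make_panel` and its parameters are hypothetical names, not part of pandas, and it numbers the countries by their position in the list instead of using cat.codes.

```python
from itertools import product

import pandas as pd


def make_panel(countries, start_year, n_years):
    """Hypothetical helper: one row per (country, year) pair."""
    years = range(start_year, start_year + n_years)
    # product varies its last argument fastest, so rows come out grouped by country.
    rows = [
        (c_id, country, year)
        for (c_id, country), year in product(enumerate(countries, start=1), years)
    ]
    return pd.DataFrame(rows, columns=["id", "country", "year"])


panel = make_panel(["A", "B", "C", "D", "E"], start_year=2000, n_years=3)
```

Note that with end_year = 2003 the snippet above produces four years per country (20 rows); setting end_year = 2002 there would give the 15-row, three-year panel described in the question.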
As for your current loop, you may want to zip id and country so that each pair is reused for every iteration of the year loop. The counter also needs to be incremented with i += 1, not i =+ 1, which assigns +1 to i on every pass.
n_country = 5 # number of countries
n_year = 3 # number of years of data for each country
columns = ("id", "country", "year")
n_rows = n_country*n_year
# object dtype so the columns can hold strings (country) as well as integers
data = pd.DataFrame(np.empty(shape=(n_rows, 3), dtype=object), columns=columns)
# set country numbers which will identify each country, create country id ranging from 1 to 5
country_id = range(1, 1 + n_country)
list(country_id)
# create year from 2000 to 2002
year = range(2000, 2000 + n_year)
list(year)
# create dictionary that maps from country id to country name
country_name = dict(zip(country_id, ['A', 'B', 'C', 'D', 'E']))
country_name
# loop starts here
i = 0
for c_id, country in zip(country_id, ["A", "B", "C", "D", "E"]):
    print(c_id, country)
    for year in [2000, 2001, 2002]:
        data.loc[i, "id"] = c_id
        data.loc[i, "year"] = year
        data.loc[i, "country"] = country
        i += 1  # advance to the next row
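To see why the original loop misbehaves, it may help to look at the two problems in isolation. The snippet below is an illustrative sketch rather than the asker's exact code: it shows that `i =+ 1` is parsed as `i = (+1)`, so the counter never advances, and that a separate loop over country names multiplies the number of iterations instead of pairing each id with its own name.

```python
# `i =+ 1` assigns +1 on every pass; it is not an increment.
i = 0
for _ in range(3):
    i =+ 1
stuck = i  # still 1 after three passes

# `j += 1` adds one on every pass.
j = 0
for _ in range(3):
    j += 1
advanced = j  # 3 after three passes

ids = range(1, 6)
names = ["A", "B", "C", "D", "E"]
years = [2000, 2001, 2002]

# Three nested loops visit every id/name/year combination.
nested_rows = sum(1 for _ in ids for _ in names for _ in years)

# zip pairs each id with its own name, so only the year loop is nested inside.
zipped_rows = sum(1 for _ in zip(ids, names) for _ in years)
```

With five countries and three years, the nested version runs 75 times while the zipped version runs 15 times, which matches the 15 rows allocated for the frame.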