I am trying to loop over my dataframe and add 3 extra rows for each country in df.con, but the result only contains the new rows for the second country, US, and is missing the ones for UK.
The code is below.
import pandas as pd
d = {'year': [2019,2019,2019,2020,2020,2020],
     'age group': ['(0-14)','(14-50)','(50 )','(0-14)','(14-50)','(50 )'],
     'con': ['UK','UK','UK','US','US','US'],
     'population': [10,20,300,400,1000,2000]}
df = pd.DataFrame(data=d)
df2 = df.copy()
df
   year age group con  population
0  2019    (0-14)  UK          10
1  2019   (14-50)  UK          20
2  2019     (50 )  UK         300
3  2020    (0-14)  US         400
4  2020   (14-50)  US        1000
5  2020     (50 )  US        2000
n_df_2 = df.copy()
con_list = [x for x in df.con]
year_list = [x for x in df.year]
age_list = [x for x in df['age group']]
new_list = ['young vs child','old vs young', 'unemployed vs working']
for country in df.con:
    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]
    bev_child.loc[:,'population'] = bev_work.loc[:,'population'].max() / bev_child.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country + '-' + new_list[0]
    bev_child.loc[:,'age group'] = new_list[0]
    s = n_df_2.append(bev_child, ignore_index=True)
    bev_child.loc[:,'population'] = bev_child.loc[:,'population'].max() + bev_old.loc[:,'population'].max() / bev_work.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country + '-' + new_list[2]
    bev_child.loc[:,'age group'] = new_list[2]
    s = s.append(bev_child, ignore_index=True)
    bev_child.loc[:,'population'] = bev_old.loc[:,'population'].max() / bev_work.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country + '-' + new_list[1]
    bev_child.loc[:,'age group'] = new_list[1]
    s = s.append(bev_child, ignore_index=True)
s
The output is missing the UK rows:
   year              age group                       con  population
0  2019                 (0-14)                        UK        10.0
1  2019                (14-50)                        UK        20.0
2  2019                  (50 )                        UK       300.0
3  2020                 (0-14)                        US       400.0
4  2020                (14-50)                        US      1000.0
5  2020                  (50 )                        US      2000.0
6  2020         young vs child         US-young vs child         2.5
7  2020  unemployed vs working  US-unemployed vs working         4.5
8  2020           old vs young           US-old vs young         2.0
CodePudding user response:
Each time through the loop, s is re-initialized to a new dataframe on this line:
s = n_df_2.append(bev_child, ignore_index=True)
This makes s end up as the original value of n_df_2, plus only the three rows that are appended to it the last time the loop body is executed.
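To see the effect in isolation, here is a minimal sketch that uses plain lists instead of dataframes. The names base, buggy and fixed are made up for illustration and do not appear in the question's code.

```python
base = ["row0", "row1"]

# Buggy pattern: result is rebuilt from the unchanged base on every pass,
# so whatever an earlier iteration added is thrown away.
for country in ["UK", "US"]:
    result = base + [country + "-new"]
buggy = result

# Accumulating pattern: each pass extends what the previous pass produced.
result = list(base)
for country in ["UK", "US"]:
    result = result + [country + "-new"]
fixed = result

print(buggy)  # ['row0', 'row1', 'US-new']
print(fixed)  # ['row0', 'row1', 'UK-new', 'US-new']
```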
I think this is closer to what you want (nothing before the loop changes):
for country in df.con.unique():
    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]
    bev_child.loc[:, 'population'] = bev_work.loc[:, 'population'].max() / bev_child.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[0]
    bev_child.loc[:, 'age group'] = new_list[0]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)
    bev_child.loc[:, 'population'] = (bev_child.loc[:, 'population'].max()
                                      + bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max())
    bev_child.loc[:, 'con'] = country + '-' + new_list[2]
    bev_child.loc[:, 'age group'] = new_list[2]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)
    bev_child.loc[:, 'population'] = bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[1]
    bev_child.loc[:, 'age group'] = new_list[1]
    n_df_2 = n_df_2.append(bev_child, ignore_index=True)
print(n_df_2)
Output:
    year              age group                       con  population
0   2019                 (0-14)                        UK        10.0
1   2019                (14-50)                        UK        20.0
2   2019                  (50 )                        UK       300.0
3   2020                 (0-14)                        US       400.0
4   2020                (14-50)                        US      1000.0
5   2020                  (50 )                        US      2000.0
6   2019         young vs child         UK-young vs child         2.0
7   2019  unemployed vs working  UK-unemployed vs working        17.0
8   2019           old vs young           UK-old vs young        15.0
9   2020         young vs child         US-young vs child         2.5
10  2020  unemployed vs working  US-unemployed vs working         4.5
11  2020           old vs young           US-old vs young         2.0
Note that this only loops through the unique values in df.con, so the loop body only runs twice. Three records are added to the output each time the loop runs. Note also that the output is appended to n_df_2, so there's no need for the variable s.
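One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above will not run on current versions. Below is a rough sketch of the same calculation that collects the new rows in a list and calls pd.concat once. It assumes exactly one row per country and age group (the original takes .max() of each filtered group instead), and the names new_rows, result and the three *_label variables are introduced here for illustration only. The "unemployed vs working" value mirrors the original arithmetic, which adds the already computed "young vs child" ratio to old / working.

```python
import pandas as pd

d = {'year': [2019, 2019, 2019, 2020, 2020, 2020],
     'age group': ['(0-14)', '(14-50)', '(50 )', '(0-14)', '(14-50)', '(50 )'],
     'con': ['UK', 'UK', 'UK', 'US', 'US', 'US'],
     'population': [10, 20, 300, 400, 1000, 2000]}
df = pd.DataFrame(data=d)

# Age-group labels in order of first appearance (assumed: child, working, old).
child_label, work_label, old_label = list(df['age group'].unique())

new_rows = []
for country, grp in df.groupby('con', sort=False):
    # Assumption: one row per age group within each country.
    pop = grp.set_index('age group')['population']
    year = grp['year'].iloc[0]
    young_vs_child = pop[work_label] / pop[child_label]
    old_vs_young = pop[old_label] / pop[work_label]
    # Same arithmetic as the loop above: previous ratio + old / working.
    unemployed_vs_working = young_vs_child + old_vs_young
    for label, value in [('young vs child', young_vs_child),
                         ('unemployed vs working', unemployed_vs_working),
                         ('old vs young', old_vs_young)]:
        new_rows.append({'year': year, 'age group': label,
                         'con': country + '-' + label, 'population': value})

# Concatenate once instead of appending inside the loop.
result = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(result)
```

Building the rows as dictionaries also avoids assigning into a filtered slice of the dataframe, which is what can trigger pandas' SettingWithCopyWarning in the loop version.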