Nested for loop filtering inner loop based on outer loop and appending dataframe-CodePudding

I am trying to build a dataframe that combines individual dataframes of county-level high school enrollment projections generated in a for loop.

I can do this for a single county, based on this SO question. It works great. My goal now is to do a nested for loop that would take multiple county FIPS codes, filter the inner loop on that, and generate an 11-row dataframe that would then be appended to a master dataframe. For three counties, for example, the final dataframe would be 33 rows.

But I haven't been able to get it right. I've tried to model on this SO question and answer.

This is my starting dataframe:

df = pd.DataFrame({"year": ['2020_21', '2020_21','2020_21'],
    "county_fips" : ['06019','06021','06023'] , 
    "grade11" : [5000,2000,2000],
    "grade12": [5200,2200,2200],
    "grade11_chg": [1.01,1.02,1.03],
    "grade11_12_ratio": [0.9,0.8,0.87]})
df

This is my code with the nested loops. My intent is to run through the county codes in the outer loop and the projection year calculations in the inner loop.

projection_years=['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

for i in df['county_fips'].unique():
    print(i)
    grade11_change=df.iloc[0]['grade11_chg']
    grade11_12_ratio=df.iloc[0]['grade11_12_ratio']
    full_name=[]
        
    for year in projection_years:
        #print(year)
        df_select=df[df['county_fips']==i]
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row = {}
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
        full_name.append(df_select)

df_final=pd.concat(full_name)
df_final=df_final[['year','county_fips','grade11','grade12']]
   
print('Finished processing')

But I end up with NaN values and repeating years. Below shows my desired output (I built this in Excel and the numbers reflect rounding. (Update - this corrects the original df_final_goal .)

df_final_goal=pd.DataFrame({'year': {0: '2020_21',  1: '2021_22',  2: '2022_23',  3: '2023_24',  4: '2024_25',  5: '2025_26',
  6: '2026_27',  7: '2027_28',  8: '2028_29',  9: '2029_30',  10: '2030_31',  11: '2020_21',  12: '2021_22',  13: '2022_23',
  14: '2023_24',  15: '2024_25',  16: '2025_26',  17: '2026_27',  18: '2027_28',  19: '2028_29',  20: '2029_30',  21: '2030_31',
  22: '2020_21',  23: '2021_22',  24: '2022_23',  25: '2023_24',  26: '2024_25',  27: '2025_26',  28: '2026_27',  29: '2027_28',
  30: '2028_29',  31: '2029_30',  32: '2030_31'},
 'county_fips': {0: '06019',  1: '06019',  2: '06019',  3: '06019',  4: '06019',  5: '06019',  6: '06019',  7: '06019',  8: '06019',
  9: '06019',  10: '06019',  11: '06021',  12: '06021',  13: '06021',  14: '06021',  15: '06021',  16: '06021',  17: '06021',  18: '06021',
  19: '06021',  20: '06021',  21: '06021',  22: '06023',  23: '06023',  24: '06023',  25: '06023',  26: '06023',  27: '06023',
  28: '06023',  29: '06023',  30: '06023',  31: '06023',  32: '06023'},
'grade11': {0: 5000,  1: 5050,  2: 5101,  3: 5152,  4: 5203,  5: 5255,  6: 5308,  7: 5361,  8: 5414,  9: 5468, 10: 5523,
  11: 2000,  12: 2040,  13: 2081,  14: 2122,  15: 2165,  16: 2208,  17: 2252,  18: 2297,  19: 2343,  20: 2390,  21: 2438,
  22: 2000,  23: 2060,  24: 2122,  25: 2185,  26: 2251,  27: 2319,  28: 2388,  29: 2460,  30: 2534,  31: 2610,  32: 2688},
 'grade12': {0: 5200,  1: 4500,  2: 4545,  3: 4590,  4: 4636,  5: 4683,  6: 4730,  7: 4777,  8: 4825,  9: 4873,  10: 4922,
  11: 2200,  12: 1600,  13: 1632,  14: 1665,  15: 1698,  16: 1732,  17: 1767,  18: 1802,  19: 1838,  20: 1875,  21: 1912,
  22: 2200,  23: 1740,  24: 1792,  25: 1846,  26: 1901,  27: 1958,  28: 2017,  29: 2078,  30: 2140,  31: 2204,  32: 2270}})

Thanks for any assistance.

CodePudding user response：

there are some bugs in the code, this code seems to produce the result you expect (the final dataframe is currently not consistent with the initial one):

projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

full_name = []
for i in df['county_fips'].unique():
    print(i)
    grade11_change = df.iloc[0]['grade11_chg']
    grade11_12_ratio = df.iloc[0]['grade11_12_ratio']
    df_select = df[df['county_fips']==i]
        
    for year in projection_years:
        #print(year)
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
    full_name.append(df_select)

df_final = pd.concat(full_name)
df_final = df_final[['year','county_fips','grade11','grade12']].reset_index()
   
print('Finished processing')

fixes:

full_name initialized before the outer loop
do not redefine df_select in the inner loop
row was initialized twice inside the inner loop
full_name.append moved outside of the inner loop and after it
added reset_index() to df_final (mostly cosmetic)

CodePudding user response：

Breaking this up into separate functions helps with troubleshooting.

Here's my solution incorporating itertools.product and pandas.shift a couple utility functions

import pandas as pd
from itertools import product


def expand_grade11(
    grade11: int,
    grade11_chg: float,
    len_projection_years: int
) -> list:
    """
    Calculate `grade11` values based on current
    `grade11`, `grade11_chg`, and number of
    `projection_years`.
    """
        
    list_of_vals = []
    while len(list_of_vals) < len_projection_years:
        grade11 = int(grade11 * grade11_chg)
        list_of_vals.append(grade11)
        
    return list_of_vals


def expand_grade12(
    grade11_vals: list,
    grade11_12_ratio: float
) -> list:
    """
    Calculate `grade12` values based 
    on previous `grade11` value and 
    `grade11_12_ratio`.
    """
    
    list_of_vals = []
    for grade11 in grade11_vals:
        grade12 = int(grade11 * grade11_12_ratio)
        list_of_vals.append(grade12)
    
    return list_of_vals


# initial info
df = pd.DataFrame({"year": ['2020_21', '2020_21','2020_21'],
    "county_fips" : ['06019','06021','06023'] , 
    "grade11" : [5000,2000,2000],
    "grade12": [5200,2200,2200],
    "grade11_chg": [1.01,1.02,1.03],
    "grade11_12_ratio": [0.9,0.8,0.87]})
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

# cross product `projection_years` and `county_fips`
prods = product(df.county_fips.unique(), projection_years)

# converting to pd.MultiIndex
prods_index = pd.MultiIndex.from_tuples(prods, names=["county_fips", "year"])

# setting index for future grouping/joining
df.set_index(["county_fips", "year"], inplace=True)

# calculate grade11
final = df.groupby([
    "county_fips",
    "year",
]).apply(lambda x: expand_grade11(x.grade11, x.grade11_chg, len(projection_years)))
final = final.explode()
final.index = prods_index
final = final.to_frame("grade11")

# concat with original df to get other columns
final = pd.concat([
    df, final
])
final.sort_index(level=["county_fips", "year"], inplace=True)
final.grade11_12_ratio.ffill(inplace=True)

# calculate grade12
grade12 = final.groupby([
    "county_fips"
]).apply(lambda x: expand_grade12(x["grade11"].tolist(), x["grade11_12_ratio"].tolist()[0]))
grade12 = grade12.explode()
grade12 = grade12.groupby("county_fips").shift(1)
grade12.index = final.index

# put it all together
final.grade12.fillna(grade12, inplace=True)
final = final[["grade11", "grade12"]]
final = final.astype(int)
final.reset_index(inplace=True)