Interpolate annual data to hourly as a function in Python

Time:10-02

I am looking to take annual population data and interpolate it into an hourly time series. I am trying to write a function that produces, for each unique name, an hourly population time series covering the sample years given. I have included the code and example data below:

import pandas as pd
import random
from scipy.interpolate import interp1d

name = ['RI', 'NH', 'MA', 'RI', 'NH', 'MA','RI', 'NH', 'MA','RI', 'NH', 'MA']
year = [2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018]
population = random.sample(range(10000, 300000), 12)

df_pop = pd.DataFrame(list(zip(name, year, population)), columns=['name', 'year', 'population'])

start_year = 2015 
end_year = 2018 

def pop_sum(df_pop, start_year, end_year):

    names = df_pop['name'].unique()

    df = pd.DataFrame([])
    for i in names:

        t = df_pop['year']
        y1 = df_pop['population']
        x = pd.DataFrame({'Hours': pd.date_range(f'{start_year}-01-01', f'{end_year}-12-31',
                                                 freq='1H', closed='left')})

        pop_interp = interp1d(t, y1, x, 'linear')
    
        df = df.append(pop_interp)

    return df

This script does not work, however, and cannot loop over name. I tried looking for resources online, but converting from an annual to an hourly time series is far less common than, say, hourly to annual. I have tried scipy's interp1d, but I am open to suggestions of other packages that may do the same job. Thank you in advance for your suggestions.

CodePudding user response:

I notice that even though you are looping through an array of names, you are not using the name in the action of the loop. So, you say for i in names but you don't use i in your loop. Because of this, each iteration of your loop will produce the same result as the last, because there is no use of a variable to change the outcome of the iteration.

Since you are appending each iteration to the bottom of a new dataframe, all results will end up in the same columns. So, what you can do is pull out a small dataframe for each name, and then use THAT data to do the calculation. You also want to either make the name the index, or add a column called 'name' to the final dataframe.

Something like

names = df_pop['name'].unique()
rows = [] # collect one row per name (DataFrame.append was removed in pandas 2.0)

# hourly timestamps at which each pop_interp can later be evaluated,
# once they have been converted to the same numeric scale as 'year'
x = pd.DataFrame({'Hours': pd.date_range(f'{start_year}-01-01', f'{end_year}-12-31', freq='h', inclusive='left')})

for i in names:
    condition = df_pop['name'] == i # define condition where name is i
    mini_df = df_pop[condition] # all rows where condition is met
    t = mini_df['year']
    y1 = mini_df['population']

    pop_interp = interp1d(t, y1, kind='linear') # returns a callable built from (year, population)
    rows.append({'name': i, 'function': pop_interp}) # make a new row to append

df = pd.DataFrame(rows, columns=['name', 'function']) # one interpolating function per name

Assuming interp1d is what you want (I'm not that familiar with it), I think this structure will work better to get unique results for each name.
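
To go from one interpolating function per name to actual hourly values, the timestamps have to be put on the same numeric axis that interp1d was fitted on. Here is a minimal sketch of that step, not a definitive implementation. The helper names `to_hours` and `hourly_population` are made up for illustration, and it assumes each annual value is dated 1 January of its year, so the hourly grid stops at 1 January of end_year instead of extrapolating past the last data point.

```python
import pandas as pd
from scipy.interpolate import interp1d

EPOCH = pd.Timestamp("1970-01-01")
ONE_HOUR = pd.Timedelta(hours=1)


def to_hours(stamps):
    """Convert timestamps to float hours since the epoch (numeric axis for interp1d)."""
    return ((stamps - EPOCH) / ONE_HOUR).to_numpy()


def hourly_population(df_pop, start_year, end_year):
    """Return a long DataFrame with columns name, Hours, population."""
    # Hourly grid from the first annual point to the last one, both included.
    hours = pd.date_range(f"{start_year}-01-01", f"{end_year}-01-01", freq="h")
    frames = []
    for name, group in df_pop.groupby("name"):
        group = group.sort_values("year")
        # Assumption: each annual value is dated 1 January of its year.
        stamps = pd.to_datetime(group["year"].astype(str) + "-01-01")
        f = interp1d(to_hours(stamps), group["population"].to_numpy(), kind="linear")
        frames.append(
            pd.DataFrame({"name": name, "Hours": hours, "population": f(to_hours(hours))})
        )
    return pd.concat(frames, ignore_index=True)
```

Calling `hourly_population(df_pop, start_year, end_year)` gives one block of hourly rows per name, stacked in a single dataframe. Note that interp1d raises an error if the hourly grid reaches outside the years available for a name, so start_year and end_year need to be within the data.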

CodePudding user response:

You can convert year to datetime, set it as the index, reindex to hourly frequency, and interpolate.

[Figure: Plot of original data and interpolations]
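
The code that went with this answer did not survive, so the following is only a sketch of the steps it describes, using pandas alone. The function name `hourly_population_pandas` is made up, the pivot to one column per name is my own choice, and `interpolate(method="time")` is an assumption about which interpolation was meant. As before, each annual value is assumed to be dated 1 January of its year.

```python
import pandas as pd


def hourly_population_pandas(df_pop):
    """Return a wide DataFrame: hourly DatetimeIndex, one population column per name."""
    # Assumption: each annual value is dated 1 January of its year.
    dated = df_pop.assign(date=pd.to_datetime(df_pop["year"].astype(str) + "-01-01"))
    # One row per date, one column per name.
    wide = dated.pivot(index="date", columns="name", values="population")
    # Hourly grid spanning the first to the last annual point.
    hourly_index = pd.date_range(wide.index.min(), wide.index.max(), freq="h")
    # Reindexing inserts NaN for the new hours; interpolation then fills them
    # linearly with respect to the time elapsed between the annual points.
    return wide.reindex(hourly_index).interpolate(method="time")
```

The result has one column per name, so `hourly_population_pandas(df_pop)["RI"]` would be the hourly series for RI.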
