Home > Enterprise >  Iterate date range function over multiple columns, output results to new columns without explicitly
Iterate date range function over multiple columns, output results to new columns without explicitly

Time:04-12

I was having trouble with syntax and asked the question:

Sum values in one dataframe based on date range in a second dataframe

and received a nice function that created a solution for one column:

    def to_value(row):
      return df1[(row['Start'] <= df1['Date']) & (df1['Date'] <= row['End'])]['Val'].sum()

    df3 = df2.copy()
    df3['Sum_Val'] = df3.apply(to_value, axis=1)

My supervisors would now like to be able to pass a list of start dates (MM-DD) and have the function generate a result for each row from multiple columns and output the result to a new column corresponding to each input column. The new columns will then be exported and used in a different program.

Here is a very simplified example of the datasets:

    import pandas as pd
    import numpy as np

    # ** df1: Dates and Values **
    rng = pd.date_range('2011-11-30', periods=100, freq='D')
    np.random.seed(1)
    rnd = np.random.randint(low=1, high=10, size=100)
    df1 = pd.DataFrame({'Date': rng,
                        'Val': rnd,
                       })

    #** df2: End Dates **
    df2 = pd.DataFrame({'Year': ['2012', '2012', '2012'],
                        'End_Date': ['2012-02-29', '2012-03-04', '2012-03-06'],
                        'YearOffset': ['2011', '2011', '2011']
                       })

The actual datasets are fairly large (18x37K, 52x16K) and the list of dates will change depending on need, e.g., one run might have 12 dates and the next run 31 new dates. Therefore, it is expedient to just create one initial list and have the function iterate through the list. The start dates will all be in a particular year, but the year will change depending on the analysis.

I have dealt with the list issue with the following code:

    # List of start dates that will change
    start_list = ['12-01', '12-10', '12-23'] 

    # Create columns based on the start list with the appropriate year offset
    for i in start_list:
        df2[i] = df2['YearOffset']   '-'   i

    # Create a list of columns to place the results. This block may not be necessary depending how the function is written
    sum_col_list = []
    for i in start_list:
        x = i   '_Sum'
        sum_col_list.append(x)
    df2 = pd.concat([df2, pd.DataFrame(columns=sum_col_list)])

I can apply a function through various means to multiple columns by explicitly stating each column, e.g.,

    def f(x):
        return x**2
    df[['C','D']] = df[['A','B']].apply(f)

but I'm struggling how to iterate through each row of multiple columns inside of a date range function and output the result to corresponding new columns without having to explicitly state each column name.

The desired output would look like:

    #** Desired Output:  df2 -- Date Range with Summed Values **
    df2_1 = pd.DataFrame({
                        'Year': ['2012', '2012', '2012'],
                        'End_Date': ['2012-02-29', '2012-03-04', '2012-03-06'],
                        '12-01': ['12-01-2011', '12-01-2011', '12-01-2011'],
                        '12-10': ['12-10-2011', '12-10-2011', '12-10-2011'],
                        '12-23': ['12-23-2011', '12-23-2011', '12-23-2011'],
                         '12-01_Sum': [482, 499, 511],
                         '12-10_Sum': [440, 457, 469],
                         '12-23_Sum': [369, 386, 398]
                       })

n.b. np.random.seed may not yield the same results other systems

CodePudding user response:

Copying most of what you already have:

import pandas as pd
import numpy as np

# ** df1: Dates and Values **
rng = pd.date_range('2011-11-30', periods=100, freq='D')
np.random.seed(1)
rnd = np.random.randint(low=1, high=10, size=100)
df1 = pd.DataFrame({'Date': rng,
                    'Val': rnd,
                   })

#** df2: End Dates **
df2 = pd.DataFrame({'Year': ['2012', '2012', '2012'],
                    'End_Date': ['2012-02-29', '2012-03-04', '2012-03-06'],
                    'YearOffset': ['2011', '2011', '2011']
                   })


# List of start dates that will change
start_list = ['12-01', '12-10', '12-23'] 

# Create columns based on the start list with the appropriate year offset
for i in start_list:
    df2[i] = df2['YearOffset']   '-'   i

I have then changed your function slightly so it fits the column names.

def to_value(row):
    return df1[(row['Start'] <= df1['Date']) & (df1['Date'] <= row['End_Date'])]['Val'].sum()

Create a new "Start" column, with the start dates, then "explode" it so that there is one on each row, for each end date:

df2["Start"] = df2.iloc[:, 3:].values.tolist()
df2 = df2.explode("Start")

Apply the function:

df2["Sum"] = df2.apply(to_value, axis=1)

Change the "Start" values to be the desired column names:

df2["Start"] = pd.to_datetime(df2["Start"]).dt.strftime("%m-%d")   "_Sum"

Pivot the dataframe to the desired output:

df2 = df2.pivot(index=["Year", "End_Date"] start_list, columns="Start", values="Sum").reset_index()

Output:

#Out: 
#Start  Year    End_Date       12-01  ... 12-01_Sum 12-10_Sum  12-23_Sum
#0      2012  2012-02-29  2011-12-01  ...       482       440        369
#1      2012  2012-03-04  2011-12-01  ...       499       457        386
#2      2012  2012-03-06  2011-12-01  ...       511       469        398

#[3 rows x 8 columns]

There may have been slightly more efficient parts within the code, but I used what you already had and worked from there, rather than re-writing those parts/combining them with later parts.

  • Related