Please see code below.
Overview
I have a dataframe 'df_master' that contains four columns: Date; the name of a specific df from which data should be pulled; measurement on specified Date; and measurement one year after specified Date.
What I am trying to do is write code so that, for each row in df_master, the code pulls measurement data into df_master from one of a large number of other dfs. Each row will have a different df from which it is pulling measurement data. I want to populate two new columns in df_master. One would be the measurement data pulled from the specified df for the date listed in that row in df_master, and the other would be measurement data on the date that is one year in the future.
Example
For instance, consider the code below. The first row in df_master is '2016-01-01' and 'df_B'. This means that the first row of df_master should populate with a value of 250 for 'Measurement_Today' and 265 for 'Measurement_One_Year_in_Future'.
The next row of df_master would need to pull the Measurement data from df_C, etc. In other words, each row of df_master would specify the df from which the Measurement data would come from, and that df differs by row.
Assume that there are thousands of dfs from which df_master must pull data, and combining all of those thousands of dfs into one df is not possible.
I have tried to write a for loop that would pull this data in without success. I also tried to use .iloc to pull in the data for one year in the future, but that is not working, either.
Any help you can provide would be greatly appreciated. Thank you.
Code
# Import dependencies
import pandas as pd
import numpy as np
# Create 'df_A', 'df_B', 'df_C' and 'df_D', which contain measurement data on specific dates.
df_A = pd.DataFrame(np.array([['2016-01-01', 150], ['2017-01-01', 145],
['2018-01-01', 163], ['2019-01-01', 170],
['2020-01-01', 198], ['2021-01-01', 189],]),
columns=['Date', 'Measurement'])
df_A['Date'] = pd.to_datetime(df_A['Date'])
df_B = pd.DataFrame(np.array([['2016-01-01', 250], ['2017-01-01', 265],
['2018-01-01', 221], ['2019-01-01', 285],
['2020-01-01', 298], ['2021-01-01', 289],]),
columns=['Date', 'Measurement'])
df_B['Date'] = pd.to_datetime(df_B['Date'])
df_C = pd.DataFrame(np.array([['2016-01-01', 350], ['2017-01-01', 367],
['2018-01-01', 392], ['2019-01-01', 370],
['2020-01-01', 398], ['2021-01-01', 389],]),
columns=['Date', 'Measurement'])
df_C['Date'] = pd.to_datetime(df_C['Date'])
df_D = pd.DataFrame(np.array([['2016-01-01', 450], ['2017-01-01', 454],
['2018-01-01', 413], ['2019-01-01', 480],
['2020-01-01', 498], ['2021-01-01', 489],]),
columns=['Date', 'Measurement'])
df_D['Date'] = pd.to_datetime(df_D['Date'])
# Create df_master
df_master = pd.DataFrame(np.array([['2016-01-01', 'df_B','','' ], ['2017-01-01', 'df_C','','' ],
['2018-01-01', 'df_B','','' ], ['2019-01-01', 'df_A','','' ],
['2018-01-01', 'df_A','','' ], ['2019-01-01', 'df_D','','' ],]),
columns=['Date', 'df_to_pull_measurement_from', 'Measurement_Today',
'Measurement_Next_Year'])
df_master['Date'] = pd.to_datetime(df_master['Date'])
# Create list of dfs from df_master['df_to_pull_measurement_from'].
list_of_dfs = df_master['df_to_pull_measurement_from']
### THIS DOES NOT WORK ###
# Add columns to df_master for measurement on given date and measurement one year into future.
for row in list_of_dfs:
selected_df = row '[\'Measurement\']'
df_master['Measurement_Today'] = selected_df
df_master['Measurement_Next_Year'] = selected_df.iloc[idx 1]
CodePudding user response:
I think using globals for string variables is not good idea, better is create dictionary:
#create DatetimeIndex
df_master = df_master.set_index('Date')
for row in list_of_dfs:
selected_df = globals()[row].set_index('Date')['Measurement']
m1 = df_master['df_to_pull_measurement_from'].eq(row)
nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))
df_master.loc[m1, 'Measurement_Today'] = selected_df
df_master.loc[m1, 'Measurement_Next_Year'] = nexty
#create DatetimeIndex
df_master = df_master.set_index('Date')
#dict of DataFrames
dfs = {'df_A':df_A, 'df_B':df_B,'df_C':df_C, 'df_D':df_D}
#create DatetimeIndex in each DataFrame
dfs1 = {k: v.set_index('Date') for k, v in dfs.items()}
for row in list_of_dfs:
selected_df = dfs1[row]['Measurement']
m1 = df_master['df_to_pull_measurement_from'].eq(row)
#for next years is subtract one year from DatetimeIndex
nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))
df_master.loc[m1, 'Measurement_Today'] = selected_df
df_master.loc[m1, 'Measurement_Next_Year'] = nexty
print (df_master)
df_to_pull_measurement_from Measurement_Today Measurement_Next_Year
Date
2016-01-01 df_B 250 265
2017-01-01 df_C 367 392
2018-01-01 df_B 221 285
2019-01-01 df_A 170 198
2018-01-01 df_A 163 170
2019-01-01 df_D 480 498