Home > OS >  How to pull data into a pandas df_master from one of several other dfs, where the df from which data
How to pull data into a pandas df_master from one of several other dfs, where the df from which data

Time:02-17

Please see code below.

Overview

I have a dataframe 'df_master' that contains four columns: Date; the name of a specific df from which data should be pulled; measurement on specified Date; and measurement one year after specified Date.

What I am trying to do is write code so that, for each row in df_master, the code pulls measurement data into df_master from one of a large number of other dfs. Each row will have a different df from which it is pulling measurement data. I want to populate two new columns in df_master. One would be the measurement data pulled from the specified df for the date listed in that row in df_master, and the other would be measurement data on the date that is one year in the future.

Example

For instance, consider the code below. The first row in df_master is '2016-01-01' and 'df_B'. This means that the first row of df_master should populate with a value of 250 for 'Measurement_Today' and 265 for 'Measurement_One_Year_in_Future'.

The next row of df_master would need to pull the Measurement data from df_C, etc. In other words, each row of df_master would specify the df from which the Measurement data would come from, and that df differs by row.

Assume that there are thousands of dfs from which df_master must pull data, and combining all of those thousands of dfs into one df is not possible.

I have tried to write a for loop that would pull this data in without success. I also tried to use .iloc to pull in the data for one year in the future, but that is not working, either.

Any help you can provide would be greatly appreciated. Thank you.

Code

# Import dependencies
import pandas as pd
import numpy as np

# Create 'df_A', 'df_B', 'df_C' and 'df_D', which contain measurement data on specific dates.
df_A = pd.DataFrame(np.array([['2016-01-01', 150], ['2017-01-01', 145], 
                              ['2018-01-01', 163], ['2019-01-01', 170],
                              ['2020-01-01', 198], ['2021-01-01', 189],]),
                   columns=['Date', 'Measurement'])
df_A['Date'] = pd.to_datetime(df_A['Date'])

df_B = pd.DataFrame(np.array([['2016-01-01', 250], ['2017-01-01', 265], 
                              ['2018-01-01', 221], ['2019-01-01', 285],
                              ['2020-01-01', 298], ['2021-01-01', 289],]),
                   columns=['Date', 'Measurement'])
df_B['Date'] = pd.to_datetime(df_B['Date'])

df_C = pd.DataFrame(np.array([['2016-01-01', 350], ['2017-01-01', 367], 
                              ['2018-01-01', 392], ['2019-01-01', 370],
                              ['2020-01-01', 398], ['2021-01-01', 389],]),
                   columns=['Date', 'Measurement'])
df_C['Date'] = pd.to_datetime(df_C['Date'])

df_D = pd.DataFrame(np.array([['2016-01-01', 450], ['2017-01-01', 454], 
                              ['2018-01-01', 413], ['2019-01-01', 480],
                              ['2020-01-01', 498], ['2021-01-01', 489],]),
                   columns=['Date', 'Measurement'])
df_D['Date'] = pd.to_datetime(df_D['Date'])


# Create df_master 
df_master = pd.DataFrame(np.array([['2016-01-01', 'df_B','','' ], ['2017-01-01', 'df_C','',''  ], 
                                   ['2018-01-01', 'df_B','','' ], ['2019-01-01', 'df_A','','' ],
                                   ['2018-01-01', 'df_A','','' ], ['2019-01-01', 'df_D','','' ],]),
                   columns=['Date', 'df_to_pull_measurement_from', 'Measurement_Today', 
                            'Measurement_Next_Year'])
df_master['Date'] = pd.to_datetime(df_master['Date'])


# Create list of dfs from df_master['df_to_pull_measurement_from'].
list_of_dfs = df_master['df_to_pull_measurement_from']

### THIS DOES NOT WORK ###
# Add columns to df_master for measurement on given date and measurement one year into future.
for row in list_of_dfs:
    selected_df = row   '[\'Measurement\']'
    df_master['Measurement_Today'] = selected_df
    df_master['Measurement_Next_Year'] = selected_df.iloc[idx  1]

CodePudding user response:

I think using globals for string variables is not good idea, better is create dictionary:

#create DatetimeIndex 
df_master = df_master.set_index('Date')

for row in list_of_dfs:

    selected_df = globals()[row].set_index('Date')['Measurement']
    m1 = df_master['df_to_pull_measurement_from'].eq(row)
    nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))

    df_master.loc[m1, 'Measurement_Today'] = selected_df
    df_master.loc[m1, 'Measurement_Next_Year'] = nexty
    

#create DatetimeIndex 
df_master = df_master.set_index('Date')

#dict of DataFrames
dfs = {'df_A':df_A, 'df_B':df_B,'df_C':df_C, 'df_D':df_D}
#create DatetimeIndex in each DataFrame
dfs1 = {k: v.set_index('Date') for k, v in dfs.items()}

for row in list_of_dfs:

    selected_df = dfs1[row]['Measurement']
    m1 = df_master['df_to_pull_measurement_from'].eq(row)
    #for next years is subtract one year from DatetimeIndex
    nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))

    df_master.loc[m1, 'Measurement_Today'] = selected_df
    df_master.loc[m1, 'Measurement_Next_Year'] = nexty

print (df_master)

           df_to_pull_measurement_from Measurement_Today Measurement_Next_Year
Date                                                                          
2016-01-01                        df_B               250                   265
2017-01-01                        df_C               367                   392
2018-01-01                        df_B               221                   285
2019-01-01                        df_A               170                   198
2018-01-01                        df_A               163                   170
2019-01-01                        df_D               480                   498
  • Related