Home > Blockchain >  How to pull data from python pandas df1 to df2 based on values in each row of df2, sort of like a VL
How to pull data from python pandas df1 to df2 based on values in each row of df2, sort of like a VL

Time:02-20

Assume I have two dataframes, df1 and df2, described as follows. See code below that creates each of these dfs.

df1

  • Has 5,000 rows and 10,000 columns.
  • The first column contains a list of non-sequential dates. The dates are listed oldest to newest, but not every day is listed (i.e., only some days are listed). Each date is unique.
  • Each column is labeled with a different person's name. Each column name is unique.
  • All columns other than the Date column contain a number value.

df2

  • Has 2,000,000 rows and 4 columns.
  • The first column contains a list of dates. These are NOT sorted by oldest to newest.
  • The next column contains a person's name (which is listed as the column name in one of the columns of df1).
  • The other two columns contain data about that person based on the date listed in the row.

My Objective

  • I want to populate the two blank columns in df2 using data pulled from df1.
  • For instance, the first row of df2 lists a date of 2017-05-15 and a Person named Person4. I want to populate df2['Value_Today'] with 4752. I want to populate df2['Value_2_records_later'] with 4866.
  • For the next row of df2 (with Date of 2019-01-28 and Person named Person1, I want to populate df2['Value_Today'] with 1918. I want to populate df2['Value_2_records_later'] with 1912.
  • I want to do this for all 2 million rows in df2, so I assume that a for loop is a bad idea.

Any help would be greatly appreciated. Thank you!

Code

# Import dependencies
import pandas as pd
import numpy as np

# Create df1 
df1 = pd.DataFrame(np.array([['2016-05-03', 1651,2653,3655,4658,5655], 
                             ['2017-05-29', 1751,2752,3754,4755, 5759], 
                             ['2018-08-22', 1889, 2882,3887, 4884, 5882], 
                             ['2019-06-28', 1966, 2965, 3966, 4960, 5963],
                             ['2018-11-15', 1811, 2811, 3811, 4811, 5811], 
                             ['2019-12-31', 1912, 2912, 3912, 4912, 5912],
                             ['2016-07-05', 1672, 2678, 3679, 4672, 5674], 
                             ['2017-05-15', 1755, 2750, 3759, 4752, 5755], 
                             ['2018-06-10', 1860, 2864, 3866, 4866, 5867], 
                             ['2019-01-28', 1918, 2910, 3914, 4911, 5918],
                             ['2018-11-30', 1812, 2812, 3812, 4812, 5812], 
                             ['2019-01-03', 1915, 2917, 3916, 4916, 5917],]),
                   columns=['Date', 'Person1', 'Person2', 'Person3', 'Person4', 
                            'Person5',])
# Format df1['Date'] col as datetime
df1['Date'] = pd.to_datetime(df1['Date'])
# Sort df1 by 'Date'
df1 = df1.sort_values(['Date'],ascending=[True]).reset_index(drop=True)

# Create 'df2', which contains measurement data on specific dates.
df2 = pd.DataFrame(np.array([['2017-05-15', 'Person4', '', ''], ['2019-01-28    ', 'Person1', '', ''], 
                              ['2018-11-15', 'Person1', '', ''], ['2018-08-22', 'Person3', '', ''],
                              ['2017-05-15', 'Person5', '', ''], ['2016-05-03', 'Person2', '', ''],]),
                   columns=['Date', 'Person', 'Value_Today', 'Value_2_records_later'])
df2['Date'] = pd.to_datetime(df2['Date'])

# Display dfs
display(df1)
display(df2)

### I DON'T KNOW WHAT CODE I NEED TO SOLVE MY ISSUE ###

# To capture the row that is two rows below, I think I would use the '.shift(-2)' function?

CodePudding user response:

Solution with MultiIndex.map:

  • Set the index of df1 to Date
  • Stack the dataframe to create multiindex mapping series s1. The index of this series will be the combination of date and name of the person. Similarly create another series s2.
  • Set the index of df2 to Date and Person columns
  • Substitute the values in the index of df2 using the values from s1 and s2 and assign the corresponding results to Value_Today and Value_2_records_later
s1 = df1.set_index('Date').stack()
s2 = df1.set_index('Date').shift(-2).stack()
ix = df2.set_index(['Date', 'Person']).index

df2['Value_Today'] = ix.map(s1)
df2['Value_2_records_later'] = ix.map(s2)

Result

print(df2)

        Date   Person Value_Today Value_2_records_later
0 2017-05-15  Person4        4752                  4866
1 2019-01-28  Person1        1918                  1912
2 2018-11-15  Person1        1811                  1915
3 2018-08-22  Person3        3887                  3812
4 2017-05-15  Person5        5755                  5867
5 2016-05-03  Person2        2653                  2750

CodePudding user response:

First, copy the values once for Value_2_records_later,

step1 = df1.set_index('Date')
persons = step1.columns.tolist()

c1 = [('Value_Today', p) for p in persons]
c2 = [('Value_2_records_later', p) for p in persons]

step1.columns = pd.MultiIndex.from_tuples(c1, names=('','Person'))
step1[c2] = step1[c1].shift(-2)

Then stack to move columns to rows

step1.stack()
  • Related