Assume I have two dataframes, df1 and df2, described as follows. See code below that creates each of these dfs.
df1
- Has 5,000 rows and 10,000 columns.
- The first column contains a list of non-sequential dates. The dates are listed oldest to newest, but not every day is listed (i.e., only some days are listed). Each date is unique.
- Each column is labeled with a different person's name. Each column name is unique.
- All columns other than the Date column contain a number value.
df2
- Has 2,000,000 rows and 4 columns.
- The first column contains a list of dates. These are NOT sorted by oldest to newest.
- The next column contains a person's name (which is listed as the column name in one of the columns of df1).
- The other two columns contain data about that person based on the date listed in the row.
My Objective
- I want to populate the two blank columns in df2 using data pulled from df1.
- For instance, the first row of df2 lists a date of 2017-05-15 and a Person named Person4. I want to populate df2['Value_Today'] with 4752. I want to populate df2['Value_2_records_later'] with 4866.
- For the next row of df2 (with Date of 2019-01-28 and Person named Person1, I want to populate df2['Value_Today'] with 1918. I want to populate df2['Value_2_records_later'] with 1912.
- I want to do this for all 2 million rows in df2, so I assume that a for loop is a bad idea.
Any help would be greatly appreciated. Thank you!
Code
# Import dependencies
import pandas as pd
import numpy as np
# Create df1
df1 = pd.DataFrame(np.array([['2016-05-03', 1651,2653,3655,4658,5655],
['2017-05-29', 1751,2752,3754,4755, 5759],
['2018-08-22', 1889, 2882,3887, 4884, 5882],
['2019-06-28', 1966, 2965, 3966, 4960, 5963],
['2018-11-15', 1811, 2811, 3811, 4811, 5811],
['2019-12-31', 1912, 2912, 3912, 4912, 5912],
['2016-07-05', 1672, 2678, 3679, 4672, 5674],
['2017-05-15', 1755, 2750, 3759, 4752, 5755],
['2018-06-10', 1860, 2864, 3866, 4866, 5867],
['2019-01-28', 1918, 2910, 3914, 4911, 5918],
['2018-11-30', 1812, 2812, 3812, 4812, 5812],
['2019-01-03', 1915, 2917, 3916, 4916, 5917],]),
columns=['Date', 'Person1', 'Person2', 'Person3', 'Person4',
'Person5',])
# Format df1['Date'] col as datetime
df1['Date'] = pd.to_datetime(df1['Date'])
# Sort df1 by 'Date'
df1 = df1.sort_values(['Date'],ascending=[True]).reset_index(drop=True)
# Create 'df2', which contains measurement data on specific dates.
df2 = pd.DataFrame(np.array([['2017-05-15', 'Person4', '', ''], ['2019-01-28 ', 'Person1', '', ''],
['2018-11-15', 'Person1', '', ''], ['2018-08-22', 'Person3', '', ''],
['2017-05-15', 'Person5', '', ''], ['2016-05-03', 'Person2', '', ''],]),
columns=['Date', 'Person', 'Value_Today', 'Value_2_records_later'])
df2['Date'] = pd.to_datetime(df2['Date'])
# Display dfs
display(df1)
display(df2)
### I DON'T KNOW WHAT CODE I NEED TO SOLVE MY ISSUE ###
# To capture the row that is two rows below, I think I would use the '.shift(-2)' function?
CodePudding user response:
Solution with MultiIndex.map
:
- Set the index of
df1
toDate
- Stack the dataframe to create multiindex mapping series
s1
. The index of this series will be the combination of date and name of the person. Similarly create another seriess2
. - Set the index of
df2
toDate
andPerson
columns - Substitute the values in the index of
df2
using the values froms1
ands2
and assign the corresponding results toValue_Today
andValue_2_records_later
s1 = df1.set_index('Date').stack()
s2 = df1.set_index('Date').shift(-2).stack()
ix = df2.set_index(['Date', 'Person']).index
df2['Value_Today'] = ix.map(s1)
df2['Value_2_records_later'] = ix.map(s2)
Result
print(df2)
Date Person Value_Today Value_2_records_later
0 2017-05-15 Person4 4752 4866
1 2019-01-28 Person1 1918 1912
2 2018-11-15 Person1 1811 1915
3 2018-08-22 Person3 3887 3812
4 2017-05-15 Person5 5755 5867
5 2016-05-03 Person2 2653 2750
CodePudding user response:
First, copy the values once for Value_2_records_later
,
step1 = df1.set_index('Date')
persons = step1.columns.tolist()
c1 = [('Value_Today', p) for p in persons]
c2 = [('Value_2_records_later', p) for p in persons]
step1.columns = pd.MultiIndex.from_tuples(c1, names=('','Person'))
step1[c2] = step1[c1].shift(-2)
Then stack
to move columns to rows
step1.stack()