I have base dataframe that looks like this:
mname p_code p_name fcval shotdate actual_1 actual_2 actual_3
0 101_1210 BankABC 5590890 2015-02-05 10 20 30
and a control dataframe that looks like this:
mname p_code p_name fcval shotdate prd_1 prd_2 prd_3
30 101_1210 BankABC 5590890 2015-02-05 15 30 40
Note: The number of feature/cols like actual and prd are 48 i.e actual_1, actual_2 ... actual_48, same for prd.
There could be multiple dataframes like the control one, the structure stays the same.
I want to calculate the difference between actual_* columns of base by shifting the prd_* columns of control by a lag, and store the result in control dataframe in new column called error. The lag is calulated as
mname//30 = 1 in this case, could be 3 if mname=90 as 90//30=3, the the lag woudl be 3, ie shifting the **prd_*** cells by 3
In the above case, the difference would be like this
actual_1 actual_2 actual_3
prd_1 prd_2 prd_3
will result in
err1 = actual_2 - prd_1 = 20-15 = 5
err2 = actual_3 - prd_2 = 30-30 = 0
err3 = NaN because there is no matching actual
The resulting dataframe looks like this:
mname p_code p_name fcval shotdate mNumber error
30 101_1210 BankABC 5590890 2015-02-05 1 5
30 101_1210 BankABC 5590890 2015-02-05 2 0
30 101_1210 BankABC 5590890 2015-02-05 3 NaN
Also, if nay of the acutal is nan, then error should be NaN.
I have been trying this with apply and lag but unsuccessfull.
Thanks in advance!
CodePudding user response:
I do not claim that the following method is clean and efficient, however it seems to do the job.
- calculate the
lag
- create a temporary dataframe with the
Actual
andControl
values - shift the control values by the
lag
- remove unnecessary rows
- create the
mnumber
column - calculate the
error
and place values in the error column - combine the temporary dataframe with that of the control dataframe to produce the expected output.
- print the new dataframe
new_df
Creating the Dataframes:
import pandas as pd
import numpy as np
base_df = pd.DataFrame({ 'mname': [0],
'p_code': ['101_1210'],
'p_name': ['BankABC'],
'fcval': [5590890],
'shotdate': ['2015-02-05'],
'actual_1': [10],
'actual_2': [20],
'actual_3': [30]})
control_df = pd.DataFrame({ 'mname': [30],
'p_code': ['101_1210'],
'p_name': ['BankABC'],
'fcval': [5590890],
'shotdate': ['2015-02-05'],
'prd_1': [15],
'prd_2': [30],
'prd_3': [40]})
Code:
# Calculate the lag
lag = control_df.mname[0]//30
# Create a temporary dataframe to perform calculations and work out the error
temp_df = pd.DataFrame({"Actual":base_df.iloc[0,5:]}).reset_index(drop=True)
temp_df.loc[max(temp_df.index) 1, :] = np.nan
temp_df['Control']= control_df.iloc[0,5:].reset_index(drop=True)
temp_df.Control = temp_df.Control.shift(lag)
while pd.isna(temp_df.Control[0]):
temp_df = temp_df.drop(0, axis=0).reset_index(drop=True)
temp_df['mNumber'] = range(1, len(temp_df) 1)
temp_df['error'] = temp_df['Actual'] - temp_df['Control']
# Combine the error with control dataframe
new_df = pd.DataFrame(control_df.iloc[0, 0:5]).T
new_df = new_df.merge(temp_df, how="cross").drop(['Actual', 'Control'], axis=1)
print(new_df)
Output:
mname p_code p_name fcval shotdate mNumber error
0 30 101_1210 BankABC 5590890 2015-02-05 1 5
1 30 101_1210 BankABC 5590890 2015-02-05 2 0
2 30 101_1210 BankABC 5590890 2015-02-05 3 NaN
CodePudding user response:
It is much easier to handle this kind of operations with stacked dataframes.
First, let's make an example with two rows in control
, for sake of completeness:
base = pd.DataFrame({
'mname': [0], 'p_code': ['101_1210'], 'p_name': ['BankABC'], 'fcval': [5590890],
'shotdate': ['2015-02-05'], 'actual_1': [10], 'actual_2': [20], 'actual_3': [30],
'actual_4': [50],
})
control = pd.DataFrame({
'mname': [30, 90], 'p_code': ['101_1210', '101_1210'], 'p_name': ['BankABC', 'BankABC'],
'fcval': [5590890, 5590890], 'shotdate': ['2015-02-05', '2015-02-05'],
'prd_1': [15, 20], 'prd_2': [30, 25], 'prd_3': [40, 65],
})
Then:
def reshape(df, prefix):
pcols = [k for k in df.columns if k.startswith(prefix)]
pcint = [int(k[len(prefix):]) for k in pcols]
others = [k for k in df.columns if not k in pcols]
df = df.set_index(others).set_axis(pcint, axis=1).rename_axis(columns='lag')
df = df.stack().to_frame('val').reset_index()
df['lag'] = df['mname'] // 30
return df
Example:
>>> reshape(base, 'actual_')
mname p_code p_name fcval shotdate lag val
0 0 101_1210 BankABC 5590890 2015-02-05 1 10
1 0 101_1210 BankABC 5590890 2015-02-05 2 20
2 0 101_1210 BankABC 5590890 2015-02-05 3 30
3 0 101_1210 BankABC 5590890 2015-02-05 4 50
Now:
a = reshape(base, 'actual_')
b = reshape(control, 'prd_')
on = [k for k in a.columns if k not in {'mname', 'val'}]
z = b.merge(a.drop('mname', axis=1), 'left', on=on)
z['error'] = z['val_y'] - z['val_x']
>>> z
mname p_code p_name fcval shotdate lag val_x val_y error
0 30 101_1210 BankABC 5590890 2015-02-05 2 15 20.0 5.0
1 30 101_1210 BankABC 5590890 2015-02-05 3 30 30.0 0.0
2 30 101_1210 BankABC 5590890 2015-02-05 4 40 50.0 10.0
3 90 101_1210 BankABC 5590890 2015-02-05 4 20 50.0 30.0
4 90 101_1210 BankABC 5590890 2015-02-05 5 25 NaN NaN
5 90 101_1210 BankABC 5590890 2015-02-05 6 65 NaN NaN