Home > OS >  Calculating difference of two columns with lag and storing result in other
Calculating difference of two columns with lag and storing result in other

Time:12-23

I have base dataframe that looks like this:

mname  p_code        p_name      fcval     shotdate          actual_1  actual_2  actual_3                                                          
0      101_1210      BankABC     5590890    2015-02-05       10         20       30    
     

and a control dataframe that looks like this:

mname  p_code      p_name      fcval     shotdate        prd_1     prd_2     prd_3
30     101_1210    BankABC     5590890   2015-02-05       15       30        40  

Note: The number of feature/cols like actual and prd are 48 i.e actual_1, actual_2 ... actual_48, same for prd.

There could be multiple dataframes like the control one, the structure stays the same.

I want to calculate the difference between actual_* columns of base by shifting the prd_* columns of control by a lag, and store the result in control dataframe in new column called error. The lag is calulated as

mname//30 = 1 in this case, could be 3 if mname=90 as 90//30=3, the the lag woudl be 3, ie shifting the **prd_*** cells by 3

In the above case, the difference would be like this

actual_1  actual_2  actual_3
            prd_1     prd_2     prd_3

will result in

err1 = actual_2 - prd_1 = 20-15 = 5
err2 = actual_3 - prd_2 = 30-30 = 0
err3 = NaN because there is no matching actual

The resulting dataframe looks like this:

mname  p_code      p_name      fcval     shotdate    mNumber   error
30     101_1210    BankABC     5590890   2015-02-05   1         5
30     101_1210    BankABC     5590890   2015-02-05   2         0
30     101_1210    BankABC     5590890   2015-02-05   3         NaN

Also, if nay of the acutal is nan, then error should be NaN.

I have been trying this with apply and lag but unsuccessfull.

Thanks in advance!

CodePudding user response:

I do not claim that the following method is clean and efficient, however it seems to do the job.

  • calculate the lag
  • create a temporary dataframe with the Actual and Control values
  • shift the control values by the lag
  • remove unnecessary rows
  • create the mnumber column
  • calculate the error and place values in the error column
  • combine the temporary dataframe with that of the control dataframe to produce the expected output.
  • print the new dataframe new_df

Creating the Dataframes:

import pandas as pd
import numpy as np

base_df = pd.DataFrame({ 'mname': [0],
                         'p_code': ['101_1210'],
                         'p_name': ['BankABC'],
                         'fcval': [5590890],
                         'shotdate': ['2015-02-05'],
                         'actual_1': [10],
                         'actual_2': [20],
                         'actual_3': [30]})

control_df = pd.DataFrame({ 'mname': [30],
                            'p_code': ['101_1210'],
                            'p_name': ['BankABC'],
                            'fcval': [5590890],
                            'shotdate': ['2015-02-05'],
                            'prd_1': [15],
                            'prd_2': [30],
                            'prd_3': [40]})

Code:

# Calculate the lag
lag = control_df.mname[0]//30

# Create a temporary dataframe to perform calculations and work out the error
temp_df = pd.DataFrame({"Actual":base_df.iloc[0,5:]}).reset_index(drop=True)
temp_df.loc[max(temp_df.index) 1, :] = np.nan
temp_df['Control']= control_df.iloc[0,5:].reset_index(drop=True)
temp_df.Control = temp_df.Control.shift(lag)

while pd.isna(temp_df.Control[0]):
    temp_df = temp_df.drop(0, axis=0).reset_index(drop=True)
temp_df['mNumber'] = range(1, len(temp_df) 1)
temp_df['error'] = temp_df['Actual'] - temp_df['Control']

# Combine the error with control dataframe
new_df = pd.DataFrame(control_df.iloc[0, 0:5]).T
new_df = new_df.merge(temp_df, how="cross").drop(['Actual', 'Control'], axis=1)

print(new_df)


Output:

  mname    p_code   p_name    fcval    shotdate  mNumber error
0    30  101_1210  BankABC  5590890  2015-02-05        1     5
1    30  101_1210  BankABC  5590890  2015-02-05        2     0
2    30  101_1210  BankABC  5590890  2015-02-05        3   NaN

CodePudding user response:

It is much easier to handle this kind of operations with stacked dataframes.

First, let's make an example with two rows in control, for sake of completeness:

base = pd.DataFrame({
    'mname': [0], 'p_code': ['101_1210'], 'p_name': ['BankABC'], 'fcval': [5590890],
    'shotdate': ['2015-02-05'], 'actual_1': [10], 'actual_2': [20], 'actual_3': [30],
    'actual_4': [50],
})
control = pd.DataFrame({
    'mname': [30, 90], 'p_code': ['101_1210', '101_1210'], 'p_name': ['BankABC', 'BankABC'],
    'fcval': [5590890, 5590890], 'shotdate': ['2015-02-05', '2015-02-05'],
    'prd_1': [15, 20], 'prd_2': [30, 25], 'prd_3': [40, 65],
})

Then:

def reshape(df, prefix):
    pcols = [k for k in df.columns if k.startswith(prefix)]
    pcint = [int(k[len(prefix):]) for k in pcols]
    others = [k for k in df.columns if not k in pcols]
    df = df.set_index(others).set_axis(pcint, axis=1).rename_axis(columns='lag')
    df = df.stack().to_frame('val').reset_index()
    df['lag']  = df['mname'] // 30
    return df

Example:

>>> reshape(base, 'actual_')
   mname    p_code   p_name    fcval    shotdate  lag  val
0      0  101_1210  BankABC  5590890  2015-02-05    1   10
1      0  101_1210  BankABC  5590890  2015-02-05    2   20
2      0  101_1210  BankABC  5590890  2015-02-05    3   30
3      0  101_1210  BankABC  5590890  2015-02-05    4   50

Now:

a = reshape(base, 'actual_')
b = reshape(control, 'prd_')
on = [k for k in a.columns if k not in {'mname', 'val'}]
z = b.merge(a.drop('mname', axis=1), 'left', on=on)
z['error'] = z['val_y'] - z['val_x']
>>> z
   mname    p_code   p_name    fcval    shotdate  lag  val_x  val_y  error
0     30  101_1210  BankABC  5590890  2015-02-05    2     15   20.0    5.0
1     30  101_1210  BankABC  5590890  2015-02-05    3     30   30.0    0.0
2     30  101_1210  BankABC  5590890  2015-02-05    4     40   50.0   10.0
3     90  101_1210  BankABC  5590890  2015-02-05    4     20   50.0   30.0
4     90  101_1210  BankABC  5590890  2015-02-05    5     25    NaN    NaN
5     90  101_1210  BankABC  5590890  2015-02-05    6     65    NaN    NaN
  • Related