Home > Net >  Comparing results between 2 dataframes
Comparing results between 2 dataframes

Time:11-08

I have 2 dataframes:

  1. Dataframe 1: Contains a number of unique securities with its current price.
  2. Dataframe 2: Contains new prices from Bloomberg (hypothetically) within a pricing window for the called securities in df1.

The challenge I am trying to solve is to obtain a new price for each security held in df1 from df2. If the new price for a matching security is within a set tolerance percentage (1%) than that price will be used. Additionally, it needs to take the price with the lowest percentage difference within that pricing window of 10 mins (14:50 to 14:59) for a set date.

Dataframe1:

import pandas as pd
import numpy as np
# Create Internal File 
data = [['1',99.434],['2',99.987],['3',98.117],['4',95.557]]
df = pd.DataFrame(data, columns = ['Security_ID', 'OldPrice'])

Output of df1:

 Security_ID    OldPrice
0          1      99.434
1          2      99.987
2          3      98.117
3          4      95.557

Dataframe2 (Price file from Bloomberg):

data = [['1','10/10/2022 14:50',98.50],
        ['1','10/10/2022 14:51',98.30],
        ['1','10/10/2022 14:52',98.00],
        ['1','10/10/2022 14:53',101.34],
        ['1','10/10/2022 14:54',98.30],
        ['1','10/10/2022 14:55',98.60],
        ['1','10/10/2022 14:56',99.24],
        ['1','10/10/2022 14:57',99.99],
        ['1','10/10/2022 14:58',98.40],
        ['1','10/10/2022 14:59',101.12],
        ['2','10/10/2022 14:50',101.12],
        ['2','10/10/2022 14:51',101.12],
        ['2','10/10/2022 14:52',98.9],
        ['2','10/10/2022 14:53',98.9],
        ['2','10/10/2022 14:54',98.9],
        ['2','10/10/2022 14:55',98.9],
        ['2','10/10/2022 14:56',98.9],
        ['2','10/10/2022 14:57',101.12],
        ['2','10/10/2022 14:58',101.12],
        ['2','10/10/2022 14:59',101.12],
        ['3','10/10/2022 14:50',99.6],
        ['3','10/10/2022 14:51',99.6],
        ['3','10/10/2022 14:52',99.6],
        ['3','10/10/2022 14:53',99.6],
        ['3','10/10/2022 14:54',100],
        ['3','10/10/2022 14:55',98.8],
        ['3','10/10/2022 14:56',99.6],
        ['3','10/10/2022 14:57',99.6],
        ['3','10/10/2022 14:58',99.6],
        ['3','10/10/2022 14:59',99.6]]
df2 = pd.DataFrame(data, columns = ['BB_Security_ID', 'BB_Time', 'BB_NewPrice'])

Output of df2 (df2.head(5):

 BB_Security_ID          BB_Time    BB_NewPrice
0             1 10/10/2022 14:50          98.50
1             1 10/10/2022 14:51          98.30
2             1 10/10/2022 14:52          98.00
3             1 10/10/2022 14:53         101.34
4             1 10/10/2022 14:54          98.30
# change time data type
df2['BB_Time'] = pd.to_datetime(df2['BB_Time'], errors='coerce')
df2['BB_Time'] = pd.to_datetime(df2['BB_Time'].dt.strftime('%d/%m/%Y %H:%M:%S'))
df2['BB_Security_ID']=df2['BB_Security_ID'].astype(str)

# Join dataframes on security id
df = pd.merge(df, df2, left_on = ['Security_ID'],right_on = ['BB_Security_ID'], how = 'left')

# function to calculate percentage difference 
def percentage_dff(col1,col2):
    """
    Calculate percentage difference between 2 columns and return absolute value
    """
    return abs(((col1 - col2)/ col2) * 100)
df['PriceDff'] = percentage_dff(df['OldPrice'],df['BB_NewPrice'])

Output of joined dfs with price difference added (df.head(5))

Security_ID OldPrice    BB_Security_ID              BB_Time BB_NewPrice PriceDff
0         1 99.434                   1  2022-10-10 14:50:00       98.50 0.948223
1         1 99.434                   1  2022-10-10 14:51:00       98.30 1.153611
2         1 99.434                   1  2022-10-10 14:52:00       98.00 1.463265
3         1 99.434                   1  2022-10-10 14:53:00      101.34 1.880797
4         1 99.434                   1  2022-10-10 14:54:00       98.30 1.153611

Desired Outcome: The following output is the expected outcome I am looking to get to that I haven't figured out yet. Any help will be much appreciated.

 Security_ID    OldPrice                BB_Time BB_NewPrice PriceDff    WithinTolerence
0          1      99.434    2022-10-10 14:57:00       99.24 0.195486                  Y
1          2      99.987                    NaN         NaN      NaN                  N
2          3      98.117    2022-10-10 14:55:00       98.80 0.691296                  Y
3          4      95.557                    NaN         NaN      NaN       NoPriceFound

The system can accept the new price if its within 1% tolerance window otherwise no price is accepted for downstream use. If there is a price under the tolerance set, it needs to find the closest match with the lowest % difference

The expected outcome justification:

  • Security ID 1 should accept new price of 99.24 priced at 2022-10-10 14:57:00 as it has the lowest percentage difference, so WithinTolerence set to Y.
  • Security ID 2 does not accept any new prices as they are all outside the tolerance window, so WithinTolerence set to N.
  • Security ID 3 should accept new price of 98.80 priced at 2022-10-10 14:55:00. It is the only price within the pricing window under set tolerance so WithinTolerence set to Y.
  • Security ID 4 - No price available from Bloomberg so the column "WithinTolerence" is labelled as NoPriceFound.

CodePudding user response:

I start at df after you merge df1 and df2.

df.sort_values(by=['Security_ID', 'PriceDff'], inplace=True)
out = df.groupby('Security_ID')[['OldPrice', 'BB_NewPrice', 'PriceDff']].first().reset_index().copy()
out['BB_NewPrice'] = np.where(out['PriceDff'] <= 1.0, out['BB_NewPrice'], np.NaN)
  • Related