Home > other >  Vlookup in Python with approximate match in a large dataframe
Vlookup in Python with approximate match in a large dataframe

Time:09-06

I have the following problem.

I have the base dataframe (df):

Type Bid Ask Hora Data Two Seconds
Buyer 5711.0 5711.5 09:00:49 2022-01-07 09:00:51
Seller 5710.0 5710.5 09:00:52 2022-01-07 09:00:54
Buyer 5707.5 5708.0 09:00:53 2022-01-07 09:00:55
Buyer 5700.5 5701.0 17:59:59 2022-01-14 18:00:01

And I have the dataframe with prices (prices):

Bid Ask Hora Data
5713.0 5713.5 09:00:51 2022-01-07
5708.0 5708.5 09:00:55 2022-01-07
5703.5 5704.0 18:00:00 2022-01-14

I would like to do vlookup column Two Seconds in df prices with approximate match. The expected result would be:

Type Bid Ask Hora Data Two Seconds
Buyer 5711.0 5711.5 09:00:49 2022-01-07 5713.5
Seller 5710.0 5710.5 09:00:52 2022-01-07 5713.0
Buyer 5707.5 5708.0 09:00:53 2022-01-07 5708.5
Buyer 5700.5 5701.0 17:59:59 2022-01-14 5704.0

I have tried the following, but I could not approximate match and my two dataframes are very large, so this code is very slow:

df["Two_seconds"] = df.apply(lambda x: prices.loc[(prices["Data"] == x["Data"]) & (prices["Hora"] == x["Two_seconds"]),'Bid' if x["Tipo"] == "Buyer" else 'Ask'], axis = 1)

How I can solve this problem?

thx

CodePudding user response:

Here you are using vague merge. Try to look at pd.merge_asof.

CodePudding user response:

TLDR Version

Try a different strategy. Don't worry about the Lamda function right now. Right now, you are trying to use two lines of code to complete your function. Try instead to see if you can complete it in at least less than 100 lines of code. Then stop to review and see if you can tighten it up later.

Full Version

I'm going to be completely transparent with my answer, most of this is over my head for the simple reason of how new I am to programming. With that being said, I think I might be able to share some insights that might still help.

I noticed you are using a Lamda function and from the looks of things, it looks like you are trying to consolidate your code into as few lines as possible. And while that is a wonderful end goal, often when we start, we need to expand things out ... a lot. I watched every lecture given for two different Harvard programming classes (specifically their CS50 class and their CS50Python class) and there was a programming strategy that I walked away with from both of those classes, that have to wonder if it might be beneficial here.

Start by just trying to get your code working. Don't worry about optimizing it or bothering with trying to get it implemented in as few lines as possible. Just get it working. From there, you stop and look at your code to see what you can do to improve it.

CodePudding user response:

You could try as follows:

# setup

import pandas as pd
import numpy as np

data = {'Type': {0: 'Buyer', 1: 'Seller', 2: 'Buyer', 3: 'Buyer'}, 
        'Bid': {0: 5711.0, 1: 5710.0, 2: 5707.5, 3: 5700.5}, 
        'Ask': {0: 5711.5, 1: 5710.5, 2: 5708.0, 3: 5701.0}, 
        'Hora': {0: '09:00:49', 1: '09:00:52', 2: '09:00:53', 3: '17:59:59'}, 
        'Data': {0: '2022-01-07', 1: '2022-01-07', 2: '2022-01-07', 
                 3: '2022-01-14'}, 
        'Two Seconds': {0: '09:00:51', 1: '09:00:54', 2: '09:00:55', 
                        3: '18:00:01'}
        }

df = pd.DataFrame(data)

data2 = {'Bid': {0: 5713.0, 1: 5708.0, 2: 5703.5}, 
         'Ask': {0: 5713.5, 1: 5708.5, 2: 5704.0}, 
         'Hora': {0: '09:00:51', 1: '09:00:55', 2: '18:00:00'}, 
         'Data': {0: '2022-01-07', 1: '2022-01-07', 2: '2022-01-14'}
         }

df2 = pd.DataFrame(data2)

# code

df['Two Seconds'] = pd.to_datetime(df['Two Seconds'], format='%H:%M:%S')
df2['Hora'] = pd.to_datetime(df2['Hora'], format='%H:%M:%S')

df = pd.merge_asof(df, df2, left_on='Two Seconds', 
                   right_on='Hora', suffixes=('','_y'))

df['Two Seconds'] = np.where(df['Type'] == 'Buyer', df['Ask_y'], df['Bid_y'])

df.drop([col for col in df.columns if col.endswith('_y')], axis=1, inplace=True)

print(df)

     Type     Bid     Ask      Hora        Data  Two Seconds
0   Buyer  5711.0  5711.5  09:00:49  2022-01-07       5713.5
1  Seller  5710.0  5710.5  09:00:52  2022-01-07       5713.0
2   Buyer  5707.5  5708.0  09:00:53  2022-01-07       5708.5
3   Buyer  5700.5  5701.0  17:59:59  2022-01-14       5704.0
  • Related