How to do a VLOOKUP with pandas but only get the first matches?-CodePudding

I have a python code to search data from a very small table in a larger table with two nested 'for' loops. In my opinion it is not efficient.

With pandas it could be much faster, but the way I am doing the merge adds the duplicate values to the output dataframe. How can I add to the dataframe only the first match?, that is, get the same result as with the nested 'for' loops

this the 'main_table' dataframe and I need to search each value of the 'hotel name column' in the 'hotel Name' of the 'table_to_search' dataframe

main_table:

   Discount hotel name
0      0.00    hotel A
1      0.10    hotel B
2      0.10    hotel C
3      0.15    hotel D
4      0.20    hotel E
5      0.30    hotel F
6      0.30    hotel G
7      1.00    hotel H

with the loop I get the following result:

['Groups', 'Offline TA/TO', 'Online TA', 'Online TA', 'Offline TA/TO', 'Offline TA/TO', 'Online TA', 'Direct']

and with pandas I'm getting the following:

out:

    Discount hotel name vlookup_column
0       0.00    hotel A         Groups
1       0.00    hotel A      Online TA
2       0.10    hotel B  Offline TA/TO
3       0.10    hotel C      Online TA
4       0.10    hotel C      Online TA
5       0.10    hotel C      Corporate
6       0.15    hotel D      Online TA
7       0.20    hotel E  Offline TA/TO
8       0.20    hotel E         Direct
9       0.20    hotel E      Online TA
10      0.30    hotel F  Offline TA/TO
11      0.30    hotel F      Online TA
12      0.30    hotel G      Online TA
13      1.00    hotel H         Direct

As you can see the output dataframe contains the repeated items.

The dataframe I want is the following:

out:
    Discount hotel name vlookup_column
0       0.00    hotel A         Groups
1       0.10    hotel B  Offline TA/TO
2       0.10    hotel C      Online TA
3       0.15    hotel D      Online TA
4       0.20    hotel E  Offline TA/TO
5      0.30    hotel F  Offline TA/TO
6      0.30    hotel G      Online TA
7      1.00    hotel H         Direct

this is my python code:

import pandas as pd
from datetime import datetime

start_time = datetime.now()

file_path = r'/Users/myuser/Desktop/VLOOKUP_TEST/hotel_data.xlsx'

main_table = pd.read_excel(file_path, 'market_segment', header = 0)
print(main_table)
table_to_search = pd.read_excel(file_path, '2018', header = 0)

print ('\nVLOOKUP with loop')
loop_vlookup = []
for k in range(0, len(main_table['hotel name'])):
    status = False
    for item in range(0, len(table_to_search['hotel Name'])) :
        if str(main_table['hotel name'][k]) == table_to_search['hotel Name'][item]:
            print('coincidence')
            loop_vlookup.append(table_to_search['market_segment'][item])
            status = True
            break
    if not status:
        loop_vlookup.append("")

print("\n",loop_vlookup)


# Pandas VLOOKUP
print("\nPandas VLOOKUP")
mapping = {'hotel Name': 'hotel name', 'market_segment' : 'vlookup_column'}
out = main_table.merge(table_to_search.rename(columns=mapping)[['hotel name','vlookup_column']], on = 'hotel name')
print(out)

How could I modify the pandas merge to get only the first match?

here you can find the excel file: https://drive.google.com/drive/folders/1bbXQtRvWDq84rOf2xBwRkVvSOVULF2Jt?usp=sharing

CodePudding user response：

Maybe this is what you need. Also, never use for loop in pandas. Read more on pandas merge

#Merge based on "hotel name" key
main_table = pd.read_excel(file_path, 'market_segment', header = 0)
ref_table = pd.read_excel(file_path, '2018', header = 0)
df = pd.merge(main_table, ref_table, on="hotel name", how="left")

#keep only first results
df = df.drop_duplicates(subset=["Discount", "hotel name"], keep="first")

CodePudding user response：

With your out

new = out.drop_duplicates(['discount', 'hotel name'])