I have a python code to search data from a very small table in a larger table with two nested 'for' loops. In my opinion it is not efficient.
With pandas it could be much faster, but the way I am doing the merge adds the duplicate values to the output dataframe. How can I add to the dataframe only the first match?, that is, get the same result as with the nested 'for' loops
this the 'main_table' dataframe and I need to search each value of the 'hotel name column' in the 'hotel Name' of the 'table_to_search' dataframe
main_table:
Discount hotel name
0 0.00 hotel A
1 0.10 hotel B
2 0.10 hotel C
3 0.15 hotel D
4 0.20 hotel E
5 0.30 hotel F
6 0.30 hotel G
7 1.00 hotel H
with the loop I get the following result:
['Groups', 'Offline TA/TO', 'Online TA', 'Online TA', 'Offline TA/TO', 'Offline TA/TO', 'Online TA', 'Direct']
and with pandas I'm getting the following:
out:
Discount hotel name vlookup_column
0 0.00 hotel A Groups
1 0.00 hotel A Online TA
2 0.10 hotel B Offline TA/TO
3 0.10 hotel C Online TA
4 0.10 hotel C Online TA
5 0.10 hotel C Corporate
6 0.15 hotel D Online TA
7 0.20 hotel E Offline TA/TO
8 0.20 hotel E Direct
9 0.20 hotel E Online TA
10 0.30 hotel F Offline TA/TO
11 0.30 hotel F Online TA
12 0.30 hotel G Online TA
13 1.00 hotel H Direct
As you can see the output dataframe contains the repeated items.
The dataframe I want is the following:
out:
Discount hotel name vlookup_column
0 0.00 hotel A Groups
1 0.10 hotel B Offline TA/TO
2 0.10 hotel C Online TA
3 0.15 hotel D Online TA
4 0.20 hotel E Offline TA/TO
5 0.30 hotel F Offline TA/TO
6 0.30 hotel G Online TA
7 1.00 hotel H Direct
this is my python
code:
import pandas as pd
from datetime import datetime
start_time = datetime.now()
file_path = r'/Users/myuser/Desktop/VLOOKUP_TEST/hotel_data.xlsx'
main_table = pd.read_excel(file_path, 'market_segment', header = 0)
print(main_table)
table_to_search = pd.read_excel(file_path, '2018', header = 0)
print ('\nVLOOKUP with loop')
loop_vlookup = []
for k in range(0, len(main_table['hotel name'])):
status = False
for item in range(0, len(table_to_search['hotel Name'])) :
if str(main_table['hotel name'][k]) == table_to_search['hotel Name'][item]:
print('coincidence')
loop_vlookup.append(table_to_search['market_segment'][item])
status = True
break
if not status:
loop_vlookup.append("")
print("\n",loop_vlookup)
# Pandas VLOOKUP
print("\nPandas VLOOKUP")
mapping = {'hotel Name': 'hotel name', 'market_segment' : 'vlookup_column'}
out = main_table.merge(table_to_search.rename(columns=mapping)[['hotel name','vlookup_column']], on = 'hotel name')
print(out)
How could I modify the pandas merge
to get only the first match?
here you can find the excel file: https://drive.google.com/drive/folders/1bbXQtRvWDq84rOf2xBwRkVvSOVULF2Jt?usp=sharing
CodePudding user response:
Maybe this is what you need. Also, never use for loop in pandas. Read more on pandas merge
#Merge based on "hotel name" key
main_table = pd.read_excel(file_path, 'market_segment', header = 0)
ref_table = pd.read_excel(file_path, '2018', header = 0)
df = pd.merge(main_table, ref_table, on="hotel name", how="left")
#keep only first results
df = df.drop_duplicates(subset=["Discount", "hotel name"], keep="first")
CodePudding user response:
With your out
new = out.drop_duplicates(['discount', 'hotel name'])