I have two excel files, I want to perform vlookup and find difference of costs using python or even excel.
My files look like this
source_data.xlsx contains contains distance covered and their price, example distance range from 1 to 100 should be charged 4800 and distance range from 101 to 120 should be charged 5100.
DISTANCE COST
1-100 4800
101-120 5100
121-140 5500
141-160 5900
161-180 6200
181-200 6600
210-220 6900
221-240 7200
Analysis.xlsx
loading_station distance_travel total_cost status
PUGU 40 4000 PAID
PUGU 80 3200 PAID
MOROGORO 50 5000 PAID
MOROGORO 220 30400 PAID
DODOMA 150 5100 PAID
KIGOMA 90 2345 PAID
DODOMA 230 6000 PAID
DODOMA 180 16500 PAID
KIGOMA 32 3000 PAID
DODOMA 45 6000 PAID
DODOMA 65 5000 PAID
KIGOMA 77 1000 PAID
KIGOMA 90 4000 PAID
Actual Cost for distance is given in source_data.xlsx
, I want to check cost in Analysis.xlsx
if it correspond to Actual value, I want to detect underpayment and overpayment.
Desired Output should be like this, with two column added, source_cost
which is taken from source_xlsx
by using vlookup
and difference which is difference between total_cost
and source_cost
loading_station distance_travel total_cost status source_cost Difference
PUGU 40 4000 PAID 4800 -800
PUGU 80 3200 PAID 4800 -1600
MOROGORO 50 5000 PAID 4800 200
MOROGORO 220 30400 PAID 6900 23500
DODOMA 150 5100 PAID 5900 -800
KIGOMA 90 2345 PAID 4800 -2455
DODOMA 230 6000 PAID 7200 -1200
DODOMA 180 16500 PAID 6200 10300
KIGOMA 32 3000 PAID 4800 -1800
DODOMA 45 6000 PAID 4800 1200
DODOMA 65 5000 PAID 4800 200
KIGOMA 77 1000 PAID 4800 -3800
KIGOMA 90 4000 PAID 4800 -800
My code so far
# import pandas
import pandas as pd
# read excel data
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('analysis.xlsx')
source_data.head(5)
analysis_file.head(5)
CodePudding user response:
Since it is a categorical bins problem, I suggest utilizing cut()
and find the corresponding value.
import pandas as pd
# create bins
bh = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,0]
bt = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,1]
bins = pd.IntervalIndex.from_arrays(bh, bt, closed='both')
print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')
As it shown, IntervalIndex
, dtype='interval[int64, both]'
# find corresponding values
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)
# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']
print(df_analysis)
###
loading_station | distance_travel | total_cost | status | source_cost | Difference |
---|---|---|---|---|---|
PUGU | 40 | 4000 | PAID | 4800 | -800 |
PUGU | 80 | 3200 | PAID | 4800 | -1600 |
MOROGORO | 50 | 5000 | PAID | 4800 | 200 |
MOROGORO | 220 | 30400 | PAID | 6900 | 23500 |
DODOMA | 150 | 5100 | PAID | 5900 | -800 |
KIGOMA | 90 | 2345 | PAID | 4800 | -2455 |
DODOMA | 230 | 6000 | PAID | 7200 | -1200 |
DODOMA | 180 | 16500 | PAID | 6200 | 10300 |
KIGOMA | 32 | 3000 | PAID | 4800 | -1800 |
DODOMA | 45 | 6000 | PAID | 4800 | 1200 |
DODOMA | 65 | 5000 | PAID | 4800 | 200 |
KIGOMA | 77 | 1000 | PAID | 4800 | -3800 |
KIGOMA | 90 | 4000 | PAID | 4800 | -800 |
CodePudding user response:
You can use merge_asof
:
source_data["DISTANCE"] = source_data["DISTANCE"].str.split("-").str[1].astype("int64")
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
source_data,
left_on="distance_travel",
right_on="DISTANCE",
direction="forward")
.set_index("index")
.sort_index())
res["Difference"] = res["total_cost"] - res["COST"]
print (res)
loading_station distance_travel total_cost status DISTANCE COST Difference
index
0 PUGU 40 4000 PAID 100 4800 -800
1 PUGU 80 3200 PAID 100 4800 -1600
2 MOROGORO 50 5000 PAID 100 4800 200
3 MOROGORO 220 30400 PAID 220 6900 23500
4 DODOMA 150 5100 PAID 160 5900 -800
5 KIGOMA 90 2345 PAID 100 4800 -2455
6 DODOMA 230 6000 PAID 240 7200 -1200
7 DODOMA 180 16500 PAID 180 6200 10300
8 KIGOMA 32 3000 PAID 100 4800 -1800
9 DODOMA 45 6000 PAID 100 4800 1200
10 DODOMA 65 5000 PAID 100 4800 200
11 KIGOMA 77 1000 PAID 100 4800 -3800
12 KIGOMA 90 4000 PAID 100 4800 -800
Note that this does not take care of 0 distance traveled. You need to handle that separately.