I am a beginner in numpy and I have two csv files that look like this:
csv1:
ID | item_size | Cost |
---|---|---|
0010 | 4.4 | |
0010 | 5.5 | |
0012 | 8 | |
0012 | 10.1 |
csv2:
ID | item_size | Cost |
---|---|---|
0010 | 6.1 | 5 |
0010 | 7.2 | 2 |
0010 | 5.3 | 1 |
0010 | 3.1 | 3 |
0010 | 4.7 | 2 |
0012 | 7.6 | 5 |
0012 | 22 | 4 |
0012 | 13.1 | 2 |
0012 | 9.2 | 3 |
0012 | 11.1 | 3 |
The output should look like this: csv_output:
ID | item_size | Cost |
---|---|---|
0010 | 4.4 | 2 |
0010 | 5.5 | 1 |
0012 | 8 | 5 |
0012 | 10.1 | 3 |
"The closest value for item_size for 4.4 is 4.7, they also have the same ID so the Cost column is filled with a 2."
The task is to match the ID between the csv1 and csv2. Also to get the closest item size and add the cost. Mantaining the csv1 as the output csv.
So I have to fill the missing cost of csv1 based on the nearest item_size value between csv1 and csv2. Also, they have to belong to the same ID. I tried separating this problem into several tasks such as: nearest value problem, similar ID problem and filling the cost problem. Then, to combine all of them.
I already solved the nearest value problem. Using this script:
import numpy as np
def closest_value(input_list, input_value):
arr = np.asarray(input_list)
i = (np.abs(arr - input_value)).argmin()
return arr[i]
list1 = dataset_1['item_size'].values.tolist()
list2 = dataset_2['item_size'].values.tolist()
for i in range(len(list2)):
val=closest_value(list1,list2[i])
print("The closest value to the " str(list[i]) " is",val)
I also tried solving the similar ID problem using this:
import pandas as pd
csv1 = pd.read_csv('csv1.csv')
csv2= pd.read_csv('csv2.csv')
csv2[csv2['ID'].duplicated(keep=False)]
csv2.groupby('ID',axis=0).apply(lambda x: x.to_csv(str(x.name) '.csv'))
However, this method creates csv files based on ID similarity which is too computational resourceful. if there is another method to solve this, I will gladly take it. I have been trying to solve this problem for at least 5 days so any help is appreciated. Best Regards
CodePudding user response:
You can use a merge_asof
:
# save the index to restore it later
# sort the data for the merge_asof (required)
(pd.merge_asof(df1.reset_index().sort_values(by='item_size').drop(columns='Cost'),
df2.sort_values(by='item_size'),
by='ID', on='item_size', direction='nearest'
)
.set_index('index').sort_index() # restore original order
)
Output:
ID item_size Cost
index
0 10 4.4 2
1 10 5.5 1
2 12 8.0 5
3 12 10.1 3