I have two dataframes of unequal size with examples below. We will refer to the first dataframe as df1:
Miles AB Param1 Param2 Param3 Param4
1.5 A 0.12345 0.12345 0.12345 0.12345
1.7 B 0.12345 0.12345 0.12345 0.12345
1.9 A 0.12345 0.12345 0.12345 0.12345
2.6 A 0.12345 0.12345 0.12345 0.12345
2.7 B 0.12345 0.12345 0.12345 0.12345
3.5 B 0.12345 0.12345 0.12345 0.12345
5.6 A 0.12345 0.12345 0.12345 0.12345
7.8 B 0.12345 0.12345 0.12345 0.12345
9.9 B 0.12345 0.12345 0.12345 0.12345
Then a similar dataframe that has some similar information, but has a different column parameter AND is smaller than the dataframe above, for example as below, which we will refer to as df2:
Miles AB Param5
1.9 A 56
2.5 A 69
3.4 B 42
5.6 A 41
What I am trying to do is add Param5 where applicable to the first table, with the conditions that the values in AB are the same AND the mile difference between the two files is no more than 0.1 (ideally the miles are equal, but based on the original sources of these files, sometimes they can be off by 0.1).
Miles AB Param1 Param2 Param3 Param4 Param5
1.5 A 0.12345 0.12345 0.12345 0.12345
1.7 B 0.12345 0.12345 0.12345 0.12345
1.9 A 0.12345 0.12345 0.12345 0.12345 56
2.6 A 0.12345 0.12345 0.12345 0.12345 69
2.7 B 0.12345 0.12345 0.12345 0.12345
3.5 B 0.12345 0.12345 0.12345 0.12345 42
5.6 A 0.12345 0.12345 0.12345 0.12345 41
7.8 B 0.12345 0.12345 0.12345 0.12345
9.9 B 0.12345 0.12345 0.12345 0.12345
I am trying to use an np.where function like below but it obviously doesn't work in its current state given that the two dataframes have unequal numbers of rows:
df1['Param5'] = np.where(((df2['AB']==df1['AB']) & (abs(df2['Miles']-df1['Miles'])<=0.1)), df2['Param5'],'')
This then returns ValueError: Can only compare identically-labeled Series objects
Any assistance would be much appreciated how I could get the two dataframes indexed together with the condition that the AB values are the same and the mile difference is no more than 0.1.
Thank you in advance!
CodePudding user response:
Use a merge_asof
:
(pd.merge_asof(df1.reset_index().sort_values(by='Miles'),
df2.sort_values(by='Miles'),
by='AB', on='Miles',
direction='nearest', tolerance=0.1)
.set_index('index').sort_index()
)
Output:
Miles AB Param1 Param2 Param3 Param4 Param5
0 1.5 A 0.12345 0.12345 0.12345 0.12345 NaN
1 1.7 B 0.12345 0.12345 0.12345 0.12345 NaN
2 1.9 A 0.12345 0.12345 0.12345 0.12345 56.0
3 2.6 A 0.12345 0.12345 0.12345 0.12345 NaN
4 2.7 B 0.12345 0.12345 0.12345 0.12345 NaN
5 3.5 B 0.12345 0.12345 0.12345 0.12345 NaN
6 5.6 A 0.12345 0.12345 0.12345 0.12345 41.0
7 7.8 B 0.12345 0.12345 0.12345 0.12345 NaN
8 9.9 B 0.12345 0.12345 0.12345 0.12345 NaN
Output with tolerance=0.10001
:
Miles AB Param1 Param2 Param3 Param4 Param5
0 1.5 A 0.12345 0.12345 0.12345 0.12345 NaN
1 1.7 B 0.12345 0.12345 0.12345 0.12345 NaN
2 1.9 A 0.12345 0.12345 0.12345 0.12345 56.0
3 2.6 A 0.12345 0.12345 0.12345 0.12345 69.0
4 2.7 B 0.12345 0.12345 0.12345 0.12345 NaN
5 3.5 B 0.12345 0.12345 0.12345 0.12345 42.0
6 5.6 A 0.12345 0.12345 0.12345 0.12345 41.0
7 7.8 B 0.12345 0.12345 0.12345 0.12345 NaN
8 9.9 B 0.12345 0.12345 0.12345 0.12345 NaN