Home > Blockchain >  Conditionally match two dataframes of unequal size
Conditionally match two dataframes of unequal size

Time:11-24

I have two dataframes of unequal size with examples below. We will refer to the first dataframe as df1:

Miles   AB  Param1  Param2  Param3  Param4
1.5 A   0.12345 0.12345 0.12345 0.12345
1.7 B   0.12345 0.12345 0.12345 0.12345
1.9 A   0.12345 0.12345 0.12345 0.12345
2.6 A   0.12345 0.12345 0.12345 0.12345
2.7 B   0.12345 0.12345 0.12345 0.12345
3.5 B   0.12345 0.12345 0.12345 0.12345
5.6 A   0.12345 0.12345 0.12345 0.12345
7.8 B   0.12345 0.12345 0.12345 0.12345
9.9 B   0.12345 0.12345 0.12345 0.12345

Then a similar dataframe that has some similar information, but has a different column parameter AND is smaller than the dataframe above, for example as below, which we will refer to as df2:

Miles   AB  Param5
1.9 A   56
2.5 A   69
3.4 B   42
5.6 A   41

What I am trying to do is add Param5 where applicable to the first table, with the conditions that the values in AB are the same AND the mile difference between the two files is no more than 0.1 (ideally the miles are equal, but based on the original sources of these files, sometimes they can be off by 0.1).

Miles   AB  Param1  Param2  Param3  Param4  Param5
1.5 A   0.12345 0.12345 0.12345 0.12345 
1.7 B   0.12345 0.12345 0.12345 0.12345 
1.9 A   0.12345 0.12345 0.12345 0.12345 56
2.6 A   0.12345 0.12345 0.12345 0.12345 69
2.7 B   0.12345 0.12345 0.12345 0.12345 
3.5 B   0.12345 0.12345 0.12345 0.12345 42
5.6 A   0.12345 0.12345 0.12345 0.12345 41
7.8 B   0.12345 0.12345 0.12345 0.12345 
9.9 B   0.12345 0.12345 0.12345 0.12345 

I am trying to use an np.where function like below but it obviously doesn't work in its current state given that the two dataframes have unequal numbers of rows:

df1['Param5'] = np.where(((df2['AB']==df1['AB']) & (abs(df2['Miles']-df1['Miles'])<=0.1)), df2['Param5'],'')

This then returns ValueError: Can only compare identically-labeled Series objects

Any assistance would be much appreciated how I could get the two dataframes indexed together with the condition that the AB values are the same and the mile difference is no more than 0.1.

Thank you in advance!

CodePudding user response:

Use a merge_asof:

(pd.merge_asof(df1.reset_index().sort_values(by='Miles'),
               df2.sort_values(by='Miles'),
               by='AB', on='Miles',
               direction='nearest', tolerance=0.1)
   .set_index('index').sort_index()
)

Output:

   Miles AB   Param1   Param2   Param3   Param4  Param5
0    1.5  A  0.12345  0.12345  0.12345  0.12345     NaN
1    1.7  B  0.12345  0.12345  0.12345  0.12345     NaN
2    1.9  A  0.12345  0.12345  0.12345  0.12345    56.0
3    2.6  A  0.12345  0.12345  0.12345  0.12345     NaN
4    2.7  B  0.12345  0.12345  0.12345  0.12345     NaN
5    3.5  B  0.12345  0.12345  0.12345  0.12345     NaN
6    5.6  A  0.12345  0.12345  0.12345  0.12345    41.0
7    7.8  B  0.12345  0.12345  0.12345  0.12345     NaN
8    9.9  B  0.12345  0.12345  0.12345  0.12345     NaN

Output with tolerance=0.10001:

   Miles AB   Param1   Param2   Param3   Param4  Param5
0    1.5  A  0.12345  0.12345  0.12345  0.12345     NaN
1    1.7  B  0.12345  0.12345  0.12345  0.12345     NaN
2    1.9  A  0.12345  0.12345  0.12345  0.12345    56.0
3    2.6  A  0.12345  0.12345  0.12345  0.12345    69.0
4    2.7  B  0.12345  0.12345  0.12345  0.12345     NaN
5    3.5  B  0.12345  0.12345  0.12345  0.12345    42.0
6    5.6  A  0.12345  0.12345  0.12345  0.12345    41.0
7    7.8  B  0.12345  0.12345  0.12345  0.12345     NaN
8    9.9  B  0.12345  0.12345  0.12345  0.12345     NaN
  • Related