Reconciliation in pandas: unique and above tolerance rows-CodePudding

I have a dataframe with data from 2 sources stacked (unsorted). Is there a way to identify rows which are unique (have no counterpart in other data source) or where the difference in value between both sources is above a tolerance level?

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'date': ['2017-02-22', '2019-05-07', '2019-05-07', '2018-01-01', '2020-03-10', '2020-03-10'],
        'identifier': ['A', 'A', 'A', 'A', 'A', 'A'],
        'value': [123, 456, 455, 678, 999.9876, 900.1234],
        'source': ['x', 'x', 'y', 'y', 'x', 'y']
        })

tolerance = {'A': 2.50}
key = ['date', 'identifier']
df.sample(frac=1)  # to make it little more difficult, shuffle rows

         date identifier     value source
0  2017-02-22          A  123.0000      x  # NOK - unique
1  2019-05-07          A  456.0000      x  # OK - within tolerance
2  2019-05-07          A  455.0000      y  # OK - within tolerance
3  2018-01-01          A  678.0000      y  # NOK - unique
4  2020-03-10          A  999.9876      x  # NOK - outside tolerance
5  2020-03-10          A  900.1234      y  # NOK - outside tolerance

Expected output

df.iloc[[0,3,4,5]]    
    
         date identifier     value source
0  2017-02-22          A  123.0000      x
3  2018-01-01          A  678.0000      y
4  2020-03-10          A  999.9876      x
5  2020-03-10          A  900.1234      y

Implementation

df['new_value'] = np.where(df['source'] == 'y', -1 * df['value'], df['value'])
df.groupby(key).agg({'source': 'count', 'new_value': 'sum'})


Out[57]: 
                       source  new_value
date       identifier                   
2017-02-22 A                1   123.0000
2018-01-01 A                1  -678.0000
2019-05-07 A                2     1.0000
2020-03-10 A                2    99.8642

I do not know how to proceed further.

CodePudding user response：

You can use a self merge_asof:

df2 = df.sort_values(by='value')
out = pd.merge_asof(df2, df2.reset_index(),
                    by=['date', 'identifier'], on='value',
                    allow_exact_matches=False, direction='nearest',
                    tolerance=2.5).dropna(subset='index')

new_df = df.drop(out['index'])

output:

         date identifier     value source
0  2017-02-22          A  123.0000      x
3  2018-01-01          A  678.0000      y
4  2020-03-10          A  999.9876      x
5  2020-03-10          A  900.1234      y