I have a dataframe with data from 2 sources stacked (unsorted). Is there a way to identify rows which are unique (have no counterpart in other data source) or where the difference in value between both sources is above a tolerance level?
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': ['2017-02-22', '2019-05-07', '2019-05-07', '2018-01-01', '2020-03-10', '2020-03-10'],
'identifier': ['A', 'A', 'A', 'A', 'A', 'A'],
'value': [123, 456, 455, 678, 999.9876, 900.1234],
'source': ['x', 'x', 'y', 'y', 'x', 'y']
})
tolerance = {'A': 2.50}
key = ['date', 'identifier']
df.sample(frac=1) # to make it little more difficult, shuffle rows
date identifier value source
0 2017-02-22 A 123.0000 x # NOK - unique
1 2019-05-07 A 456.0000 x # OK - within tolerance
2 2019-05-07 A 455.0000 y # OK - within tolerance
3 2018-01-01 A 678.0000 y # NOK - unique
4 2020-03-10 A 999.9876 x # NOK - outside tolerance
5 2020-03-10 A 900.1234 y # NOK - outside tolerance
Expected output
df.iloc[[0,3,4,5]]
date identifier value source
0 2017-02-22 A 123.0000 x
3 2018-01-01 A 678.0000 y
4 2020-03-10 A 999.9876 x
5 2020-03-10 A 900.1234 y
Implementation
df['new_value'] = np.where(df['source'] == 'y', -1 * df['value'], df['value'])
df.groupby(key).agg({'source': 'count', 'new_value': 'sum'})
Out[57]:
source new_value
date identifier
2017-02-22 A 1 123.0000
2018-01-01 A 1 -678.0000
2019-05-07 A 2 1.0000
2020-03-10 A 2 99.8642
I do not know how to proceed further.
CodePudding user response:
You can use a self merge_asof
:
df2 = df.sort_values(by='value')
out = pd.merge_asof(df2, df2.reset_index(),
by=['date', 'identifier'], on='value',
allow_exact_matches=False, direction='nearest',
tolerance=2.5).dropna(subset='index')
new_df = df.drop(out['index'])
output:
date identifier value source
0 2017-02-22 A 123.0000 x
3 2018-01-01 A 678.0000 y
4 2020-03-10 A 999.9876 x
5 2020-03-10 A 900.1234 y