I have a dataframe df1:
id1 id2
a1 b1
c1 d1
e1 d1
g1 h1
and df2:
id value
a1 10
b1 9
c1 7
d1 11
e1 12
g1 5
h1 8
I want to keep rows from df1 only if their values from value column in df2 differ (gap) no higher than 1. So desired output is:
id1 id2
a1 b1
e1 d1
row c1 d1 was removed since gap between 7 and 11 is higher than 1. same thing with g1 h1. How to do that?
CodePudding user response:
Here's one way using boolean indexing. The idea is to stack
the Ids' in df1
get its corresponding values from df2
, then filter the rows where the difference is less than 1:
out = df1.loc[df1.stack().map(df2.set_index('id')['value']).droplevel(-1).groupby(level=0).diff().abs().dropna().le(1).pipe(lambda x: x[x].index)]
Output:
id1 id2
0 a1 b1
2 e1 d1
CodePudding user response:
IIUC:
df1[df1.applymap(df2.set_index('id').value.get).eval('abs(id1 - id2)').le(1)]
id1 id2
0 a1 b1
2 e1 d1
Longer Answer
# Callable I'll need in `applymap`
# it basically translates `df2` into
# a function that returns `'value'`
# when you pass `'id'`
c = df2.set_index('id').value.get
# `applymap` applies a callable to each dataframe cell
df1_applied = df1.applymap(c)
print(df1_applied)
id1 id2
0 10 9
1 7 11
2 12 11
3 5 8
# `eval` takes a string argument that describes what
# calculation to do. See docs for more
df1_applied_evaled = df1_applied.eval('abs(id1 - id2)')
print(df1_applied_evaled)
0 1
1 4
2 1
3 3
dtype: int64
# now just boolean slice your way to the end
df1[df1_applied_evaled.le(1)]
id1 id2
0 a1 b1
2 e1 d1
CodePudding user response:
It's easy and intuitive to do this with datar
, a re-imagining of pandas APIs:
>>> from datar.all import f, tibble, left_join, mutate, abs, filter, select
>>>
>>> df1 = tibble(
... id1=["a1", "c1", "e1", "g1"],
... id2=["b1", "d1", "d1", "h1"],
... )
>>>
>>> df2 = tibble(
... id=["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
... value=[10, 9, 7, 11, 12, 5, 8],
... )
>>>
>>> (
... df1
... >> left_join(df2, by={"id1": f.id}) # get the values of id1
... >> left_join(df2, by={"id2": f.id}) # get the values of id2
... >> mutate(diff=abs(f.value_x - f.value_y)) # calculate the diff
... >> filter(f.diff <= 1) # filter with diff <= 1
... >> select(f.id1, f.id2) # keep only desired columns
... )
id1 id2
<object> <object>
0 a1 b1
2 e1 d1