How to filter my dataframe in specific way with another dataframe?-CodePudding

I have a dataframe df1:

id1   id2
a1    b1  
c1    d1 
e1    d1
g1    h1

and df2:

id    value
a1     10
b1     9
c1     7
d1     11
e1     12
g1     5
h1     8

I want to keep rows from df1 only if their values from value column in df2 differ (gap) no higher than 1. So desired output is:

id1   id2
a1    b1  
e1    d1

row c1 d1 was removed since gap between 7 and 11 is higher than 1. same thing with g1 h1. How to do that?

CodePudding user response：

Here's one way using boolean indexing. The idea is to stack the Ids' in df1 get its corresponding values from df2, then filter the rows where the difference is less than 1:

out = df1.loc[df1.stack().map(df2.set_index('id')['value']).droplevel(-1).groupby(level=0).diff().abs().dropna().le(1).pipe(lambda x: x[x].index)]

Output:

  id1 id2
0  a1  b1
2  e1  d1

CodePudding user response：

IIUC:

df1[df1.applymap(df2.set_index('id').value.get).eval('abs(id1 - id2)').le(1)]

  id1 id2
0  a1  b1
2  e1  d1

Longer Answer

# Callable I'll need in `applymap`
# it basically translates `df2` into
# a function that returns `'value'`
# when you pass `'id'`
c = df2.set_index('id').value.get

# `applymap` applies a callable to each dataframe cell
df1_applied = df1.applymap(c)
print(df1_applied)

   id1  id2
0   10    9
1    7   11
2   12   11
3    5    8

# `eval` takes a string argument that describes what
# calculation to do.  See docs for more
df1_applied_evaled = df1_applied.eval('abs(id1 - id2)')
print(df1_applied_evaled)

0    1
1    4
2    1
3    3
dtype: int64

# now just boolean slice your way to the end
df1[df1_applied_evaled.le(1)]

  id1 id2
0  a1  b1
2  e1  d1

CodePudding user response：

It's easy and intuitive to do this with datar, a re-imagining of pandas APIs:

>>> from datar.all import f, tibble, left_join, mutate, abs, filter, select
>>> 
>>> df1 = tibble(
...     id1=["a1", "c1", "e1", "g1"],
...     id2=["b1", "d1", "d1", "h1"],
... )
>>> 
>>> df2 = tibble(
...     id=["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
...     value=[10, 9, 7, 11, 12, 5, 8],
... )
>>> 
>>> (
...     df1 
...     >> left_join(df2, by={"id1": f.id})  # get the values of id1
...     >> left_join(df2, by={"id2": f.id})  # get the values of id2
...     >> mutate(diff=abs(f.value_x - f.value_y))  # calculate the diff
...     >> filter(f.diff <= 1)  # filter with diff <= 1
...     >> select(f.id1, f.id2)  # keep only desired columns
... )
       id1      id2
  <object> <object>
0       a1       b1
2       e1       d1