Home > Enterprise >  How to filter my dataframe in specific way with another dataframe?
How to filter my dataframe in specific way with another dataframe?

Time:04-15

I have a dataframe df1:

id1   id2
a1    b1  
c1    d1 
e1    d1
g1    h1   

and df2:

id    value
a1     10
b1     9
c1     7
d1     11
e1     12
g1     5
h1     8

I want to keep rows from df1 only if their values from value column in df2 differ (gap) no higher than 1. So desired output is:

id1   id2
a1    b1  
e1    d1   

row c1 d1 was removed since gap between 7 and 11 is higher than 1. same thing with g1 h1. How to do that?

CodePudding user response:

Here's one way using boolean indexing. The idea is to stack the Ids' in df1 get its corresponding values from df2, then filter the rows where the difference is less than 1:

out = df1.loc[df1.stack().map(df2.set_index('id')['value']).droplevel(-1).groupby(level=0).diff().abs().dropna().le(1).pipe(lambda x: x[x].index)]

Output:

  id1 id2
0  a1  b1
2  e1  d1

CodePudding user response:

IIUC:

df1[df1.applymap(df2.set_index('id').value.get).eval('abs(id1 - id2)').le(1)]

  id1 id2
0  a1  b1
2  e1  d1

Longer Answer

# Callable I'll need in `applymap`
# it basically translates `df2` into
# a function that returns `'value'`
# when you pass `'id'`
c = df2.set_index('id').value.get

# `applymap` applies a callable to each dataframe cell
df1_applied = df1.applymap(c)
print(df1_applied)

   id1  id2
0   10    9
1    7   11
2   12   11
3    5    8

# `eval` takes a string argument that describes what
# calculation to do.  See docs for more
df1_applied_evaled = df1_applied.eval('abs(id1 - id2)')
print(df1_applied_evaled)

0    1
1    4
2    1
3    3
dtype: int64

# now just boolean slice your way to the end
df1[df1_applied_evaled.le(1)]

  id1 id2
0  a1  b1
2  e1  d1

CodePudding user response:

It's easy and intuitive to do this with datar, a re-imagining of pandas APIs:

>>> from datar.all import f, tibble, left_join, mutate, abs, filter, select
>>> 
>>> df1 = tibble(
...     id1=["a1", "c1", "e1", "g1"],
...     id2=["b1", "d1", "d1", "h1"],
... )
>>> 
>>> df2 = tibble(
...     id=["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
...     value=[10, 9, 7, 11, 12, 5, 8],
... )
>>> 
>>> (
...     df1 
...     >> left_join(df2, by={"id1": f.id})  # get the values of id1
...     >> left_join(df2, by={"id2": f.id})  # get the values of id2
...     >> mutate(diff=abs(f.value_x - f.value_y))  # calculate the diff
...     >> filter(f.diff <= 1)  # filter with diff <= 1
...     >> select(f.id1, f.id2)  # keep only desired columns
... )
       id1      id2
  <object> <object>
0       a1       b1
2       e1       d1
  • Related