How to use a pd.DataFrame for filtering a different pd.DataFrame

Time:08-31

I have a pandas DataFrame containing data and another DataFrame where each row can be interpreted as a filter for the data:

data_df = pd.DataFrame([{'a':i, 'b':i} for i in range(30)])

filter_df = pd.DataFrame({'a':[3,4,5], 'b0':[5,6,8], 'b1':[15,10,11]})
filter_df
    a   b0  b1
0   3   5   15
1   4   6   10
2   5   8   11

would mean selecting:

pd.concat([
    data_df[(data_df.a == 3) & data_df.b.between(5, 15)],
    data_df[(data_df.a == 4) & data_df.b.between(6, 10)],
    data_df[(data_df.a == 5) & data_df.b.between(8, 11)],
])

Now what I need is a way of applying all these filters to the data_df and have the resulting DataFrame. One way to do this is with apply:

res = filter_df.apply(
    lambda x: data_df[(data_df.a == x['a']) & data_df.b.between(x['b0'], x['b1'])],
    axis=1)
res = pd.concat(res.tolist())

Notice that for this to work I have to concatenate a list of the results, because res is a Series whose elements may each be None, a pd.Series, or a pd.DataFrame. Is there a "better" way to do this? I was hoping for something like .reset_index(), but I can't seem to find the right way. I'd also be happy with a more elegant alternative to apply. In practice, data_df will have hundreds of thousands or millions of rows, while filter_df I expect to stay below 1000 rows (though usually more than 10), in case that matters for performance.
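For reference, here is a self-contained, runnable version of this apply approach. It uses a small hand-made data_df (not the one above, which is purely illustrative) so that every filter row actually matches something:

```python
import pandas as pd

# Hypothetical sample data, chosen so each filter row matches at least one row
data_df = pd.DataFrame({'a': [3, 3, 4, 5, 7],
                        'b': [13, 2, 9, 10, 1]})
filter_df = pd.DataFrame({'a': [3, 4, 5],
                          'b0': [5, 6, 8],
                          'b1': [15, 10, 11]})

# Apply each filter row to data_df; res is a Series of DataFrames
res = filter_df.apply(
    lambda x: data_df[(data_df.a == x['a']) & data_df.b.between(x['b0'], x['b1'])],
    axis=1)

# Concatenate the per-filter results into one DataFrame
res = pd.concat(res.tolist())
print(res)
```

With this data, the result keeps the rows (a=3, b=13), (a=4, b=9), and (a=5, b=10), with their original indices from data_df.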

CodePudding user response:

You can merge and query:

data_df.merge(filter_df, on='a', how='right').query('b0 <= b <= b1')

Or equivalently, merge and loc filter:

(data_df.merge(filter_df, on='a', how='right')
        .loc[lambda x: x['b'].between(x['b0'], x['b1'])]
)

Output:

   a   b  b0  b1
1  3  13   5  15
2  3   8   5  15
5  4   9   6  10
8  5  10   8  11
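A minimal runnable sketch of this merge-and-query approach, using a small hand-made data_df (the output above comes from different data). Dropping the helper columns b0/b1 afterwards restores the original shape of data_df:

```python
import pandas as pd

# Hypothetical sample data for illustration
data_df = pd.DataFrame({'a': [3, 3, 4, 5, 7],
                        'b': [13, 2, 9, 10, 1]})
filter_df = pd.DataFrame({'a': [3, 4, 5],
                          'b0': [5, 6, 8],
                          'b1': [15, 10, 11]})

out = (data_df.merge(filter_df, on='a', how='right')  # attach b0/b1 to each matching 'a'
              .query('b0 <= b <= b1')                 # keep rows inside the interval
              .drop(columns=['b0', 'b1']))            # restore the original columns
print(out)
```

Note that the right merge duplicates data rows if the same 'a' appears in several filter rows, which is usually what you want here: each filter row is applied independently.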

CodePudding user response:

You can use boolean indexing:

d = filter_df.set_index('a')

# is "a" in filter_df's a?
m1 = data_df['a'].isin(filter_df['a'])
# is b ≥ the matching b0 value in filter_df?
m2 = data_df['b'].ge(data_df['a'].map(d['b0']))
# is b ≤ the matching b1 value in filter_df?
m3 = data_df['b'].le(data_df['a'].map(d['b1']))

# keep if all conditions are True
data_df[m1 & m2 & m3]

Output:

    a   b
13  3  13
23  3   8
24  4   9
25  5  10
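A self-contained sketch of this boolean-indexing approach, again with small hand-made data (the output above comes from different data). One caveat worth noting: set_index('a') followed by map assumes the 'a' values in filter_df are unique; with duplicate keys the lookup would need to be deduplicated first:

```python
import pandas as pd

# Hypothetical sample data; filter_df['a'] must be unique for map to work
data_df = pd.DataFrame({'a': [3, 3, 4, 5, 7],
                        'b': [13, 2, 9, 10, 1]})
filter_df = pd.DataFrame({'a': [3, 4, 5],
                          'b0': [5, 6, 8],
                          'b1': [15, 10, 11]})

d = filter_df.set_index('a')

m1 = data_df['a'].isin(filter_df['a'])           # 'a' has a filter row at all
m2 = data_df['b'].ge(data_df['a'].map(d['b0']))  # b >= matching lower bound
m3 = data_df['b'].le(data_df['a'].map(d['b1']))  # b <= matching upper bound

out = data_df[m1 & m2 & m3]
print(out)
```

For rows whose 'a' has no filter entry (a=7 here), map yields NaN and both comparisons are False, so m1 is technically redundant but makes the intent explicit.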