What is the most efficient way to filter this dataframe based on a list of items?


I have a dataframe with the following format:

ORDER_NUMBER | PRODUCT1 | PRODUCT2  | PRODUCT3  | ...
ORDER1       | 1.0      | 0.0       | 0.0       | ...
ORDER2       | 0.0      | 0.0       | 0.0       | ...
ORDER3       | 0.0      | 1.0       | 0.0       | ...
ORDER4       | 0.0      | 0.0       | 0.0       | ...
ORDER5       | 0.0      | 1.0       | 1.0       | ...
...

In essence, I have one-hot encoded all products per order, so that it is clear that, e.g., ORDER1 contains PRODUCT1 but not PRODUCT2 or PRODUCT3, etc.

I also have a list of products that I am interested in, for example:

interesting_products = ['PRODUCT1', 'PRODUCT3']

What I want to do is keep only the orders (rows) where at least one of the products in my interesting_products list appears (i.e., the value is 1.0 instead of 0.0).

So the resulting dataframe should be:

ORDER_NUMBER | PRODUCT1 | PRODUCT2  | PRODUCT3  | ...
ORDER1       | 1.0      | 0.0       | 0.0       | ...
ORDER5       | 0.0      | 1.0       | 1.0       | ...
...

My problem is that this file is huge (a few GBs). What is the most efficient way of filtering my dataframe?
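
For reference, here is a small toy version of the dataframe, built with pandas (the values are made up for illustration and are not my real data):

import pandas as pd

# Toy version of the one-hot encoded order/product table (illustrative values only)
df = pd.DataFrame(
    {
        'ORDER_NUMBER': ['ORDER1', 'ORDER2', 'ORDER3', 'ORDER4', 'ORDER5'],
        'PRODUCT1': [1.0, 0.0, 0.0, 0.0, 0.0],
        'PRODUCT2': [0.0, 0.0, 1.0, 0.0, 1.0],
        'PRODUCT3': [0.0, 0.0, 0.0, 0.0, 1.0],
    }
).set_index('ORDER_NUMBER')

interesting_products = ['PRODUCT1', 'PRODUCT3']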

CodePudding user response:

You can select only the columns in the list, sum across each row, test whether the sum is at least 1, and filter with boolean indexing:

df = df[df[interesting_products].sum(axis=1).ge(1)]
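
This keeps every order in which at least one of the interesting product columns is non-zero. Since the file is a few GBs, a possible variant (just a sketch; the question does not say how the data is stored, so a CSV file named orders.csv is assumed here) is to apply the same mask chunk by chunk while reading, so the whole file never has to sit in memory at once:

import pandas as pd

interesting_products = ['PRODUCT1', 'PRODUCT3']

# Read the (hypothetical) CSV in chunks and keep only the rows where at least
# one of the interesting product columns is non-zero, then combine the survivors.
filtered_chunks = []
for chunk in pd.read_csv('orders.csv', chunksize=100_000):
    mask = chunk[interesting_products].sum(axis=1).ge(1)
    filtered_chunks.append(chunk[mask])

df = pd.concat(filtered_chunks, ignore_index=True)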