What is the most efficient way to filter this dataframe based on a list of items?


I have a dataframe with the following format:

ORDER_NUMBER | PRODUCT1 | PRODUCT2  | PRODUCT3  | ...
ORDER1       | 1.0      | 0.0       | 0.0       | ...
ORDER2       | 0.0      | 0.0       | 0.0       | ...
ORDER3       | 0.0      | 1.0       | 0.0       | ...
ORDER4       | 0.0      | 0.0       | 0.0       | ...
ORDER5       | 0.0      | 1.0       | 1.0       | ...
...

In essence, I have one-hot encoded all products per order, so that it is clear that, e.g., ORDER1 contains PRODUCT1 but not PRODUCT2 or PRODUCT3, etc.

I also have a list of products that I am interested in, for example:

interesting_products = ['PRODUCT1', 'PRODUCT3']

What I want to do is keep only the orders (rows) where at least one of the products in my interesting_products list appears (i.e., the value is 1.0 instead of 0.0).

So the resulting dataframe should be:

ORDER_NUMBER | PRODUCT1 | PRODUCT2  | PRODUCT3  | ...
ORDER1       | 1.0      | 0.0       | 0.0       | ...
ORDER5       | 0.0      | 1.0       | 1.0       | ...
...

My problem is that this file is huge (a few GBs). What is the most efficient way of filtering my dataframe?
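
For reference, here is a small toy version of the dataframe, built with pandas (the values are made up for illustration and are not my real data):

import pandas as pd

# Toy version of the one-hot encoded order/product table (illustrative values only)
df = pd.DataFrame(
    {
        'ORDER_NUMBER': ['ORDER1', 'ORDER2', 'ORDER3', 'ORDER4', 'ORDER5'],
        'PRODUCT1': [1.0, 0.0, 0.0, 0.0, 0.0],
        'PRODUCT2': [0.0, 0.0, 1.0, 0.0, 1.0],
        'PRODUCT3': [0.0, 0.0, 0.0, 0.0, 1.0],
    }
).set_index('ORDER_NUMBER')

interesting_products = ['PRODUCT1', 'PRODUCT3']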

CodePudding user response:

You can select only the columns in the list, sum across each row, test whether the sum is at least 1, and filter with boolean indexing:

df = df[df[interesting_products].sum(axis=1).ge(1)]
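
This keeps every order in which at least one of the interesting product columns is non-zero. Since the file is a few GBs, a possible variant (just a sketch; the question does not say how the data is stored, so a CSV file named orders.csv is assumed here) is to apply the same mask chunk by chunk while reading, so the whole file never has to sit in memory at once:

import pandas as pd

interesting_products = ['PRODUCT1', 'PRODUCT3']

# Read the (hypothetical) CSV in chunks and keep only the rows where at least
# one of the interesting product columns is non-zero, then combine the survivors.
filtered_chunks = []
for chunk in pd.read_csv('orders.csv', chunksize=100_000):
    mask = chunk[interesting_products].sum(axis=1).ge(1)
    filtered_chunks.append(chunk[mask])

df = pd.concat(filtered_chunks, ignore_index=True)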