I have a dataframe with the following format:
ORDER_NUMBER | PRODUCT1 | PRODUCT2 | PRODUCT3 | ...
ORDER1 | 1.0 | 0.0 | 0.0 | ...
ORDER2 | 0.0 | 0.0 | 0.0 | ...
ORDER3 | 0.0 | 1.0 | 0.0 | ...
ORDER4 | 0.0 | 0.0 | 0.0 | ...
ORDER5 | 0.0 | 1.0 | 1.0 | ...
...
In essence, I have one-hot encoded all products per order so that it is clear that, e.g., ORDER1 contains PRODUCT1, but not PRODUCT2 and PRODUCT3, etc.
I also have a list of products that I am interested in, for example:
interesting_products = ['PRODUCT1', 'PRODUCT3']
What I want to do is keep only the orders (rows) where at least one of the products in my interesting_products
list appears (i.e., there is a '1.0' instead of '0.0').
So the resulting dataframe should be:
ORDER_NUMBER | PRODUCT1 | PRODUCT2 | PRODUCT3 | ...
ORDER1 | 1.0 | 0.0 | 0.0 | ...
ORDER5 | 0.0 | 1.0 | 1.0 | ...
...
My problem is that this file is huge (a few GB). What is the most efficient way to filter my dataframe?
CodePudding user response:
Select only the columns from your list, test whether any of them equals 1, and filter with boolean indexing:

df = df[df[interesting_products].eq(1).any(axis=1)]

Note: summing the columns and testing `.eq(1)` would wrongly drop orders that contain more than one interesting product (the sum would be 2 or more); `.any(axis=1)` keeps every order with at least one match.
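A minimal sketch reproducing the question's example data (the values below are reconstructed from the sample table, not real data):

```python
import pandas as pd

# Reconstruction of the example table from the question.
df = pd.DataFrame({
    "ORDER_NUMBER": ["ORDER1", "ORDER2", "ORDER3", "ORDER4", "ORDER5"],
    "PRODUCT1":     [1.0, 0.0, 0.0, 0.0, 0.0],
    "PRODUCT2":     [0.0, 0.0, 1.0, 0.0, 1.0],
    "PRODUCT3":     [0.0, 0.0, 0.0, 0.0, 1.0],
})

interesting_products = ["PRODUCT1", "PRODUCT3"]

# Keep rows where ANY of the interesting columns equals 1.0.
mask = df[interesting_products].eq(1).any(axis=1)
filtered = df[mask]

print(filtered["ORDER_NUMBER"].tolist())  # ['ORDER1', 'ORDER5']
```

If the file does not fit comfortably in memory, you can apply the same mask chunk by chunk while reading, e.g. with `pd.read_csv(path, chunksize=...)`, filtering each chunk and concatenating the (much smaller) results.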