I want to select all the rows of a dataset where exactly M columns satisfy a condition based on N columns (where N >= M).
Consider the following dataset
import numpy as np
import pandas as pd
import random
years = [2000, 2001, 2002]
products = ["A", "B", "C"]
num_combos = 10
years = [random.choice(years) for i in range(num_combos)]
products = [random.choice(products) for i in range(num_combos)]
sum_values = list(range(0, num_combos))
sum_values1 = list(range(0, num_combos))
bools = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool1 = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool2 = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool3 = [bool(random.getrandbits(1)) for i in range(num_combos)]
random.shuffle(sum_values)
av_values = [random.randrange(0, num_combos, 1) for i in range(num_combos)]
cols = {"years": years,
"products": products,
"y0": bools,
"y1": bool1,
"y2": bool2,
"y3": bool3,
"value": av_values}
df = pd.DataFrame(cols)
The code below selects conditions where at least one (or more) of the columns (y0, y1, y2, y3) are True. However, I want to select rows where exactly 2 (analogously 1 or 3) of the columns (y0, y1, y2, y3) are True. Of course one could find all the combinations and use that as a mask, but is there a smarter solution?
cols = ["y0", "y1", "y2", "y3"]
# At least one
df1 = df[(df[cols] == True).any(axis=1)]
# All of the columns
df2 = df[(df[cols] == True).all(axis=1)]
CodePudding user response:
You can use sum
to count the number of True (they are implicitly converted to 1s), and eq
to compare to the desired number:
M = 2
df1 = df[df[cols].sum(axis=1).eq(M)]
NB. If you already have booleans, no need to compare to True
.
Output for M=2
:
years products y0 y1 y2 y3 value
2 2000 A False False True True 3
7 2002 C True False False True 0