For example, with the example dataset below, how do I set a filtering condition using jupyter notebook where I can find out who has 2 or more results of "Number of Candies" more than or equal to 3.1?
ID | Name | Number of Candies |
---|---|---|
1 | Jerry | 3.1 |
1 | Jerry | 3 |
1 | Jerry | 2 |
2 | Tom | 6 |
2 | Tom | 5.8 |
3 | Wendy | 3.1 |
3 | Wendy | 3.2 |
Thus, the output after filtering should be
ID | Name | Number of Candies |
---|---|---|
2 | Tom | 6 |
2 | Tom | 5.8 |
3 | Wendy | 3.1 |
3 | Wendy | 3.2 |
CodePudding user response:
If you can use pandas, we could do this in two steps:
Of course, first make sure you're using a numeric dtype in your "Number of Candies" column:
df["Number of Candies"] = df["Number of Candies"].astype(float)
Filter rows where "Number of Candies" is more than or equal to 3.1
df.loc[df["Number of Candies"] >= 3.1]
Then group by "Name" and keep groups larger than or equal to 2
.groupby("Name").filter(lambda group: len(group) >= 2)
You can put both conditions together and get the result you want in a single line:
>>> df.loc[df["Number of Candies"] >= 3.1].groupby("Name").filter(lambda group: len(group) >= 2)
ID Name Number of Candies
3 2 Tom 6.0
4 2 Tom 5.8
5 3 Wendy 3.1
6 3 Wendy 3.2
CodePudding user response:
You can use a boolean mask computed per group:
# is more than 3.1? / get count of True per group / ≥ 2?
mask = df['Number of Candies'].ge(3.1).groupby(df['ID']).transform('sum').ge(2)
df[mask]
output:
ID Name Number of Candies
3 2 Tom 6.0
4 2 Tom 5.8
5 3 Wendy 3.1
6 3 Wendy 3.2