can anyone show me the script to filter 2 or more choices within a column ?
For example:
A B C D E F G
1 x x x x x AAA
3 x x x x x BBB
4 x x x x x AAA
1 x x x x x CCC
4 x x x x x CCC
5 x x x x x AAA
I would want to filter "AAA" and "BBB", in one go. I have tried with:
import pandas as pd
import numpy as np
import openpyxl
from numpy.random import choice
df = pd.read_excel('filepath', sheet_name = 'Sheet1')
df_sample = df.loc[df['G'].str.contains("AAA", 'BBB")]
But it only returns column G with "AAA", but "BBB" is not in.
Please guide me on how to do that.
Thanks so much !
CodePudding user response:
You can use contains
with the |
separator :
df[df.G.str.contains("AAA|BBB")]
CodePudding user response:
You are passing two parameters to the contains function, from the documentation.
Series.str.contains(pat, case=True, flags=0, na=None, regex=True)
Your function is equivalent to
df['G'].str.contains(pat="AAA", case="BBB")
The string "BBB"
is treated as True
in python, so your code is equivalent to-
df['G'].str.contains(pat="AAA", case=True)
That's why you're only getting the results for "AAA"
to get the correct solution you can follow @Nathan Furnal's solution
df[df.G.str.contains("AAA|BBB")]
CodePudding user response:
You could use isin
or contains
, but remember:
isin
checks if each value in the column is contained in a list of arbitrary values.isin
works column-wise and is available for all data types.contains
checks if arbitrary values are contained in each value in the column.contains
works element-wise and makes sense only when dealing with strings (or values that can be represented as strings).
Here's an approach using isin
:
choices_to_keep = ["AAA", 'BBB"]
filtered_df = df.G.isin(choices_to_keep)