I have a column, 'state', that has the values 'failed', 'successful', and two or three other values.
I am trying to create a dataframe with only the rows that contain 'failed' and 'successful' in the 'state' column.
I have implemented the following code:
df = df[df['state'].str.contains('failed' or 'successful', na = False)]
but I am only receiving 'failed' rows, not 'successful'.
Any suggestions? I have used this same format on other datasets with success.
CodePudding user response:
Because ("failed" or "successful") == "failed". See the Python documentation on the short-circuit behavior of boolean operators.
CodePudding user response:
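The issue is that the expression "failed" or "successful" evaluates to "failed", since the non-empty string "failed" is truthy and or returns its first truthy operand. As a result, str.contains only ever receives the pattern "failed". Read this question to learn why this happens.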
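The evaluation described above can be checked in plain Python, without pandas. This is only a small sketch; the variable names pattern and fallback are made up for illustration.

```python
# `or` returns its first truthy operand; it does not produce a boolean
# and it does not somehow keep "both" strings.
pattern = "failed" or "successful"
print(pattern)  # failed

# An empty string is falsy, so here `or` falls through to the second operand.
fallback = "" or "successful"
print(fallback)  # successful
```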
What you really need to do is evaluate the column against two conditions, str.contains("failed") and str.contains("successful"), and combine the results. You can do this with the | operator, which performs an element-wise OR on the two boolean Series:
df[df["state"].str.contains("failed", na=False) | df["state"].str.contains("successful", na=False)]
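Here is a self-contained sketch of the same approach on a small made-up frame. The id column and the state values other than 'failed' and 'successful' are assumptions for illustration, since the question does not list them.

```python
import pandas as pd

# Hypothetical sample data; only 'failed' and 'successful' come from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "state": ["failed", "successful", "canceled", None, "live"],
})

# Each str.contains call returns a boolean Series;
# na=False turns missing values into False instead of NaN.
is_failed = df["state"].str.contains("failed", na=False)
is_successful = df["state"].str.contains("successful", na=False)

# | combines the two masks element-wise (logical OR).
filtered = df[is_failed | is_successful]
print(filtered)
```

Naming the two masks separately is optional, but it makes it easier to inspect each condition on its own when the result looks wrong.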
EDIT: As Henry mentioned below, you can get a more succinct answer by passing a regex to Series.str.contains, where | inside the pattern means "or":
df[df["state"].str.contains("failed|success", na=False)]
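One thing to keep in mind is that str.contains matches substrings, so the regex also keeps any value that merely contains one of the words. The sketch below compares it with Series.isin, which is not part of the answers above and is offered only as a possible alternative when the values must match exactly. The sample values, including 'unsuccessful', are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; 'unsuccessful' is invented to show the difference.
df = pd.DataFrame(
    {"state": ["failed", "successful", "unsuccessful", "canceled", None]}
)

# Regex alternation: keeps every value that contains either substring.
by_regex = df[df["state"].str.contains("failed|successful", na=False)]

# Exact membership: keeps only values equal to one of the listed strings.
by_isin = df[df["state"].isin(["failed", "successful"])]

print(by_regex["state"].tolist())  # ['failed', 'successful', 'unsuccessful']
print(by_isin["state"].tolist())   # ['failed', 'successful']
```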