Set value based on if Column A value exists in Column B-CodePudding

I have a dataframe with several columns, and I'm setting up a lambda function to set another column based on a series of logic. The one that I'm stuck on is if Column A's value exists in Column B.

Dataframe looks like this:

Supplier	Department	Option 1	Option 2
Acme	Acme	Monday	Tuesday
Acme	Acme	Wednesday
Acme	HR	Monday	Wednesday
Acme	HR	Tuesday	Wednesday
Acme	HR	Tuesday	Thursday

Etc.

I'm wanting to add a Column with text to keep a certain row or not.

full_quote_df['Keep Row'] = full_quote_df.apply(
                lambda item:
                   'Keep Row' if item.Supplier == item.Department and pd.notna(item['Option 1']) else \ #If supplier and department match, and the have an option, keep
                   'Do Not Keep Row' if item.Supplier == item.Department and pd.isna(item['Option 1'])  else \ #If they leave it blank, exclude it
                   'Keep Row' if item.Supplier != item.Department and item['Option 2'].isin(list(item['Option 1'].values()))  else \ #If the supplier does not match the department, but the second option exists in the first option, keep row
                   'Do Not Keep Row' if item.Supplier != item.Department and item['Option 2'].notin(item['Option 1'])  else \#If the supplier does not match the department, but the second option does not exist in the first option, remove row
                   None,
                axis=1
            )

So my final dataframe should look like this:

Supplier	Department	Option 1	Option 2	Keep Row
Acme	Acme	Monday	Tuesday	Keep
Acme	Acme	Wednesday		Keep
Acme	HR	Monday	Wednesday	Keep
Acme	HR	Tuesday	Wednesday	Keep
Acme	HR	Tuesday	Thursday	Do Not Keep

With my above function, it gives me an error 'str' object has no attribute 'isin'. I don't want to turn the Option 2 column into a list, since I need to evaluate that specific row relative to the entire column, but I'm stumped with what to do next.

CodePudding user response：

You can use numpy.where with pandas.Series.isin :

m = (df["Option 2"].isna()) | (df["Option 2"].isin(df["Option 1"]))
    
df["Keep Row"] = np.where(m, "Keep", "Do Not Keep")

# Output :

print(df)
  Supplier Department   Option 1   Option 2     Keep Row
0     Acme       Acme     Monday    Tuesday         Keep
1     Acme       Acme  Wednesday        NaN         Keep
2     Acme         HR     Monday  Wednesday         Keep
3     Acme         HR    Tuesday  Wednesday         Keep
4     Acme         HR    Tuesday   Thursday  Do Not Keep

CodePudding user response：

You're going to have a much better time with pandas if you switch to thinking about operating on columns (series) as a whole instead of on individual elements.

You can build up Series that hold intermediate results so they aren't recomputed each time, and it can make the logic easier to read also.

supplier_dept_match = full_quote_df['Supplier'] == full_quote_df['Department']
opt1_is_na = full_quote_df['Option 1'].isna()

# Note: the following logic is a bit unclear from your question.
# This approach will look for the option 2 values in the entire option 1 column.
# If you want to look for a certain option 1 subset, then you should clarify how this should behave.
opt2_in_any_opt1 = full_quote_df['Option 2'].isin(full_quote_df['Option 1'])

# Then, you can build up your logic all at once with boolean expressions:
keep = (supplier_dept_match & ~opt1_is_na) | (~supplier_dept_match & opt2_in_any_opt1)
full_quote_df['Keep Row'] = keep.map({True: "Keep", False: "Do Not Keep"})

Hopefully this helps!