I have a dataframe with several columns, and I'm setting up a lambda function to set another column based on a series of logic. The one that I'm stuck on is if Column A's value exists in Column B.
Dataframe looks like this:
Supplier | Department | Option 1 | Option 2 |
---|---|---|---|
Acme | Acme | Monday | Tuesday |
Acme | Acme | Wednesday | |
Acme | HR | Monday | Wednesday |
Acme | HR | Tuesday | Wednesday |
Acme | HR | Tuesday | Thursday |
Etc.
I'm wanting to add a Column with text to keep a certain row or not.
full_quote_df['Keep Row'] = full_quote_df.apply(
lambda item:
'Keep Row' if item.Supplier == item.Department and pd.notna(item['Option 1']) else \ #If supplier and department match, and the have an option, keep
'Do Not Keep Row' if item.Supplier == item.Department and pd.isna(item['Option 1']) else \ #If they leave it blank, exclude it
'Keep Row' if item.Supplier != item.Department and item['Option 2'].isin(list(item['Option 1'].values())) else \ #If the supplier does not match the department, but the second option exists in the first option, keep row
'Do Not Keep Row' if item.Supplier != item.Department and item['Option 2'].notin(item['Option 1']) else \#If the supplier does not match the department, but the second option does not exist in the first option, remove row
None,
axis=1
)
So my final dataframe should look like this:
Supplier | Department | Option 1 | Option 2 | Keep Row |
---|---|---|---|---|
Acme | Acme | Monday | Tuesday | Keep |
Acme | Acme | Wednesday | Keep | |
Acme | HR | Monday | Wednesday | Keep |
Acme | HR | Tuesday | Wednesday | Keep |
Acme | HR | Tuesday | Thursday | Do Not Keep |
With my above function, it gives me an error 'str' object has no attribute 'isin'. I don't want to turn the Option 2 column into a list, since I need to evaluate that specific row relative to the entire column, but I'm stumped with what to do next.
CodePudding user response:
You can use numpy.where
with pandas.Series.isin
:
m = (df["Option 2"].isna()) | (df["Option 2"].isin(df["Option 1"]))
df["Keep Row"] = np.where(m, "Keep", "Do Not Keep")
# Output :
print(df)
Supplier Department Option 1 Option 2 Keep Row
0 Acme Acme Monday Tuesday Keep
1 Acme Acme Wednesday NaN Keep
2 Acme HR Monday Wednesday Keep
3 Acme HR Tuesday Wednesday Keep
4 Acme HR Tuesday Thursday Do Not Keep
CodePudding user response:
You're going to have a much better time with pandas
if you switch to thinking about operating on columns (series) as a whole instead of on individual elements.
You can build up Series that hold intermediate results so they aren't recomputed each time, and it can make the logic easier to read also.
supplier_dept_match = full_quote_df['Supplier'] == full_quote_df['Department']
opt1_is_na = full_quote_df['Option 1'].isna()
# Note: the following logic is a bit unclear from your question.
# This approach will look for the option 2 values in the entire option 1 column.
# If you want to look for a certain option 1 subset, then you should clarify how this should behave.
opt2_in_any_opt1 = full_quote_df['Option 2'].isin(full_quote_df['Option 1'])
# Then, you can build up your logic all at once with boolean expressions:
keep = (supplier_dept_match & ~opt1_is_na) | (~supplier_dept_match & opt2_in_any_opt1)
full_quote_df['Keep Row'] = keep.map({True: "Keep", False: "Do Not Keep"})
Hopefully this helps!