I combined three dataframes that have multiple email
columns, then renamed the columns for ease of scripting. I'm trying to create a master email column based on conditions. If C
is populated, use C
. If not, and B
is populated, use B
. If not, and A
is populated, use A
. For some reason it's only filling in the master column if C
is populated. Even if A
is populated, it's returning a null value.
dfs = [df_A, d365, df_C]
from functools import reduce
dfFinal = reduce(lambda left,right: pd.merge(left,right,on='leadId'), dfs)
import numpy as np
emailconditions = [
dfFinal['EmailC'] is not None,
(dfFinal['EmailC'] is None) & (dfFinal['EmailB'] is not None),
(dfFinal['EmailC'] is None) & (dfFinal['EmailB'] is None)]
emailvalues = [
dfFinal['EmailC'],
dfFinal['EmailB'],
dfFinal['EmailA']]
dfFinal['emailFinal'] = np.select(emailconditions, emailvalues)
CodePudding user response:
Try setting emailconditions
like so:
emailconditions = [
dfFinal['EmailC'].notna(),
dfFinal['EmailC'].isna() & dfFinal['EmailB'].notna(),
dfFinal['EmailC'].isna() & dfFinal['EmailB'].isna()]
Key point is to use notna()
instead of is not None
and isna()
instead of is None
.
CodePudding user response:
Here's an alternative solution, which just grabs the first truthy value out of the columns in whichever order you give priority:
In [3]: df
Out[3]:
a b c
0 x w None
1 None y None
2 k None z
In [4]: order = ("c", "b", "a")
In [5]: df.apply(lambda row: next(row[col] for col in order if row[col]), axis=1)
Out[5]:
0 w
1 y
2 z
dtype: object
If you anticipate having rows where none of the columns have a value, then you'd probably want something like this:
def first_truthy(row, order):
try:
return next(row[col] for col in order if row[col])
except StopIteration:
return None
Output:
In [7]: df
Out[7]:
a b c
0 x w None
1 None y None
2 k None z
3 None None None
In [8]: df.apply(lambda row: first_truthy(row, order), axis=1)
Out[8]:
0 w
1 y
2 z
3 None
dtype: object
Not that this is likely much slower than boolean masking, but (in my opinion) easier to reason about and debug, and doesn't require the extra dependency on numpy
. If you need performance, @richardec's solution is likely vastly superior, though I have not benchmarked our solutions.