I'm trying to add a column to my df that returns 'yes' if B < A and 'no' if B >= A. However, if either A or B contains a missing value, np.nan should be returned.
My desired output would therefore look something like this:
A | B | is_less |
---|---|---|
np.nan | 10 | np.nan |
10 | np.nan | np.nan |
1 | 5 | no |
5 | 1 | yes |
Problem: my code doesn't return np.nan when required.
What I have tried, Option 1:
df['is_less'] = np.where(df['B'] < df['A'], "yes", "no")
df['is_less'] = np.where(df['A'] == np.nan, np.nan, df['is_less'])
df['is_less'] = np.where(df['B'] == np.nan, np.nan, df['is_less'])
Unfortunately, the np.nans in columns A and B are ignored, resulting in incorrect values in 'is_less'.
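I suspect this is because an equality comparison against np.nan never matches anything, not even another NaN; a minimal check illustrates it:
import numpy as np

print(np.nan == np.nan)  # False -- equality never matches NaN
print(np.isnan(np.nan))  # True  -- isnan/isna is how to detect it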
Option 2:
def reduced_growth(x):
    if (x['A'] == np.nan or x['B'] == np.nan):
        return np.nan
    elif (x['B'] < x['A']):
        return "yes"
    elif (x['B'] >= x['A']):
        return "no"
    else:
        return "0"

# create new feature using the function
df['is_less'] = df.apply(reduced_growth, axis=1)
Applying this function results in a mixture of 'yes', 'no' and '0' (the rows with missing values fall through to the else branch), but no np.nans are returned.
How could I solve this problem?
CodePudding user response:
You can use the nullable dtypes introduced in pandas 1.0, which properly handle missing values:
import pandas as pd

df = pd.DataFrame({'a': [1, None, 3, 5], 'b': [2, 1, None, 2]})
df = df.convert_dtypes()           # convert columns to nullable dtypes
df['is_less'] = df['a'] < df['b']  # comparison propagates <NA>
print(df)
See https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na
result:
a b is_less
0 1 2 True
1 <NA> 1 <NA>
2 3 <NA> <NA>
3 5 2 False
You can also use pd.array to directly create a DataFrame with the new dtypes:
df = pd.DataFrame({
    'a': pd.array([1, None, 3, 5]),
    'b': pd.array([2, 1, None, 2]),
})
df['is_less'] = df['a'] < df['b']
print(df)
a b is_less
0 1 2 True
1 <NA> 1 <NA>
2 3 <NA> <NA>
3 5 2 False
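If you want the literal 'yes'/'no' strings from the question rather than booleans, you can map the nullable boolean result; a minimal sketch, assuming the same column names as above:
df['is_less'] = (df['a'] < df['b']).map({True: 'yes', False: 'no'})  # <NA> rows stay missing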
CodePudding user response:
You can use a simple comparison and a mask:
df['is_less'] = df['A'].gt(df['B']).mask(df[['A', 'B']].isna().any(axis=1)).map({True: 'yes', False: 'no'})
output:
A B is_less
0 NaN 10.0 NaN
1 10.0 NaN NaN
2 1.0 5.0 no
3 5.0 1.0 yes
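For readability, the same chain can be split into steps (a sketch of the identical logic):
is_less = df['A'].gt(df['B'])                              # True where B < A
is_less = is_less.mask(df[['A', 'B']].isna().any(axis=1))  # NaN where A or B is missing
df['is_less'] = is_less.map({True: 'yes', False: 'no'})    # NaN stays NaN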
CodePudding user response:
Try rewriting your np.where statement:
df['is_less'] = np.where(df['A'].isnull() | df['B'].isnull(), np.nan,  # check if A or B is np.nan
                         np.where(df['B'].ge(df['A']), 'no', 'yes'))   # check if B >= A
prints:
A B is_less
0 NaN 10.0 nan
1 10.0 NaN nan
2 1.0 5.0 no
3 5.0 1.0 yes
(Series.ge is pandas' element-wise "greater than or equal" comparison.)
CodePudding user response:
You can use np.select to do it with a list of conditions and a corresponding list of values, as follows:
conditions = [
    df['A'].isnull() | df['B'].isnull(),  # either value missing
    df['B'] < df['A'],                    # B < A
    df['B'] >= df['A'],                   # B >= A
]
values = [np.nan, "yes", "no"]
df['is_less'] = np.select(conditions, values, "0")
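A quick check with the question's sample data (a sketch; the column names and values are taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 10, 1, 5], 'B': [10, np.nan, 5, 1]})

conditions = [
    df['A'].isnull() | df['B'].isnull(),
    df['B'] < df['A'],
    df['B'] >= df['A'],
]
values = [np.nan, "yes", "no"]
df['is_less'] = np.select(conditions, values, "0")
print(df)
Note that np.select promotes the mixed nan/string choices to a string dtype, so the missing rows show up as the string 'nan' rather than a true NaN.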