pandas replace np.nan based on multiple conditions


I'm trying to add a column to my df that returns 'yes' if B < A and 'no' if B >= A. However, if either A or B contains a missing value, np.nan should be returned.

My desired output would therefore look something like this:

A       B       is_less
np.nan  10      np.nan
10      np.nan  np.nan
1       5       no
5       1       yes

Problem: my code doesn't return np.nan when required.

What I have tried: option 1:

df['is_less'] = np.where(df['B'] < df['A'], "yes", "no")
df['is_less'] = np.where(df['A'] == np.nan, np.nan, df['is_less'])
df['is_less'] = np.where(df['B'] == np.nan, np.nan, df['is_less'])

Unfortunately the np.nans in columns A and B are ignored, resulting in incorrect values in 'is_less'.
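The underlying issue is that `np.nan == np.nan` is always `False`, so the equality checks in option 1 never match. A minimal sketch of the fix, using `isna()` instead of `==` (the column values here are assumed from the desired output above):

```python
import numpy as np
import pandas as pd

# np.nan is not equal to anything, not even itself, so the
# `== np.nan` checks above never fire:
print(np.nan == np.nan)  # False

df = pd.DataFrame({'A': [np.nan, 10, 1, 5], 'B': [10, np.nan, 5, 1]})

# Use isna() (or isnull()) to detect missing values instead
df['is_less'] = np.where(df['B'] < df['A'], 'yes', 'no')
df['is_less'] = np.where(df['A'].isna() | df['B'].isna(), np.nan, df['is_less'])
print(df['is_less'].tolist())
```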

Option 2:

def reduced_growth(x):
  if (x['A'] == np.nan or x['B']==np.nan):
    return np.nan
  elif (x['B'] < x['A']):
    return "yes"
  elif (x['B'] >= x['A']): 
    return "no"
  else:
    return "0"

#create new feature using function
df['is_less']= df.apply(reduced_growth, axis=1)

Applying this function results in a mixture of 'yes', 'no' and "0", but no np.nans are returned.
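Option 2 fails for the same reason: `x['A'] == np.nan` is always `False`, so the NaN branch is never taken and such rows fall through to the `"0"` branch. A sketch of the corrected function, replacing the equality checks with `pd.isna` (sample data assumed from the desired output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 10, 1, 5], 'B': [10, np.nan, 5, 1]})

def reduced_growth(x):
    # pd.isna detects missing values; `x['A'] == np.nan` is always False
    if pd.isna(x['A']) or pd.isna(x['B']):
        return np.nan
    return 'yes' if x['B'] < x['A'] else 'no'

df['is_less'] = df.apply(reduced_growth, axis=1)
print(df['is_less'].tolist())  # [nan, nan, 'no', 'yes']
```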

How could I solve this problem?

CodePudding user response:

You can use the new dtypes in pandas (since 1.0) that properly handle missing values:

df = pd.DataFrame({'a': [1, None, 3, 5], 'b': [2, 1, None, 2]})
df = df.convert_dtypes()
df['is_less'] = df['a'] < df['b']

print(df)

See https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na

result:

      a     b  is_less
0     1     2     True
1  <NA>     1     <NA>
2     3  <NA>     <NA>
3     5     2    False

You can also use pd.array to directly create a dataframe with the new dtypes:

df = pd.DataFrame({
    'a': pd.array([1, None, 3, 5]),
    'b': pd.array([2, 1, None, 2]),
})

df['is_less'] = df['a'] < df['b']
print(df)

      a     b  is_less
0     1     2     True
1  <NA>     1     <NA>
2     3  <NA>     <NA>
3     5     2    False
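If you want the question's 'yes'/'no' labels rather than a nullable boolean, one option (a sketch, not part of the original answer) is to map the boolean result; `<NA>` has no entry in the dict, so it stays missing:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, None, 3, 5], 'b': [2, 1, None, 2]})
df = df.convert_dtypes()

# Map the nullable boolean onto 'yes'/'no'; <NA> is not a key
# in the dict, so missing rows remain missing.
df['is_less'] = (df['a'] < df['b']).map({True: 'yes', False: 'no'})
print(df['is_less'].tolist())
```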

CodePudding user response:

You can use a simple comparison and a mask:

df['is_less'] = (df['A'].gt(df['B'])
                   .mask(df[['A', 'B']].isna().any(axis=1))
                   .map({True: 'yes', False: 'no'}))

output:

      A     B is_less
0   NaN  10.0     NaN
1  10.0   NaN     NaN
2   1.0   5.0      no
3   5.0   1.0     yes
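Broken into steps, the chain reads as: compare, blank out rows with missing inputs, then relabel. A sketch with the question's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 10, 1, 5], 'B': [10, np.nan, 5, 1]})

cmp = df['A'].gt(df['B'])                # True where B < A; NaN rows compare as False
bad = df[['A', 'B']].isna().any(axis=1)  # rows where A or B is missing
df['is_less'] = cmp.mask(bad).map({True: 'yes', False: 'no'})
print(df['is_less'].tolist())  # [nan, nan, 'no', 'yes']
```

The `mask` step is what rescues the NaN rows: `gt` silently treats them as `False`, so without it they would be labeled 'no'.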

CodePudding user response:

Try rewriting your np.where statement:

df['is_less'] = np.where(df['A'].isnull() | df['B'].isnull(), np.nan,  # check if A or B is np.nan
                         np.where(df['B'].ge(df['A']), 'no', 'yes'))   # check if B >= A

prints:

      A     B is_less
0   NaN  10.0     nan
1  10.0   NaN     nan
2   1.0   5.0      no
3   5.0   1.0     yes

Note: `.ge` stands for "greater than or equal".

CodePudding user response:

You can use np.select to do it using a list of conditions and a corresponding list of values likewise:

conditions = [df['A'].isnull() | df['B'].isnull(),
              df['B'] < df['A'],
              df['B'] >= df['A']]
values = [np.nan, "yes", "no"]

df['is_less'] = np.select(conditions, values, "0")
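Put together with the question's sample data, this looks as follows. One caveat (same as the np.where answer above): because the choices mix `np.nan` with strings, NumPy casts everything to strings, so missing rows come out as the string `'nan'` rather than a real NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 10, 1, 5], 'B': [10, np.nan, 5, 1]})

conditions = [df['A'].isnull() | df['B'].isnull(),
              df['B'] < df['A'],
              df['B'] >= df['A']]
values = [np.nan, 'yes', 'no']

# np.select takes the first matching condition per row, so the
# missing-value check must come before the comparisons.
df['is_less'] = np.select(conditions, values, '0')
print(df['is_less'].tolist())
```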