Home > Back-end >  Pandas: compare two columns where one has NaN
Pandas: compare two columns where one has NaN

Time:06-22

I have the dataframe below:

        Date             A        B
2022-05-01 08:00:00      43       100
2022-05-01 08:01:00      NaN      54
2022-05-01 08:02:00      41       100

I will like to filter the rows if:

  1. Column A is NaN
  2. Column B is < 100

I tried this line of code:

dff = df.loc[((df[df['A'].isna()]) & (df['B'] < 100))]

However I get an error:

TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]

I think the error is pretty clear (I can't compare A with B due to different data type?)

Is there a way to compare the two columns and ideally to add a new column with comments for every filtered rows?

The expected output is:

        Date             A        B       Comments
2022-05-01 08:01:00      NaN      54      Damaged item

Thanks.

CodePudding user response:

Problem in your Code Snippet:

Before moving towards a solution let's discuss the problem in your Code Snippet:

  • The primary reason for Code Breaking is the Return type of your Conditions:

    1. df['A'].isna() will return bool values
    2. [df['B'] < 100] will return a Data
  • So, according to the above condition we can't compare Bool and Float

To know more about pd.dataFrame.isna(): Click Here

Solution :

Now, let's deep dive into the Solution of your given Scenario.

  • The easiest way to solve such an error is that we have to use Conditions in which the return would be the same for both conditions. So, here I have used the lt operator to compare values that return bool based on the condition.

So, for detailed solution refer the Code Snippet mentioned below:

# Import all the important Modules
import pandas as pd
import numpy as np

# Data Regenration
data_df = pd.DataFrame({
    'Date': ['2022-05-01 08:00:00', '2022-05-01 08:01:00', '2022-05-01 08:02:00'],
    'A': [43, np.nan, 41],
    'B': [100, 54, 100]
})

# Condition for 'Damage Items'
conditions = [
    (data_df['A'].isna() & data_df['B'].lt(100))
]

# 'Choice' to return when above 'Condition' is true
choices = ['Damaged item']

# Store Results
data_df['Comments'] = np.select(conditions, choices, default='')
# Output of above 'Code'
    Date                 A        B     Comments
0   2022-05-01 08:00:00  43.0     100   
1   2022-05-01 08:01:00  NaN      54    Damaged item
2   2022-05-01 08:02:00  41.0     100   

To know more about np.select(): Click Here

  • Related