How can I check if a column has missing values in an if statement?

I want to check whether a column in my dataframe has missing values (under a given condition) and, if so, replace those missing values with '-'. Here's my code:

 for i in range(len(sample)):
    if sample['label'] != 0 & sample['attack_cat'].isnull() == True:
        sample['attack_cat'] = sample['attack_cat'].fillna('-')
    else:
        sample['attack_cat']

I get this error, raised in __nonzero__: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I checked in debug, it says:

    @final
    def __nonzero__(self):
        raise ValueError(
            f"The truth value of a {type(self).__name__} is ambiguous. "
            "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
        )

Do you have any idea how to solve this? Thanks.

CodePudding user response：

You can simply use pandas.DataFrame.loc with a boolean mask:

mask = sample['label'].ne(0) & sample['attack_cat'].isnull()

sample.loc[mask, 'attack_cat'] = '-'
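
To see why the original `if` fails while the mask works, here is a minimal sketch. The three-row frame is a made-up stand-in for `sample`, not the asker's data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for `sample`.
sample = pd.DataFrame({
    "label": [0, 1, 1],
    "attack_cat": [None, None, "dos"],
})

# `if` needs a single True/False, but isnull() on a Series returns one
# boolean per row, so pandas refuses to guess and raises ValueError.
caught = None
try:
    if sample["attack_cat"].isnull():
        pass
except ValueError as err:
    caught = type(err).__name__
print(caught)

# A boolean mask keeps the per-row answers and lets .loc select the rows.
mask = sample["label"].ne(0) & sample["attack_cat"].isnull()
sample.loc[mask, "attack_cat"] = "-"
print(sample["attack_cat"].tolist())
```

Only the second row is changed: the first row is missing but has label 0, and the third row is not missing.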

CodePudding user response：

If I'm understanding correctly, you should be able to define your condition and use .loc to fill your nulls:

cond = (sample['label'] != 0) & (sample['attack_cat'].isnull())

sample.loc[cond, 'attack_cat'] = sample.loc[cond, 'attack_cat'].fillna('-')

A few things here. If you combine multiple conditions, you need to put each one in parentheses, because & binds more tightly than != and ==:

(sample['label'] != 0) & (sample['attack_cat'].isnull())

rather than

sample['label'] != 0 & sample['attack_cat'].isnull()

Also, you don't need isnull() == True, just isnull(), which already returns booleans.
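
The effect of the missing parentheses can be checked on two small Series. This is a sketch with made-up values; `label` and `is_missing` are illustrative names rather than columns from the question.

```python
import pandas as pd

label = pd.Series([0, 1, 2])
is_missing = pd.Series([True, True, False])

# Without parentheses, Python evaluates `0 & is_missing` first and only
# then applies `!=` to the result.
unparenthesised = label != 0 & is_missing
explicit = label != (0 & is_missing)

# With parentheses, each comparison is evaluated on its own and the two
# boolean Series are then combined element-wise.
intended = (label != 0) & is_missing

print(unparenthesised.equals(explicit))
print(unparenthesised.equals(intended))
```

The first comparison confirms how Python groups the expression. The second shows that this grouping does not give the intended mask.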

Also, you're iterating through a range of numbers but not doing anything with them: you write for i in range(len(sample)): but i doesn't appear anywhere else in your code. If you want to iterate through a dataframe and do something row by row, you'll need something like

for index, row in sample.iterrows():
    if pd.isnull(row['label']):  # a single cell has no .isnull() method
        ...  # etc.

or

for i in range(len(sample)):
    if pd.isnull(sample.iloc[i]['label']):
        ...  # etc.

And lastly, a note on the condition. Your code tests label for being non-zero and attack_cat for being null. Those are different columns, so both tests are needed. If both tests were on label, the != 0 part would be redundant, because a null value never compares equal to 0.
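
One detail worth checking when a condition combines `!= 0` with missing values: a missing (NaN) entry compares as not equal to 0, so it passes a `!= 0` test. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

label = pd.Series([0.0, 1.0, np.nan])

# NaN is unequal to every value, so the missing label counts as "not 0".
not_zero = label != 0
print(not_zero.tolist())  # [False, True, True]

# To leave out rows whose label is itself missing, add an explicit notna().
not_zero_and_present = label.notna() & (label != 0)
print(not_zero_and_present.tolist())  # [False, True, False]
```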

CodePudding user response：

You don't need to iterate through the dataframe to fill the missing values. Here's how you could do it:

sample.loc[(sample['label'] != 0) & (sample['attack_cat'].isna()), 'attack_cat'] = '-'
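
If you would rather get a new column back than modify `sample` in place, `Series.mask` is one possible alternative. This is a sketch on a made-up three-row frame, not part of the original answer.

```python
import pandas as pd

# Hypothetical data standing in for `sample`.
sample = pd.DataFrame({
    "label": [0, 2, 3],
    "attack_cat": [None, None, "fuzzers"],
})

cond = (sample["label"] != 0) & sample["attack_cat"].isna()

# Series.mask returns a new Series in which the entries where `cond` is
# True are replaced by "-"; `sample` itself is left untouched.
filled = sample["attack_cat"].mask(cond, "-")
print(filled.tolist())
```

Assigning `filled` back to `sample['attack_cat']` has the same effect as the `.loc` one-liner.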

Full code with sample data


# == Necessary Imports =======================================
from __future__ import annotations  # Lets `list | None` hints work on Python < 3.10
import pandas as pd
# Used to generate a random sample to test the code.
import numpy as np

# == Generate Random Sample DataFrame ========================

def generate_sample_dataframe(
    size: int = 20,
    choices_attack_cat: list | None = None,
    choices_label: list | None = None,
) -> pd.DataFrame:
    """
    Generate a sample dataframe with two columns:

        * 'label'
        * 'attack_cat'

    Parameters
    ----------
    size : int, default=20
        The number of rows in the dataframe.
    choices_attack_cat : list | None, optional
        The possible values for the column 'attack_cat'.
        Default values:
            * True
            * False
            * None

    choices_label : list | None, optional
        The possible values for the column 'label'.
        Default labels:
            * 0
            * 1
            * 2
            * 3
            * 4
            * 5
            * None

    Returns
    -------
    pd.DataFrame
        A dataframe with two columns: 'label' and 'attack_cat'.

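    Examples
    --------
    The rows are drawn at random, so the values printed below are only
    illustrative and will differ from run to run.
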
    >>> generate_sample_dataframe(size=5)
       label  attack_cat
    0      0        True
    1      0        True
    2      0        True
    3      0        True
    4      0        True

    >>> generate_sample_dataframe(size=5, choices_attack_cat=[True, False])
       label  attack_cat
    0      0        True
    1      0        True
    2      0        True
    3      0        True
    4      0        True

    >>> generate_sample_dataframe(size=5, choices_label=[0, 1, 2])
       label  attack_cat
    0      0        True
    1      0        True
    2      0        True
    3      0        True
    4      0        True

    >>> generate_sample_dataframe(
    ...     size=5,
    ...     choices_attack_cat=[True, False],
    ...     choices_label=[0, 1, 2],
    ... )
       label  attack_cat
    0      0        True
    1      0        True
    2      0        True
    3      0        True
    4      0        True
    """
    if choices_attack_cat is None:
        choices_attack_cat = [True, False, None]
    elif not hasattr(choices_attack_cat, "__iter__") or isinstance(
        choices_attack_cat, str
    ):
        choices_attack_cat = [choices_attack_cat]
    if choices_label is None:
        choices_label = [0, 1, 2, 3, 4, 5, None]
    elif not hasattr(choices_label, "__iter__") or isinstance(
        choices_label, str
    ):
        choices_label = [choices_label]
    return pd.DataFrame(
        {
            "label": np.random.choice(choices_label, size=size),
            "attack_cat": np.random.choice(choices_attack_cat, size=size),
        }
    )


# == Random Sample DataFrame ==================================

sample = generate_sample_dataframe(100)

# == Solution =================================================
# Replace missing values in "attack_cat" with "-" on the rows
# where "label" is different from 0.
# Notes:
#  - `&` is the element-wise counterpart of `and` for Series; plain
#    `and` raises the "truth value is ambiguous" error. For an
#    element-wise `or`, use `|`.
sample.loc[(sample['label'] != 0) & (sample['attack_cat'].isna()), 'attack_cat'] = '-'

Iterating through the dataframe

If you really want to iterate over the dataframe, you could use something like this:

for index, row in sample.iterrows():
    if row['label'] != 0 and pd.isna(row['attack_cat']):
        sample.loc[index, 'attack_cat'] = '-'
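
Inside a row loop, the write has to target the frame itself, because `row` is generally a copy of the data. A minimal sketch on made-up data, using `.at` for single-cell assignment by index label:

```python
import pandas as pd

# Hypothetical data standing in for `sample`.
sample = pd.DataFrame({
    "label": [0, 1, 2],
    "attack_cat": [None, None, "dos"],
})

for index, row in sample.iterrows():
    if row["label"] != 0 and pd.isna(row["attack_cat"]):
        # .at writes one cell of the original frame, addressed by the
        # index label and the column name.
        sample.at[index, "attack_cat"] = "-"

print(sample["attack_cat"].tolist())
```

For anything beyond a small frame, the vectorised `.loc` version is usually much faster than a row loop.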