I have a df in the following format:
year  type1   type2    price
2015  apple   natural  40
2015  apple   organic  35
2016  apple   natural  44
2016  apple   organic  40
2015  banana  natural  20
2015  banana  organic  15
2016  banana  natural  20
2016  banana  organic  18
I need to create a new column price_new when the year, type1 and type2 conditions are met. In other words, for the same year and type1, if type2 is natural then fill the new column with a new value, otherwise keep the old value.
I tried the below:
df["price_new"] = np.where(((df["year"] == 2015) & (
df["type1"] == "apple") & (df["type2"].isin(['natural']))),
25, df["price"])
df["price_new"] = np.where(((df["year"] == 2016) & (
df["type1"] == "apple") & (df["type2"].isin(['natural']))),
26, df["price"])
df["price_new"] = np.where(((df["year"] == 2015) & (
df["type1"] == "apple") & (~df["type2"].isin(['natural']))),
20, df["price"])
df["price_new"] = np.where(((df["year"] == 2016) & (
df["type1"] == "apple") & (~df["type2"].isin(['natural']))),
22, df["price"])
The output should be like below:
year  type1   type2    price  price_new
2015  apple   natural  40     25
2015  apple   organic  35     20
2016  apple   natural  44     26
2016  apple   organic  40     22
2015  banana  natural  20     20
2015  banana  organic  15     15
2016  banana  natural  20     20
2016  banana  organic  18     18
However, only the values from the last condition are applied:
year  type1   type2    price  price_new
2015  apple   natural  40     40
2015  apple   organic  35     35
2016  apple   natural  44     44
2016  apple   organic  40     22
2015  banana  natural  20     20
2015  banana  organic  15     15
2016  banana  natural  20     20
2016  banana  organic  18     18
- How can the price_new column get the new values for all of the conditions?
- In my real data I have more than 10 types in the type1 column. Is there an efficient way to write this instead of repeating it for each unique value in type1?
CodePudding user response:
Answering the first question:
- First copy all prices into price_new.
- Then use df["price_new"] as the default value in the np.where calls, so each call keeps the values set by the previous ones.
df["price_new"] = df["price"]
df["price_new"] = np.where(condition, value, df["price_new"])
Answering the second question:
You can reduce the code by putting all conditions and their corresponding values into a suitable data structure (here a list of named tuples), and then using it to generate the boolean masks (condlist) that np.select uses to compute the result.
Note that I use dicts with dict.setdefault when computing the masks, so that masks appearing more than once are only computed once.
import numpy as np
from collections import namedtuple
Condition = namedtuple("Condition", ["year", "type1", "type2"])
Mapping = namedtuple("Mapping", ["condition", "value"])
mappings = [
Mapping(Condition(year=2015, type1="apple", type2="natural"), value=25),
Mapping(Condition(year=2016, type1="apple", type2="natural"), value=26),
Mapping(Condition(year=2015, type1="apple", type2="organic"), value=20),
Mapping(Condition(year=2016, type1="apple", type2="organic"), value=22),
]
conditions, values = zip(*mappings)
years, types1, types2 = {}, {}, {}
condlist = [
(
years.setdefault(year, df["year"] == year)
& types1.setdefault(type1, df["type1"] == type1)
& types2.setdefault(type2, df["type2"] == type2)
)
for year, type1, type2 in conditions
]
df["price_new"] = np.select(condlist, values, default=df["price"])
Check if it works as expected:
expected = [25, 20, 26, 22, 20, 15, 20, 18]
print(f"{np.all(df['price_new'] == expected) = }") # True
Old answer:
You can put all your conditions and matching values into a suitable data structure and then process it.
Here I chose a list of tuples (named tuples for clarity) where the first element of each tuple is the condition and the second element is the corresponding value.
from collections import namedtuple
Mapping = namedtuple("Mapping", ["where", "value"])
mappings = [
Mapping(
where=(
(df["year"] == 2015)
& (df["type1"] == "apple")
& (df["type2"].isin(["natural"]))
),
value=25,
),
Mapping(
where=(
(df["year"] == 2016)
& (df["type1"] == "apple")
& (df["type2"].isin(["natural"]))
),
value=26,
),
Mapping(
where=(
(df["year"] == 2015)
& (df["type1"] == "apple")
& (~df["type2"].isin(["natural"]))
),
value=20,
),
Mapping(
where=(
(df["year"] == 2016)
& (df["type1"] == "apple")
& (~df["type2"].isin(["natural"]))
),
value=22,
),
]
From there you can either:
- Rewrite the previous code as a for loop:
df["price_new_1"] = df["price"]
for where, value in mappings:
df["price_new_1"] = np.where(where, value, df["price_new_1"])
- Use np.select to compute all at once:
condlist, choicelist = zip(*mappings)
df["price_new_2"] = np.select(condlist, choicelist, default=df["price"])
Check if it works as expected:
expected = [25, 20, 26, 22, 20, 15, 20, 18]
print(f"{np.all(df['price_new_1'] == expected) = }") # True
print(f"{np.all(df['price_new_2'] == expected) = }") # True