I have a df in the following format:
year  type1   type2    price
2015  apple   natural  40
2015  apple   organic  35
2016  apple   natural  44
2016  apple   organic  40
2015  banana  natural  20
2015  banana  organic  15
2016  banana  natural  20
2016  banana  organic  18
I need to create a new column price_new when the year, type1 and type2 conditions are met. In other words, for the same year and type1, if type2 is natural then fill the new column with a new value, otherwise keep the old value.
I tried the below:
df["price_new"] = np.where(((df["year"] == 2015) & (
df["type1"] == "apple") & (df["type2"].isin(['natural']))),
25, df["price"])
df["price_new"] = np.where(((df["year"] == 2016) & (
df["type1"] == "apple") & (df["type2"].isin(['natural']))),
26, df["price"])
df["price_new"] = np.where(((df["year"] == 2015) & (
df["type1"] == "apple") & (~df["type2"].isin(['natural']))),
20, df["price"])
df["price_new"] = np.where(((df["year"] == 2016) & (
df["type1"] == "apple") & (~df["type2"].isin(['natural']))),
22, df["price"])
The output should be like below:
year  type1   type2    price  price_new
2015  apple   natural  40     25
2015  apple   organic  35     20
2016  apple   natural  44     26
2016  apple   organic  40     22
2015  banana  natural  20     20
2015  banana  organic  15     15
2016  banana  natural  20     20
2016  banana  organic  18     18
However, only the values from the last condition are applied:
year  type1   type2    price  price_new
2015  apple   natural  40     40
2015  apple   organic  35     35
2016  apple   natural  44     44
2016  apple   organic  40     22
2015  banana  natural  20     20
2015  banana  organic  15     15
2016  banana  natural  20     20
2016  banana  organic  18     18
- How can the price_new column get the new values for all of the conditions?
- In my real data I have more than 10 types in the type1 column. Is there an efficient way to write this instead of repeating it for each unique value in type1?
CodePudding user response:
Answering the first question:
- First copy all prices into price_new.
- Then use df["price_new"] as the default value in the np.where calls, so each call keeps the values set by the previous ones.
df["price_new"] = df["price"]
df["price_new"] = np.where(condition, value, df["price_new"])
Answering the second question:
You can reduce the code by putting all conditions and their corresponding values into a suitable data structure (here a list of named tuples), and then using it to generate the boolean masks (condlist) that np.select uses to compute the result.
Note that I use dicts with dict.setdefault when computing the masks, so that masks appearing more than once are only computed once.
import numpy as np
from collections import namedtuple
Condition = namedtuple("Condition", ["year", "type1", "type2"])
Mapping = namedtuple("Mapping", ["condition", "value"])
mappings = [
Mapping(Condition(year=2015, type1="apple", type2="natural"), value=25),
Mapping(Condition(year=2016, type1="apple", type2="natural"), value=26),
Mapping(Condition(year=2015, type1="apple", type2="organic"), value=20),
Mapping(Condition(year=2016, type1="apple", type2="organic"), value=22),
]
conditions, values = zip(*mappings)
years, types1, types2 = {}, {}, {}
condlist = [
(
years.setdefault(year, df["year"] == year)
& types1.setdefault(type1, df["type1"] == type1)
& types2.setdefault(type2, df["type2"] == type2)
)
for year, type1, type2 in conditions
]
df["price_new"] = np.select(condlist, values, default=df["price"])
Check if it works as expected:
expected = [25, 20, 26, 22, 20, 15, 20, 18]
print(f"{np.all(df['price_new'] == expected) = }") # True
Old answer:
You can put all your conditions and matching values into a suitable data structure and then process it.
Here I chose a list of tuples (named tuples for clarity) where the first element of each tuple is the condition and the second element is the corresponding value.
from collections import namedtuple
Mapping = namedtuple("Mapping", ["where", "value"])
mappings = [
Mapping(
where=(
(df["year"] == 2015)
& (df["type1"] == "apple")
& (df["type2"].isin(["natural"]))
),
value=25,
),
Mapping(
where=(
(df["year"] == 2016)
& (df["type1"] == "apple")
& (df["type2"].isin(["natural"]))
),
value=26,
),
Mapping(
where=(
(df["year"] == 2015)
& (df["type1"] == "apple")
& (~df["type2"].isin(["natural"]))
),
value=20,
),
Mapping(
where=(
(df["year"] == 2016)
& (df["type1"] == "apple")
& (~df["type2"].isin(["natural"]))
),
value=22,
),
]
From there you can either:
- Rewrite the previous code as a for loop:
df["price_new_1"] = df["price"]
for where, value in mappings:
df["price_new_1"] = np.where(where, value, df["price_new_1"])
- Use np.select to compute all at once:
condlist, choicelist = zip(*mappings)
df["price_new_2"] = np.select(condlist, choicelist, default=df["price"])
Check if it works as expected:
expected = [25, 20, 26, 22, 20, 15, 20, 18]
print(f"{np.all(df['price_new_1'] == expected) = }") # True
print(f"{np.all(df['price_new_2'] == expected) = }") # True