I would like to clean up a pandas dataframe using conditions that are defined in a dict-like structure. There may be 1 to n conditions.
I looked into using numpy.where() but I do not know how to programmatically create the nested conditions needed.
How can I reach my goal?
Here is some sample code:
import pandas as pd
import numpy as np
d = [
["apple", "square", "green"],
["apple", "round", "blue"],
["orange", "long", "yellow"],
]
df = pd.DataFrame(d, columns=["fruit", "shape", "color"])
# conditions
# change fruit to "blueberry" when shape == "round" AND color == "blue"
# change fruit to "banana" when fruit == "orange" AND shape == "long" AND color == "yeelow"
# change shape to "round" when fruit == "apple" and color == "green"
print(df)
# this works but cannot be abstracted, or can it?
df["fruit"] = np.where(df["shape"] == "round",np.where(df["color"] == "blue","blueberry",df["fruit"],),df["fruit"],)
print(df)
# example for rules, alternatives for formatting are also welcome
rules = [
[
{
"condition": [
{
"shape": "round",
"color": "blue",
}
],
"result": [{"fruit": "blueberry"}],
}
],
[
{
"condition": [
{
"fruit": "orange",
"shape": "long"
}
],
"result": [{"fruit": "banana"}],
}
],
[
{
"condition": [
{
"fruit": "apple",
"color": "green"
}
],
"result": [{"shape": "round"}],
}
],
]
CodePudding user response:
- your example data structure of rules can be used. Code below shows how. However IMHO it has lists where they add no value. I would go with a list of dicts, without embedded lists
- have flattened this to start with to this structure tuple that defined condition and result
[({'shape': 'round', 'color': 'blue'}, ('fruit', 'blueberry')),
({'fruit': 'orange', 'shape': 'long'}, ('fruit', 'banana')),
({'fruit': 'apple', 'color': 'green'}, ('shape', 'round'))]
- then it's a case of building an expression string for pandas query
- filter to rows identified and update defined column to defined value using
loc[]
import pandas as pd
import numpy as np
d = [
["apple", "square", "green"],
["apple", "round", "blue"],
["orange", "long", "yellow"],
]
df = pd.DataFrame(d, columns=["fruit", "shape", "color"])
# example for rules, alternatives for formatting are also welcome
rules = [
[
{
"condition": [
{
"shape": "round",
"color": "blue",
}
],
"result": [{"fruit": "blueberry"}],
}
],
[
{
"condition": [{"fruit": "orange", "shape": "long"}],
"result": [{"fruit": "banana"}],
}
],
[
{
"condition": [{"fruit": "apple", "color": "green"}],
"result": [{"shape": "round"}],
}
],
]
# flatten out all those nested lists in rules data structure
for cond, (col, val) in [
(cond,) tuple(res.items())
for r in rules
for u in r
for cond, res in zip(u["condition"], u["result"])
]:
df.loc[
df.query(" & ".join([f"({c}=='{v}')" for c, v in cond.items()])).index,
col,
] = val
df
fruit | shape | color | |
---|---|---|---|
0 | apple | round | green |
1 | blueberry | round | blue |
2 | banana | long | yellow |