Home > Blockchain >  How to clean up pandas dataframe with 1 to n conditions stored in a dictionary?
How to clean up pandas dataframe with 1 to n conditions stored in a dictionary?

Time:03-31

I would like to clean up a pandas dataframe using conditions that are defined in a dict-like structure. There may be 1 to n conditions.

I looked into using numpy.where() but I do not know how to programmatically create the nested conditions needed.

How can I reach my goal?

Here is some sample code:

import pandas as pd
import numpy as np

d = [
    ["apple", "square", "green"],
    ["apple", "round", "blue"],
    ["orange", "long", "yellow"],
]

df = pd.DataFrame(d, columns=["fruit", "shape", "color"])

# conditions
# change fruit to "blueberry" when shape == "round" AND color == "blue"
# change fruit to "banana" when fruit == "orange" AND shape == "long" AND color == "yeelow"
# change shape to "round" when fruit == "apple" and color == "green"

print(df)

# this works but cannot be abstracted, or can it?
df["fruit"] = np.where(df["shape"] == "round",np.where(df["color"] == "blue","blueberry",df["fruit"],),df["fruit"],)

print(df)

# example for rules, alternatives for formatting are also welcome
rules = [
    [
        {
            "condition": [
                {
                    "shape": "round",
                    "color": "blue",
                }
            ],
            "result": [{"fruit": "blueberry"}],
        }
    ],
    [
        {
            "condition": [
                {
                    "fruit": "orange", 
                    "shape": "long"
                    }
                    ],
            "result": [{"fruit": "banana"}],
        }
    ],
    [
        {
            "condition": [
                {
                    "fruit": "apple", 
                    "color": "green"
                    }
                    ],
            "result": [{"shape": "round"}],
        }
    ],
]

CodePudding user response:

  • your example data structure of rules can be used. Code below shows how. However IMHO it has lists where they add no value. I would go with a list of dicts, without embedded lists
  • have flattened this to start with to this structure tuple that defined condition and result
[({'shape': 'round', 'color': 'blue'}, ('fruit', 'blueberry')),
 ({'fruit': 'orange', 'shape': 'long'}, ('fruit', 'banana')),
 ({'fruit': 'apple', 'color': 'green'}, ('shape', 'round'))]
  • then it's a case of building an expression string for pandas query
  • filter to rows identified and update defined column to defined value using loc[]
import pandas as pd
import numpy as np

d = [
    ["apple", "square", "green"],
    ["apple", "round", "blue"],
    ["orange", "long", "yellow"],
]

df = pd.DataFrame(d, columns=["fruit", "shape", "color"])

# example for rules, alternatives for formatting are also welcome
rules = [
    [
        {
            "condition": [
                {
                    "shape": "round",
                    "color": "blue",
                }
            ],
            "result": [{"fruit": "blueberry"}],
        }
    ],
    [
        {
            "condition": [{"fruit": "orange", "shape": "long"}],
            "result": [{"fruit": "banana"}],
        }
    ],
    [
        {
            "condition": [{"fruit": "apple", "color": "green"}],
            "result": [{"shape": "round"}],
        }
    ],
]

# flatten out all those nested lists in rules data structure
for cond, (col, val) in [
    (cond,)   tuple(res.items())
    for r in rules
    for u in r
    for cond, res in zip(u["condition"], u["result"])
]:
    df.loc[
        df.query(" & ".join([f"({c}=='{v}')" for c, v in cond.items()])).index,
        col,
    ] = val

df
fruit shape color
0 apple round green
1 blueberry round blue
2 banana long yellow
  • Related