New DataFrame column that contains IDs where value is outside bounds?


I have the following DataFrame:

import pandas as pd

data: dict[str, list[int]] = {
    "x1": [5, 6, 7, 8, 9],
    "min1": [3, 3, 3, 3, 3],
    "max1": [8, 8, 8, 8, 8],
    "x2": [0, 1, 2, 3, 4],
    "min2": [2, 2, 2, 2, 2],
    "max2": [7, 7, 7, 7, 7],
    "x3": [7, 6, 7, 6, 7],
    "min3": [1, 1, 1, 1, 1],
    "max3": [6, 6, 6, 6, 6],
}
n: int = 3  # number of xi
df: pd.DataFrame = pd.DataFrame(data=data)


   x1  min1  max1  x2  min2  max2  x3  min3  max3
0   5     3     8   0     2     7   7     1     6
1   6     3     8   1     2     7   6     1     6
2   7     3     8   2     2     7   7     1     6
3   8     3     8   3     2     7   6     1     6
4   9     3     8   4     2     7   7     1     6

I would like to add a new column alert to df that contains the IDs i where xi < mini or xi > maxi.

Expected result

   x1  min1  max1  x2  min2  max2  x3  min3  max3   alert
0   5     3     8   0     2     7   7     1     6   "2,3"
1   6     3     8   1     2     7   6     1     6     "2"
2   7     3     8   2     2     7   7     1     6     "3"
3   8     3     8   3     2     7   6     1     6      ""
4   9     3     8   4     2     7   7     1     6   "1,3"

I looked at this answer but could not understand how to apply it to my problem.
Below is my working implementation that I wish to improve.

def f(row: pd.Series) -> str:
    alert: str = ""
    for k in range(1, n + 1):
        if row[f"x{k}"] < row[f"min{k}"] or row[f"x{k}"] > row[f"max{k}"]:
            alert += f"{k}"
    # Splits the string into characters, so this only works for single-digit IDs
    return ",".join(list(alert))

df["alert"] = df.apply(f, axis=1)

CodePudding user response:

Actually given your output as strings, your approach isn't too bad. I would just suggest making alert a list, not a string:

def f(row: pd.Series) -> str:
    alert: list[str] = []
    for k in range(1, n + 1):
        if row[f"x{k}"] < row[f"min{k}"] or row[f"x{k}"] > row[f"max{k}"]:
            alert.append(str(k))
    return ",".join(alert)
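For reference, here is a self-contained sketch of the list-based version run against the data from the question. The name `make_alert` is hypothetical, and passing `n` as a keyword argument instead of reading a global is only an assumption about what is more convenient.

```python
import pandas as pd

data = {
    "x1": [5, 6, 7, 8, 9], "min1": [3] * 5, "max1": [8] * 5,
    "x2": [0, 1, 2, 3, 4], "min2": [2] * 5, "max2": [7] * 5,
    "x3": [7, 6, 7, 6, 7], "min3": [1] * 5, "max3": [6] * 5,
}
df = pd.DataFrame(data)


def make_alert(row: pd.Series, n: int) -> str:
    """Join the IDs k for which x{k} lies outside [min{k}, max{k}]."""
    alert = []
    for k in range(1, n + 1):
        if row[f"x{k}"] < row[f"min{k}"] or row[f"x{k}"] > row[f"max{k}"]:
            alert.append(str(k))  # keep each ID as a whole string
    return ",".join(alert)


# Extra keyword arguments to DataFrame.apply are forwarded to the function
df["alert"] = df.apply(make_alert, axis=1, n=3)
print(df["alert"].tolist())
```

Because each ID is appended as a whole string, this also works once there are ten or more IDs.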

A fancier, vectorized alternative:

xs = df.filter(regex='^x')
mins = df.filter(like='min').to_numpy()
maxes = df.filter(like='max').to_numpy()

mask = (xs < mins) | (xs > maxes)
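# regex=True is required in pandas >= 2.0, where str.replace treats the pattern literally by default
df['alert'] = (mask @ xs.columns.str.replace('x', ',')).str.replace('^,', '', regex=True)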
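If the matrix-product trick feels too opaque, the same mask can be turned into labels with a plain list comprehension. This is only a sketch: it assumes the columns follow the `x{k}` / `min{k}` / `max{k}` naming from the question, and the variable names `ids` and `labels` are mine.

```python
import numpy as np
import pandas as pd

data = {
    "x1": [5, 6, 7, 8, 9], "min1": [3] * 5, "max1": [8] * 5,
    "x2": [0, 1, 2, 3, 4], "min2": [2] * 5, "max2": [7] * 5,
    "x3": [7, 6, 7, 6, 7], "min3": [1] * 5, "max3": [6] * 5,
}
df = pd.DataFrame(data)
n = 3
ids = [str(k) for k in range(1, n + 1)]

# Select columns by name so x{k}, min{k}, max{k} line up regardless of column order
xs = df[[f"x{k}" for k in ids]].to_numpy()
mins = df[[f"min{k}" for k in ids]].to_numpy()
maxes = df[[f"max{k}" for k in ids]].to_numpy()

# mask[i, j] is True when the (j+1)-th value of row i is outside its bounds
mask = (xs < mins) | (xs > maxes)

# Boolean-index the label array with each mask row and join what is left
labels = np.array(ids)
df["alert"] = [",".join(labels[row]) for row in mask]
print(df["alert"].tolist())
```

The comparisons stay vectorized; only the final string join runs in a Python loop, once per row.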

CodePudding user response:

We can group the columns by the integer suffix they contain:

# Note: groupby(..., axis=1) is deprecated since pandas 2.1
df['alert'] = (df.groupby(df.columns.str.extract(r'(\d+)$')[0].tolist(), axis=1)
               .apply(lambda g: g[f'x{g.name}'].lt(g[f'min{g.name}']) | g[f'x{g.name}'].gt(g[f'max{g.name}']))
               .apply(lambda row: ','.join(row.index[row]), axis=1))

   x1  min1  max1  x2  min2  max2  x3  min3  max3 alert
0   5     3     8   0     2     7   7     1     6   2,3
1   6     3     8   1     2     7   6     1     6     2
2   7     3     8   2     2     7   7     1     6     3
3   8     3     8   3     2     7   6     1     6
4   9     3     8   4     2     7   7     1     6   1,3

Intermediate result

(df.groupby(df.columns.str.extract(r'(\d+)$')[0].tolist(), axis=1)
 .apply(lambda g: g[f'x{g.name}'].lt(g[f'min{g.name}']) | g[f'x{g.name}'].gt(g[f'max{g.name}'])))

       1      2      3
0  False   True   True
1  False   True  False
2  False  False   True
3  False  False  False
4   True  False   True
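On pandas versions where column-wise `groupby` is not available, a similar intermediate table can be built without it. The sketch below is an assumption about how one might do that, not part of the answer above: it extracts the numeric suffixes the same way, builds one boolean column per ID using the strict comparisons from the question (`<` and `>`), and then joins the flagged IDs per row. The name `outside` is hypothetical.

```python
import pandas as pd

data = {
    "x1": [5, 6, 7, 8, 9], "min1": [3] * 5, "max1": [8] * 5,
    "x2": [0, 1, 2, 3, 4], "min2": [2] * 5, "max2": [7] * 5,
    "x3": [7, 6, 7, 6, 7], "min3": [1] * 5, "max3": [6] * 5,
}
df = pd.DataFrame(data)

# Unique numeric suffixes, in order of first appearance: '1', '2', '3'
ids = df.columns.str.extract(r"(\d+)$")[0].unique()

# One boolean column per ID: True where x{k} is outside [min{k}, max{k}]
outside = pd.DataFrame({
    k: df[f"x{k}"].lt(df[f"min{k}"]) | df[f"x{k}"].gt(df[f"max{k}"])
    for k in ids
})

# For each row, keep the column labels (IDs) whose flag is True
df["alert"] = outside.apply(lambda row: ",".join(row.index[row.to_numpy()]), axis=1)
print(outside)
print(df["alert"].tolist())
```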

CodePudding user response:

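Using pd.wide_to_long:

a = (pd.wide_to_long(df.reset_index(), ['x', 'min', 'max'], 'index', 'alert')
    .loc[lambda x: x['x'].lt(x['min']) | x['x'].gt(x['max'])]
    .reset_index()  # wide_to_long puts 'index' and 'alert' in the row index
    .groupby('index')['alert'].agg(lambda x: ','.join(x.astype(str))))
df['alert'] = a  # rows with no alert are left as NaN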

   x1  min1  max1  x2  min2  max2  x3  min3  max3 alert
0   5     3     8   0     2     7   7     1     6   2,3
1   6     3     8   1     2     7   6     1     6     2
2   7     3     8   2     2     7   7     1     6     3
3   8     3     8   3     2     7   6     1     6   NaN
4   9     3     8   4     2     7   7     1     6   1,3
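If an empty string is preferred over NaN for rows without alerts, one option is to reindex the aggregated result against the original index before assigning it. This is a sketch under that assumption; the suffix column name `id` and the explicit sort are my additions.

```python
import pandas as pd

data = {
    "x1": [5, 6, 7, 8, 9], "min1": [3] * 5, "max1": [8] * 5,
    "x2": [0, 1, 2, 3, 4], "min2": [2] * 5, "max2": [7] * 5,
    "x3": [7, 6, 7, 6, 7], "min3": [1] * 5, "max3": [6] * 5,
}
df = pd.DataFrame(data)

# Reshape to one row per (original row, ID) with columns x, min, max
long = pd.wide_to_long(
    df.reset_index(), stubnames=["x", "min", "max"], i="index", j="id"
).reset_index()

# Keep only the out-of-bounds measurements
outside = long[(long["x"] < long["min"]) | (long["x"] > long["max"])]

# Join the IDs per original row; sort first so the IDs come out in ascending order
alert = (
    outside.sort_values(["index", "id"])
    .groupby("index")["id"]
    .agg(lambda s: ",".join(s.astype(str)))
)

# Rows that never appear in `outside` get an empty string instead of NaN
df["alert"] = alert.reindex(df.index, fill_value="")
print(df["alert"].tolist())
```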