Dataframe filtering with multiple conditions on different columns


Let's say we have the following dataframe:

import pandas as pd

data = {'Item': ['1', '2', '3', '4', '5'],
        'A': [142, 11, 50, 60, 12],
        'B': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'D': [60, 0, 150, 170, 130],
        'E': [230, 200, 5, 10, 160]}

df = pd.DataFrame(data)

representing different items and the corresponding values of some of their parameters (e.g. length, width, and so on). However, not all the reported values are acceptable; in fact, each parameter has a different range of allowed values:

A and B go from -100 to 100

C goes from -70 to 70

D and E go from 100 to 300

So, as you can see, some items show values outside limits for just one parameter (e.g. item 2 is outside for parameter D), while others are outside for more than one parameter (e.g. item 1 is outside for both parameters A and D).

I need to analyze these data and get a table reporting:

how many items are out of limits for just one parameter, and the name of that parameter

how many items are out of limits for more than one parameter, and the names of those parameters

To be clearer: I need a simple way to know how many items fail and for which parameters. For example: four items out of five fail, and 2 of them (item #1 and item #3) fail for two parameters (item #1 for A and D, item #3 for B and E), while items #2 and #4 fail for one parameter (item #2 for D, item #4 for E).

I have tried to define the following masks:

df_mask_1 = abs(df['A'])>=100
df_mask_2 = abs(df['B'])>=100
df_mask_3 = abs(df['C'])>=70
df_mask_4 = ((df['D']<=110) | (df['D']>=800))
df_mask_5 = ((df['E']<=110) | (df['E']>=800))

to get the filtered dataframe:

filtered_df = df[df_mask_1 & df_mask_2 & df_mask_3 & df_mask_4 & df_mask_5]

but what I obtain is just an empty dataframe. I have also tried with

filtered_df = df.loc[(df_mask_1) & (df_mask_2) & (df_mask_3) & (df_mask_4) & (df_mask_5)]

but the result does not change. Any suggestion?

CodePudding user response:

Build a list of conditions, flatten your dataframe with melt, keep only the rows where the condition is False (~x), then collect the failing column names per row with groupby + apply:

condlist = [df['A'].between(-100, 100),
            df['B'].between(-100, 100),
            df['C'].between(-70, 70),
            df['D'].between(100, 300),
            df['E'].between(100, 300)]

df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
               .loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)


# Output:
  Item    A    B   C    D    E    Fail
0    1  142   55  68   60  230  [A, D]
1    2   11   65 -18    0  200     [D]
2    3   50  130  65  150    5  [B, E]
3    4   60   14  16  170   10     [E]
4    5   12   69  17  130  160     NaN

Note: if your dataframe is large and you only need to display failed items, use df[df['Fail'].notna()] to filter your dataframe.

Note 2: variable and value are the default column names when you melt a dataframe.
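To also get the summary the question asks for (how many items fail one parameter versus several), one could follow up on the Fail column, for example like this (a sketch reusing the df and condlist from above):

```python
import pandas as pd

data = {'Item': ['1', '2', '3', '4', '5'],
        'A': [142, 11, 50, 60, 12],
        'B': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'D': [60, 0, 150, 170, 130],
        'E': [230, 200, 5, 10, 160]}
df = pd.DataFrame(data)

condlist = [df['A'].between(-100, 100),
            df['B'].between(-100, 100),
            df['C'].between(-70, 70),
            df['D'].between(100, 300),
            df['E'].between(100, 300)]

df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
               .loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)

# Tally failed items by how many parameters they break:
# len 2 -> failed for two parameters, len 1 -> failed for one
summary = df['Fail'].dropna().apply(len).value_counts()
print(summary)  # 2 items fail two parameters, 2 items fail one
```

Here value_counts is applied to the length of each Fail list, so the index is the number of broken parameters and the value is how many items break that many.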

CodePudding user response:

The result from the example code is what it should be. The code filtered_df = df[df_mask_1 & df_mask_2 & df_mask_3 & df_mask_4 & df_mask_5] applies all masks at once with AND, so df_mask_1 and df_mask_2 alone already produce an empty table (there are no items where A and B are both at least 100).
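To see this concretely: combining per-parameter "out of limits" masks with | (OR) instead of & keeps every row that violates at least one limit, while AND would require violating all five at once. A minimal sketch, with the mask thresholds written to match the limits stated in the question rather than the 110/800 values from the attempted code:

```python
import pandas as pd

data = {'Item': ['1', '2', '3', '4', '5'],
        'A': [142, 11, 50, 60, 12],
        'B': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'D': [60, 0, 150, 170, 130],
        'E': [230, 200, 5, 10, 160]}
df = pd.DataFrame(data)

# True where the value is OUTSIDE the allowed range for that parameter
df_mask_1 = ~df['A'].between(-100, 100)
df_mask_2 = ~df['B'].between(-100, 100)
df_mask_3 = ~df['C'].between(-70, 70)
df_mask_4 = ~df['D'].between(100, 300)
df_mask_5 = ~df['E'].between(100, 300)

# OR keeps rows failing at least one limit; AND of these masks is empty
filtered_df = df[df_mask_1 | df_mask_2 | df_mask_3 | df_mask_4 | df_mask_5]
print(filtered_df['Item'].tolist())  # ['1', '2', '3', '4']
```

Item 5 is the only row inside every range, so it is the only one dropped.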

If you are looking to create a table with information on how many items have parameters outside limits, I'd suggest following Corralien's answer and summing per row to get the number of limits broken.

Taking Corralien's answer, then

a = pd.concat([df['A'].between(-100, 100),
               df['B'].between(-100, 100),
               df['C'].between(-70, 70),
               df['D'].between(100, 300),
               df['E'].between(100, 300)], axis=1)

b = (~a).sum(axis=1)

where b gives per item the amount of limits "broken" (note the ~: the between conditions are True when a value is within limits, so they must be inverted before summing)

0    2
1    1
2    2
3    1
4    0
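From there, value_counts on the per-item count of broken limits gives the distribution the question asks for. A sketch assuming the same df and the limits stated in the question:

```python
import pandas as pd

data = {'Item': ['1', '2', '3', '4', '5'],
        'A': [142, 11, 50, 60, 12],
        'B': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'D': [60, 0, 150, 170, 130],
        'E': [230, 200, 5, 10, 160]}
df = pd.DataFrame(data)

a = pd.concat([df['A'].between(-100, 100),
               df['B'].between(-100, 100),
               df['C'].between(-70, 70),
               df['D'].between(100, 300),
               df['E'].between(100, 300)], axis=1)

# a is True where within limits, so invert before summing to count broken limits
broken = (~a).sum(axis=1)
print(broken.value_counts())  # 2 items break two limits, 2 break one, 1 breaks none
```

The index of the result is the number of limits broken; the value is how many items break that many.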

CodePudding user response:

Another approach is to use the pandas.DataFrame.agg function like so:

import pandas as pd

# Defining a limit checker: counts values outside the allowed [low, high] range
def limit_check(values, low, high):
    return (~values.between(low, high)).sum()

# Applying a different limit check to each column
res = df.agg({
    'A': lambda x: limit_check(x, -100, 100),
    'B': lambda x: limit_check(x, -100, 100),
    'C': lambda x: limit_check(x, -70, 70),
    'D': lambda x: limit_check(x, 100, 300),
    'E': lambda x: limit_check(x, 100, 300),
})
print(res)

For the sample data you provided it will result in

A    1
B    1
C    0
D    2
E    2
dtype: int64
