Let's say we have the following dataframe:
import pandas as pd

data = {'Item': ['1', '2', '3', '4', '5'],
        'A': [142, 11, 50, 60, 12],
        'B': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'D': [60, 0, 150, 170, 130],
        'E': [230, 200, 5, 10, 160]}
df = pd.DataFrame(data)
representing different items and the values of some of their parameters (e.g. length, width, and so on). However, not all the reported values are acceptable: each parameter has a different range of allowed values:
A and B go from -100 to 100
C goes from -70 to 70
D and E go from 100 to 300
So, as you can see, some items show values outside limits for just one parameter (e.g. item 2 is outside for parameter D), while others are outside for more than one parameter (e.g. item 1 is outside for both parameters A and D).
I need to analyze these data and get a table reporting: how many items are out of range for just one parameter, and the name of that parameter; how many items are out of range for more than one parameter, and the names of those parameters. To be clearer: I need a simple way to know how many items failed and for which parameters. For example: four items out of five failed; two of them (item #1 and item #3) for two parameters (item #1 for A and D, item #3 for B and E), while items #2 and #4 are out for one parameter (item #2 for D, item #4 for E).
I have tried to define the following masks:
df_mask_1 = abs(df['A'])>=100
df_mask_2 = abs(df['B'])>=100
df_mask_3 = abs(df['C'])>=70
df_mask_4 = ((df['D']<=110) | (df['D']>=800))
df_mask_5 = ((df['E']<=110) | (df['E']>=800))
to get the filtered dataframe:
filtered_df = df[df_mask_1 & df_mask_2 & df_mask_3 & df_mask_4 & df_mask_5]
but what I obtain is just an empty dataframe. I have also tried with
filtered_df = df.loc[(df_mask_1) & (df_mask_2) & (df_mask_3) & (df_mask_4) & (df_mask_5)]
but the result does not change. Any suggestions?
CodePudding user response:
Build a list of conditions, unpivot your dataframe with melt, keep the rows where the condition is False (~x['value']), then collect the failing parameter names per row with groupby + apply:
condlist = [df['A'].between(-100, 100),
            df['B'].between(-100, 100),
            df['C'].between(-70, 70),
            df['D'].between(100, 300),
            df['E'].between(100, 300)]

df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
               .loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)
# Output:
  Item    A    B    C    D    E    Fail
0    1  142   55   68   60  230  [A, D]
1    2   11   65  -18    0  200     [D]
2    3   50  130   65  150    5  [B, E]
3    4   60   14   16  170   10     [E]
4    5   12   69   17  130  160     NaN
Note: if your dataframe is large and you only need to display the failed items, use df[df['Fail'].notna()] to keep only the rows that fail at least one check.
Note 2: variable and value are the default column names created when you melt a dataframe.
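To turn the Fail column into the summary the question asks for (how many items fail one parameter vs. several), one possible follow-up sketch, reusing the approach above on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Item': ['1', '2', '3', '4', '5'],
                   'A': [142, 11, 50, 60, 12],
                   'B': [55, 65, 130, 14, 69],
                   'C': [68, -18, 65, 16, 17],
                   'D': [60, 0, 150, 170, 130],
                   'E': [230, 200, 5, 10, 160]})

condlist = [df['A'].between(-100, 100),
            df['B'].between(-100, 100),
            df['C'].between(-70, 70),
            df['D'].between(100, 300),
            df['E'].between(100, 300)]

# Collect the failing parameter names per row, as above
df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
               .loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)

# Number of failed parameters per item (NaN means all parameters are in range)
n_fail = df['Fail'].str.len()
print(int(df['Fail'].notna().sum()))             # 4 -> four items out of five failed
print(int((n_fail == 1).sum()), int((n_fail == 2).sum()))  # 2 fail one param, 2 fail two
```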
CodePudding user response:
The result from the example code is what it should be. The code filtered_df = df[df_mask_1 & df_mask_2 & df_mask_3 & df_mask_4 & df_mask_5] applies all masks at once, so df_mask_1 and df_mask_2 alone already produce an empty table (there are no items where A and B are both out of range).
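To salvage the mask approach from the question, a minimal sketch: build per-parameter "out of range" masks (here using the ranges stated in the question rather than the thresholds in the original masks) and combine them with | instead of &, so a row is kept when it fails at least one check:

```python
import pandas as pd

df = pd.DataFrame({'Item': ['1', '2', '3', '4', '5'],
                   'A': [142, 11, 50, 60, 12],
                   'B': [55, 65, 130, 14, 69],
                   'C': [68, -18, 65, 16, 17],
                   'D': [60, 0, 150, 170, 130],
                   'E': [230, 200, 5, 10, 160]})

# True where the value is OUTSIDE its allowed range
mask_a = ~df['A'].between(-100, 100)
mask_b = ~df['B'].between(-100, 100)
mask_c = ~df['C'].between(-70, 70)
mask_d = ~df['D'].between(100, 300)
mask_e = ~df['E'].between(100, 300)

# | keeps rows failing at least one check; & would require failing all five
filtered_df = df[mask_a | mask_b | mask_c | mask_d | mask_e]
print(filtered_df['Item'].tolist())  # ['1', '2', '3', '4']
```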
If you are looking to create a table with information on how many items have parameters outside the limits, I'd suggest following Corralien's answer and summing per row to get the number of limits broken.
Taking Corralien's answer, then
a = pd.concat([df['A'].between(-100, 100),
               df['B'].between(-100, 100),
               df['C'].between(-70, 70),
               df['D'].between(100, 300),
               df['E'].between(100, 300)], axis=1)
b = (~a).sum(axis=1)
where b gives per item the number of limits broken (note the negation: a is True where a value is within its range, so summing a itself would count the limits satisfied)
0    2
1    1
2    2
3    1
4    0
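Tallying those per-item counts gives the requested summary directly; a self-contained sketch (note the negation, so the sum counts limits broken rather than satisfied):

```python
import pandas as pd

df = pd.DataFrame({'Item': ['1', '2', '3', '4', '5'],
                   'A': [142, 11, 50, 60, 12],
                   'B': [55, 65, 130, 14, 69],
                   'C': [68, -18, 65, 16, 17],
                   'D': [60, 0, 150, 170, 130],
                   'E': [230, 200, 5, 10, 160]})

a = pd.concat([df['A'].between(-100, 100),
               df['B'].between(-100, 100),
               df['C'].between(-70, 70),
               df['D'].between(100, 300),
               df['E'].between(100, 300)], axis=1)

broken = (~a).sum(axis=1)                     # limits broken per item
summary = broken.value_counts().sort_index()  # items per number of broken limits
print(summary)  # 1 item breaks no limit, 2 break one, 2 break two
```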
CodePudding user response:
Another approach is to use the pandas.DataFrame.agg function like so:
import pandas as pd
# Defining a limit checker that counts values outside an allowed [low, high] range
def limit_check(values, low, high):
    return (~values.between(low, high)).sum()

# Applying each parameter's limits to its column
res = df.agg({
    'A': lambda x: limit_check(x, -100, 100),
    'B': lambda x: limit_check(x, -100, 100),
    'C': lambda x: limit_check(x, -70, 70),
    'D': lambda x: limit_check(x, 100, 300),
    'E': lambda x: limit_check(x, 100, 300),
})
print(res)
For the sample data you provided it will result in
A    1
B    1
C    0
D    2
E    2
dtype: int64