Home > Back-end >  filter data based on a condition
filter data based on a condition

Time:07-27

I have a data like this

enter image description here

I want to segment/ filter the data into A, B, C buckets based on the column cumulative_percent and condition 0 - ~ 80% of cumulative_percent needs to be in bucket A, ~80 - ~95% in bucket B and ~95% - 100% in bucket C. The problem I'm facing is I dont want to divide materials into two different buckets while creating this filter, I want to filter in the closest percent(whether greater than or lesser than 80% doesn't matter) .

For example,

enter image description here

if I add a filter like


if cumulative_percent <= 79.4086332486936 :
    return 'A'

This would add Material No 901573047 and 913119 in Bucket A, but the rest of material 913119 might have a cumulative_percent of 80.00023232

CodePudding user response:

I would suggest using some reproducible code rather than posting an image as they are not very helpful (see here).

You can filter a DataFrame based on a condition using loc in pandas.

So assuming your DataFrame is called df

df = df.loc[df['cumulative_percent'] < 80, 'Bucket'] = 'A'
df = df.loc[(df['cumulative_percent'] >= 80) & (df['cumulative_percent'] < 95), 'Bucket'] = 'B'
df = df.loc[df['cumulative_percent'] >= 95, 'Bucket'] = 'C'

From there you now have a new column for each of the newly categorized rows. If you are saying that there are rows were the value of cumulative_percent is not correct or does not match other Material No then you would need create a rule to account for how to deal with incorrect values.

If you just want to say for each Material No use the minimum cumulative_percent you could do that something like this:

df['new_cumulative_percent'] = df.groupby('Planning_Material')['cumulative_percent'].transform('min')

And then just filter on the new "corrected" cumulative_percent values

CodePudding user response:

You can use pd.cut to create categories:

df['bucket'] = pd.cut(df['cumulative_percent'], [0, 80, 95, 100], labels=['A', 'B', 'C'])
print(df)

# Output
     cumulative_percent bucket
0              0.477769      A
1              1.019964      A
2              1.582226      A
3              1.808495      A
4              2.450631      A
..                  ...    ...
195          100.000000      C
196          100.000000      C
197          100.000000      C
198          100.000000      C
199          100.000000      C

[200 rows x 2 columns]
  • Related