How to filter a dataframe and identify records based on a condition on multiple other columns


            id          zone  price
0        0000001           1   33.0
1        0000001           2   24.0
2        0000001           3   34.0
3        0000001           4   45.0
4        0000001           5   51.0

I have the pandas dataframe above; it contains multiple ids (only one id is shown here). Each id has 5 zones and 5 prices, and the prices should follow the pattern below:

p1 (price of zone 1) < p2 < p3 < p4 < p5

If anything is out of order, we should identify the anomalous records and print them to a file.

In this example p3 < p4 < p5, but p1 and p2 are erroneous (p1 > p2 whereas p1 < p2 is expected).

Therefore the first 2 records should be printed to a file.

Likewise, this has to be done for all unique ids in the entire dataframe.

My dataframe is huge; what is the most efficient way to do this filtering and identify the erroneous records?
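For reference, the example dataframe above can be rebuilt with the snippet below (the zero-padded `id` is stored as a string here, an assumption based on the leading zeros shown):

```python
import pandas as pd

# Reconstruct the sample data shown above (one id, five zones).
df = pd.DataFrame({
    "id": ["0000001"] * 5,
    "zone": [1, 2, 3, 4, 5],
    "price": [33.0, 24.0, 34.0, 45.0, 51.0],
})
```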

CodePudding user response:

You can compute the diff per group after sorting the values to ensure the zones are increasing. If the diff is ≤ 0 the price is not strictly increasing and the rows should be flagged:

s = (df.sort_values(by=['id', 'zone'])  # sort rows by id and zone
       .groupby('id')                   # group by id
       ['price'].diff()                 # difference with the previous price
       .le(0)                           # flag diffs ≤ 0 (not increasing)
     )
df[s | s.shift(-1, fill_value=False)]   # flagged rows plus the row before each

Example output:

   id  zone  price
0   1     1   33.0
1   1     2   24.0

Example input:

   id  zone  price
0   1     1   33.0
1   1     2   24.0
2   1     3   34.0
3   1     4   45.0
4   1     5   51.0
5   2     1   20.0
6   2     2   24.0
7   2     3   34.0
8   2     4   45.0
9   2     5   51.0
Saving to a file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
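Putting the approach together as a self-contained sketch, using the example input above (only the anomalous pair from id 1 should be flagged; the output filename is illustrative):

```python
import pandas as pd

# Example input: id 1 has an out-of-order pair (33 > 24), id 2 is clean.
df = pd.DataFrame({
    "id": [1] * 5 + [2] * 5,
    "zone": list(range(1, 6)) * 2,
    "price": [33.0, 24.0, 34.0, 45.0, 51.0,
              20.0, 24.0, 34.0, 45.0, 51.0],
})

# Flag rows whose price did not strictly increase within their id.
s = (df.sort_values(by=["id", "zone"])
       .groupby("id")["price"]
       .diff()
       .le(0))

# Keep each flagged row plus the row before it (the pair that is out of order).
bad = df[s | s.shift(-1, fill_value=False)]
```

`bad` can then be written out with `bad.to_csv('incorrect_prices.csv')` as shown above. Note that the first row of each group has a `NaN` diff, which `.le(0)` treats as `False`, so group boundaries are not falsely flagged.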

CodePudding user response:

Another way would be to first sort your dataframe by id and zone in ascending order, then compare each price with the previous one using groupby.shift(), storing the result in a new column. Then you can just print out the prices that have fallen in value:

import numpy as np
import pandas as pd

df = df.sort_values(by=['id', 'zone'], ascending=True)
df['increase'] = np.where(df.zone.eq(1), 'no change',
                          np.where(df.groupby('id')['price'].shift(1) < df['price'],
                                   'inc', 'dec'))

>>> df

    id  zone  price   increase
0    1     1     33  no change
1    1     2     24        dec
2    1     3     34        inc
3    1     4     45        inc
4    1     5     51        inc
5    2     1     34  no change
6    2     2     56        inc
7    2     3     22        dec
8    2     4     55        inc
9    2     5     77        inc
10   3     1     44  no change
11   3     2     55        inc
12   3     3     44        dec
13   3     4     66        inc
14   3     5     33        dec

>>> df.loc[df.increase.eq('dec')]

    id  zone  price increase
1    1     2     24      dec
7    2     3     22      dec
12   3     3     44      dec
14   3     5     33      dec

I have added some extra IDs to try and mimic your real data.
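The same approach as a runnable sketch, rebuilding the three-id sample data shown in the tables above:

```python
import numpy as np
import pandas as pd

# Sample data matching the answer's example (three ids, five zones each).
df = pd.DataFrame({
    "id": [1] * 5 + [2] * 5 + [3] * 5,
    "zone": list(range(1, 6)) * 3,
    "price": [33, 24, 34, 45, 51,
              34, 56, 22, 55, 77,
              44, 55, 44, 66, 33],
})

df = df.sort_values(by=["id", "zone"])

# Zone 1 has no previous price; otherwise compare with the previous zone's price.
df["increase"] = np.where(df.zone.eq(1), "no change",
                          np.where(df.groupby("id")["price"].shift(1) < df["price"],
                                   "inc", "dec"))

# Rows where the price fell relative to the previous zone.
anomalies = df.loc[df.increase.eq("dec")]
```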
