Home > Software design >  Filter Dataframe Based on Differnce Between Columns [Not Next-to-Next Columns]
Filter Dataframe Based on Differnce Between Columns [Not Next-to-Next Columns]

Time:12-14

I have a unique dataframe, df:

name         val_1       val_2      val_3     val_4

AAA            1          2           3         11       
BBB            2          3           5         9
CCC            6          4           15        10

I need to keep only that name where any right-side val column increases by 10 from any of the previous val columns, otherwise drop them.

I know that diff() and ge() would be helpful here, but not sure how they would work if the difference isn't about between next-to-next column.

Desired output:

name

AAA #val_4 increases by 10 from val_1  
CCC #val_3 increases by 11 from val_2 

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response:

Something like this would drop any numbers to the right of the max number per row, and then take the diff between the min/max and look for rows > 10

df.loc[df.iloc[:,1:].apply(lambda x: x[0:x.idxmax() 1]).max(1) - df.iloc[:,1:].apply(lambda x: x[0:x.idxmax() 1]).min(1) >= 10].name.tolist()

Output

['AAA','CCC']

CodePudding user response:

I'm sure, you don't need to create a column first and then filter the data (but I don't know how at the moment), but I guess the df.apply combined with max(x) - min(x) is a good way to go here.

df = pd.DataFrame(
    {
        'name' : ['AAA', 'BBB', 'CCC'],
        'val_1' : [1,2,6],
        'val_2' : [2,3,4],
        'val_3' : [3,5,15],
        'val_4' : [11,9,10]
    }
)

df.set_index('name', drop=True, inplace=True)
df['rows_to_keep'] = df.apply(lambda x: x[-1] - x[0], axis=1)
df = df[df['rows_to_keep'] >= 10]

CodePudding user response:

Ugly, but I think it is doing what is required. Notice that I changed the second value of val_1 to 20, so that we can ensure we are not getting differences from left to right, instead of only right to left.

import pandas as pd

df = pd.DataFrame(
    {
        'name' : ['AAA', 'BBB', 'CCC'],
        'val_1' : [1, 20, 6],
        'val_2' : [2, 3, 4],
        'val_3' : [3, 5, 15],
        'val_4' : [11, 9, 10]
    }
)

df = df.set_index("name")

f = lambda x: [x[1]-x[0], 0][len(x)>2] or \
    max( *[ x[-1] - x[i] for i in range(0, len(x)-1)] , *(f(x[:-1]),) )

df = df[df.apply(f, axis=1) >= 10]

print(df.index.to_list())
  • Related