Filter Dataframe Based on Differnce Between Columns [Not Next-to-Next Columns]-CodePudding

I have a unique dataframe, df:

name         val_1       val_2      val_3     val_4

AAA            1          2           3         11       
BBB            2          3           5         9
CCC            6          4           15        10

I need to keep only that name where any right-side val column increases by 10 from any of the previous val columns, otherwise drop them.

I know that diff() and ge() would be helpful here, but not sure how they would work if the difference isn't about between next-to-next column.

Desired output:

name

AAA #val_4 increases by 10 from val_1  
CCC #val_3 increases by 11 from val_2

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response：

Something like this would drop any numbers to the right of the max number per row, and then take the diff between the min/max and look for rows > 10

df.loc[df.iloc[:,1:].apply(lambda x: x[0:x.idxmax() 1]).max(1) - df.iloc[:,1:].apply(lambda x: x[0:x.idxmax() 1]).min(1) >= 10].name.tolist()

Output

['AAA','CCC']

CodePudding user response：

I'm sure, you don't need to create a column first and then filter the data (but I don't know how at the moment), but I guess the df.apply combined with max(x) - min(x) is a good way to go here.

df = pd.DataFrame(
    {
        'name' : ['AAA', 'BBB', 'CCC'],
        'val_1' : [1,2,6],
        'val_2' : [2,3,4],
        'val_3' : [3,5,15],
        'val_4' : [11,9,10]
    }
)

df.set_index('name', drop=True, inplace=True)
df['rows_to_keep'] = df.apply(lambda x: x[-1] - x[0], axis=1)
df = df[df['rows_to_keep'] >= 10]

CodePudding user response：

Ugly, but I think it is doing what is required. Notice that I changed the second value of val_1 to 20, so that we can ensure we are not getting differences from left to right, instead of only right to left.

import pandas as pd

df = pd.DataFrame(
    {
        'name' : ['AAA', 'BBB', 'CCC'],
        'val_1' : [1, 20, 6],
        'val_2' : [2, 3, 4],
        'val_3' : [3, 5, 15],
        'val_4' : [11, 9, 10]
    }
)

df = df.set_index("name")

f = lambda x: [x[1]-x[0], 0][len(x)>2] or \
    max( *[ x[-1] - x[i] for i in range(0, len(x)-1)] , *(f(x[:-1]),) )

df = df[df.apply(f, axis=1) >= 10]

print(df.index.to_list())