I have a unique dataframe, df
:
name val_1 val_2 val_3 val_4
AAA 1 2 3 11
BBB 2 3 5 9
CCC 6 4 15 10
I need to keep only that name where any right-side val
column increases by 10 from any of the previous val
columns, otherwise drop them.
I know that diff()
and ge()
would be helpful here, but not sure how they would work if the difference isn't about between next-to-next column.
Desired output:
name
AAA #val_4 increases by 10 from val_1
CCC #val_3 increases by 11 from val_2
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
CodePudding user response:
Something like this would drop any numbers to the right of the max number per row, and then take the diff between the min/max and look for rows > 10
df.loc[df.iloc[:,1:].apply(lambda x: x[0:x.idxmax() 1]).max(1) - df.iloc[:,1:].apply(lambda x: x[0:x.idxmax() 1]).min(1) >= 10].name.tolist()
Output
['AAA','CCC']
CodePudding user response:
I'm sure, you don't need to create a column first and then filter the data (but I don't know how at the moment), but I guess the df.apply
combined with max(x) - min(x)
is a good way to go here.
df = pd.DataFrame(
{
'name' : ['AAA', 'BBB', 'CCC'],
'val_1' : [1,2,6],
'val_2' : [2,3,4],
'val_3' : [3,5,15],
'val_4' : [11,9,10]
}
)
df.set_index('name', drop=True, inplace=True)
df['rows_to_keep'] = df.apply(lambda x: x[-1] - x[0], axis=1)
df = df[df['rows_to_keep'] >= 10]
CodePudding user response:
Ugly, but I think it is doing what is required. Notice that I changed the second value of val_1 to 20, so that we can ensure we are not getting differences from left to right, instead of only right to left.
import pandas as pd
df = pd.DataFrame(
{
'name' : ['AAA', 'BBB', 'CCC'],
'val_1' : [1, 20, 6],
'val_2' : [2, 3, 4],
'val_3' : [3, 5, 15],
'val_4' : [11, 9, 10]
}
)
df = df.set_index("name")
f = lambda x: [x[1]-x[0], 0][len(x)>2] or \
max( *[ x[-1] - x[i] for i in range(0, len(x)-1)] , *(f(x[:-1]),) )
df = df[df.apply(f, axis=1) >= 10]
print(df.index.to_list())