searching rows for a value that is 90% of the max value in that row and then outputting the index of

Time:08-11

I am a Python newbie and am struggling with this problem.

I am using pandas to read in CSV files with multiple rows (the number changes depending on the CSV file, up to 200,000) and columns (495).

I want to search each row separately to find its max value, then take the value that is 90% of that max and find which column number it is in. For example: row 1 has a max value of 12,098, which is in column 300. 90% of 12,098 gives a value of 10,888. It is unlikely there will be an exact match, so I want to find the nearest match in that row and get its column number (index), which could be column 300, for example. I then want to repeat this for every row.

This is what I have done so far:

  1. Search my rows of data to find the max value:

     maxValues = df.max(axis = 1)

  2. Calculate 90% of this:

     newmax = maxValues / 10 * 9

  3. Then find the value closest to that newmax in the row, and report the column number where that value is. This is the part I can't do. I have tried:

    arr = pulses.to_numpy()
    x =  newmax.values
    difference_array = np.absolute(arr-x).axis=1
    index = difference_array.argmin().axis=1
    

This produces the following error: operands could not be broadcast together with shapes (114,495) (114,)

I can do up to number 2 above, but can't figure out 3. I have tried converting them to arrays as you can see but this only produces errors.

CodePudding user response:

Let's say we have the following DataFrame:

import pandas as pd

d= {'a':[0,1], 'b':[10, 20], 'c':[30, 40], 'd':[15, 30]}
df = pd.DataFrame(data=d)

  1. To go row by row, you can use the apply function with axis=1.
  2. Since you operate on just one row at a time, you can find its maximum with max.
  3. To find the value closest to 0.9 of the maximum, look for the smallest absolute difference between each value and that target.
  4. To write a value into the original DataFrame at a given row index, use at.

So the code would look like this:

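percent = 0.9

def foo(row):
    # Row maximum and the label of the first column that holds it
    max_val = row.max()
    max_col = row[row==max_val].index[0]
    # Target value: 90% of the row maximum
    second_max_val = percent * max_val
    
    # row.name is this row's index label in df
    idx = row.name
    df.at[idx, 'max'] = max_col
    # Closest column to the target, excluding the maximum's own column
    df.at[idx, '0.9max'] = (abs(row.loc[row.index!=max_col] - second_max_val)).idxmin()
    
    return row


df.apply(foo, axis=1)

print(df)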
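
A variation on the same idea, as a rough sketch rather than the answer's own method: the function returns its two results as a Series instead of writing into df while apply is running, so apply assembles a separate result frame. The names summarize and result are made up for this illustration, and it keeps the answer's choice of excluding the maximum's own column from the search.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1], 'b': [10, 20], 'c': [30, 40], 'd': [15, 30]})
percent = 0.9

def summarize(row):  # hypothetical helper name
    # Label of the first column holding the row maximum
    max_col = row.idxmax()
    target = percent * row[max_col]
    # Closest remaining column to the target, excluding the maximum's own column
    others = row.drop(max_col)
    nearest_col = (others - target).abs().idxmin()
    return pd.Series({'max': max_col, '0.9max': nearest_col})

# One output row per input row, with columns 'max' and '0.9max'
result = df.apply(summarize, axis=1)
print(result)
```

If the extra columns are wanted next to the data, df.join(result) would attach them without modifying df inside the loop.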

CodePudding user response:

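Your error occurs because you are subtracting a one-dimensional array from a two-dimensional one (arr - x). NumPy aligns shapes starting from the last dimension, so (114, 495) and (114,) do not match: the 495 columns are lined up against 114 values.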
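
One way to make that subtraction broadcast, as a minimal sketch: give x a trailing axis so it has one target per row. The small random array below is only a stand-in for the question's (114, 495) data, and the variable names simply follow the question.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the (114, 495) data from the question
arr = rng.integers(0, 100, size=(4, 6)).astype(float)

x = arr.max(axis=1) * 0.9   # shape (4,): one target per row

# arr - x fails here too: the trailing dimensions 6 and 4 do not match.
# x[:, None] has shape (4, 1), which broadcasts across the columns.
difference_array = np.abs(arr - x[:, None])   # shape (4, 6)
index = difference_array.argmin(axis=1)       # shape (4,): column position per row

print(index)
```

Note that axis is passed as an argument, argmin(axis=1), rather than set as an attribute as in the question's attempt.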

Consider this sample data frame:

import pandas as pd
import numpy as np

N=5
df = pd.DataFrame({
    "col1": np.random.randint(100, size=(N,)),
    "col2": np.random.randint(100, size=(N,)),
    "col3": np.random.randint(100, size=(N,)),
    "col4": np.random.randint(100, size=(N,)),
    "col5": np.random.randint(100, size=(N,))
})

   col1  col2  col3  col4  col5
0    48    21    74    76    95
1    66     1    13    56    83
2    91    67    96    93    28
3    49    76    39    95    84
4    65    31    61    68    24

IIUC, you could use the following code (no iteration needed; it relies only on numpy and pandas) to find, for each row, the index position of the column whose value is closest to that row's maximum multiplied by 0.9. If two values are equally close, the first index is returned. The code needs only about five seconds for 2 million rows.

Code:

np.argmin(df.sub(df.max(axis=1) * 0.9, axis=0).apply(np.abs).values, axis=1)

Output:

array([3, 4, 0, 4, 2])
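
Since the sample frame is built from random numbers, the sketch below types in the printed values so the result can be reproduced, and also shows one way to turn the positions into column labels. The names target, positions and labels are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

# The sample values printed above, typed in so the result is reproducible
df = pd.DataFrame({
    "col1": [48, 66, 91, 49, 65],
    "col2": [21, 1, 67, 76, 31],
    "col3": [74, 13, 96, 39, 61],
    "col4": [76, 56, 93, 95, 68],
    "col5": [95, 83, 28, 84, 24],
})

target = df.max(axis=1) * 0.9   # 90% of each row's maximum
# Position of the column closest to the target, per row
positions = np.argmin(df.sub(target, axis=0).abs().to_numpy(), axis=1)
# Map positions back to column labels
labels = df.columns[positions]

print(positions)      # [3 4 0 4 2]
print(list(labels))   # ['col4', 'col5', 'col1', 'col5', 'col3']
```

In the first row, col4 (76) and col5 (95) are both 9.5 away from the target of 85.5, so the first of the two positions is returned.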