Home > Software design >  set top 20% value to 1 of each row and bottom 20% value to -1 and np.nan the other
set top 20% value to 1 of each row and bottom 20% value to -1 and np.nan the other

Time:07-05

suppose I have a dataframe like this

    0  1  2  3  4    5    6    7    8     9
0   1  2  3  4  5  NaN  NaN  NaN  NaN   NaN
1   1  2  3  4  5  6.0  7.0  8.0  9.0  10.0
2  10  9  8  7  6  5.0  4.0  3.0  2.0   1.0

I want to set top 20% value to 1 of each row and bottom 20% value to -1 and np.nan the other.(ignore the nan value)

   0    1   2   3    4   5   6   7    8    9
0 -1  NaN NaN NaN  1.0 NaN NaN NaN  NaN  NaN
1 -1 -1.0 NaN NaN  NaN NaN NaN NaN  1.0  1.0
2  1  1.0 NaN NaN  NaN NaN NaN NaN -1.0 -1.0

now I realize it using for loop but it is really slow when the number of row is large. Is there any faster way to realize it?

CodePudding user response:

Use numpy.select:

#count number of non missing values and then 20%
s = df.count(axis=1).mul(0.2).astype(int).to_numpy()

#get indices for top and bottom 20%
a = df.to_numpy()
m1 = np.argsort(-a) < s[:, None]
m2 = np.argsort(a) < s[:, None]

#set values by conditions
df = pd.DataFrame(np.select([m1, m2], [1, -1], np.nan))
print (df)
     0    1   2   3    4   5   6   7    8    9
0 -1.0  NaN NaN NaN  1.0 NaN NaN NaN  NaN  NaN
1 -1.0 -1.0 NaN NaN  NaN NaN NaN NaN  1.0  1.0
2  1.0  1.0 NaN NaN  NaN NaN NaN NaN -1.0 -1.0

CodePudding user response:

I suggest to use numpy nanpercentile

your example data is :

import pandas as pd
data = np.array([
[1,  2 , 3,  4,  5,  np.nan,  np.nan,  np.nan,  np.nan,   np.nan],
[1,  2,  3,  4,  5,  6.0,  7.0,  8.0,  9.0,  10.0],
[  10,  9,  8,  7,  6,  5.0,  4.0,  3.0,  2.0,   1.0]] )
df = pd.DataFrame(data)

my helper function :

import numpy as np

def max_min(arr: np.array) -> np.array:
    low_perc = np.nanpercentile(arr,20)
    high_perc = np.nanpercentile(arr,80)    
    return -(arr<low_perc).astype(int)   (arr>high_perc).astype(int)

and the way to use it :

df.apply(max_min,axis=1)

that gives :

     0    1   2   3    4   5   6   7    8    9
0 -1.0  NaN NaN NaN  1.0 NaN NaN NaN  NaN  NaN
1 -1.0 -1.0 NaN NaN  NaN NaN NaN NaN  1.0  1.0
2  1.0  1.0 NaN NaN  NaN NaN NaN NaN -1.0 -1.0
  • Related