Home > OS >  How do I set a cell value based on another cell in same row in Python Using Pandas
How do I set a cell value based on another cell in same row in Python Using Pandas

Time:11-17

I am getting my head around Python and Pandas library and trying out some basics but getting lost in documentation.

I have a Pandas DataFrame

A    B    C    D
1    2    3    4
3    4    1    7
6    9    0    1

... other 10k rows.

I now want to add a column say 'E' and that should read True/False if value of 'D' is in top 10% of the entire column.

One way I tried is to sort descending by column 'D' and then update top 10% rows, in this way I am able to sort but have not yet figured out how to update top 10% rows

also this way alters the original order which isn't desirable.

 df = df.sort_values('D',ascending=False)
 df.iloc[:0, :(df.shape[0]-1)/10, 5] = value    --- this doesn't work.

Just checking if there is a way to achieve this without sorting ? if not, how do I update top 10% rows if they were sorted ?

Thanks

CodePudding user response:

If you don't want to use built-in quantile. Use sort method :-

top_10_pc = int(len(df.index) * 0.1)
min_val = min(df.sort_values(by=['D'], ascending=False)[:top_10_pc]['D'])
df['E'] = df['D'] >= min_val

CodePudding user response:

If need top10 values without sorting with duplicates use np.argsort:

np.random.seed(2021)

df = pd.DataFrame(np.random.randint(30, size=(20, 5)), columns=list('ABCDE'))

n = 10
N = int(len(df.index)*(n/100))
print (N)
2

df['mask'] = np.argsort(np.argsort(-df['E'].to_numpy())) < N
print (df)
     A   B   C   D   E   mask
0   20  21  25   0  13  False
1   22  12  27  29  21   True
2   29  24  12  22   6  False
3    6   6   1   5   7  False
4    1  14   1  28   5  False
5   26   2  16   3  17  False
6   16  18  22  27  20  False
7   29  24   5  17   6  False
8   10  14   7  21   6  False
9    9  21  22  25  18  False
10  10   4  13  10  19  False
11  25  18  26  15   8  False
12  10  12  21  11  19  False
13   1  14  17  25  18  False
14   7  21  19  27  12  False
15  23  19   9   4   9  False
16   7  25   7   7  20  False
17  27  29  11  27  19  False
18  18  14  25  27  18  False
19  21  18  26   0  20   True

If need all top2 values is possible compare with Series.nlargest and Series.isin:

df['mask'] = df['E'].isin(df['E'].nlargest(2))
print (df)
     A   B   C   D   E   mask
0   20  21  25   0  13  False
1   22  12  27  29  21   True
2   29  24  12  22   6  False
3    6   6   1   5   7  False
4    1  14   1  28   5  False
5   26   2  16   3  17  False
6   16  18  22  27  20   True
7   29  24   5  17   6  False
8   10  14   7  21   6  False
9    9  21  22  25  18  False
10  10   4  13  10  19  False
11  25  18  26  15   8  False
12  10  12  21  11  19  False
13   1  14  17  25  18  False
14   7  21  19  27  12  False
15  23  19   9   4   9  False
16   7  25   7   7  20   True
17  27  29  11  27  19  False
18  18  14  25  27  18  False
19  21  18  26   0  20   True
  • Related