I have the following dataframe:
df1 = pd.DataFrame(data={'1': ['a', 'd', 'g', 'j'],
'2': ['b', 'e', 'h', 'k'],
'3': ['c', 'f', 'i', 'l'],
'top_n': [1, 3, 2, 1]},
index=pd.Series(['ind1', 'ind2', 'ind3', 'ind4'], name='index'))
>>> df1
1 2 3 top_n
index
ind1 a b c 1
ind2 d e f 3
ind3 g h i 2
ind4 j k l 1
How would I only get the first N values for each row based on the top_n
column?
>>> df1
1 2 3 top_n
index
ind1 a NaN NaN 1
ind2 d e f 3
ind3 g h NaN 2
ind4 j NaN NaN 1
In this example, ind3
has g
and h
because the top_n
values is 2.
CodePudding user response:
Using numpy for boolean masking:
import numpy as np
a = np.ones_like(df1).cumsum(1)
mask = ((a <= df1['top_n'].values[:,None]) | (a == df1.shape[1]))
out = df1.where(mask)
output:
1 2 3 top_n
index
ind1 a NaN NaN 1
ind2 d e f 3
ind3 g h NaN 2
ind4 j NaN NaN 1
mask
:
array([[ True, False, False, True],
[ True, True, True, True],
[ True, True, False, True],
[ True, False, False, True]])
CodePudding user response:
Let us create a boolean mask
by using numpy's broadcasting then use this mask
with where
to select the top N values in each row
cols = ['1', '2', '3']
mask = df1['top_n'].values[:, None] > range(len(cols))
df1.assign(**df1[cols].where(mask))
1 2 3 top_n
index
ind1 a NaN NaN 1
ind2 d e f 3
ind3 g h NaN 2
ind4 j NaN NaN 1
CodePudding user response:
Assuming top_n
value will always be equal or greater than the number of other columns, slicing will be enough:
df1.apply(lambda row: row[:row.top_n], axis=1)