Home > Enterprise >  pandas get first n columns per row
pandas get first n columns per row

Time:08-19

I have the following dataframe:

df1 = pd.DataFrame(data={'1': ['a', 'd', 'g', 'j'], 
                         '2': ['b', 'e', 'h', 'k'], 
                         '3': ['c', 'f', 'i', 'l'],
                         'top_n': [1, 3, 2, 1]},
                   index=pd.Series(['ind1', 'ind2', 'ind3', 'ind4'], name='index'))
>>> df1
       1  2  3   top_n
index
ind1   a  b  c    1
ind2   d  e  f    3
ind3   g  h  i    2
ind4   j  k  l    1

How would I only get the first N values for each row based on the top_n column?

>>> df1
       1  2    3      top_n
index
ind1   a  NaN  NaN     1
ind2   d  e    f       3
ind3   g  h    NaN     2
ind4   j  NaN  NaN     1

In this example, ind3 has g and h because the top_n values is 2.

CodePudding user response:

Using for boolean masking:

import numpy as np

a = np.ones_like(df1).cumsum(1)

mask = ((a <= df1['top_n'].values[:,None]) | (a == df1.shape[1]))

out = df1.where(mask)

output:

       1    2    3  top_n
index                    
ind1   a  NaN  NaN      1
ind2   d    e    f      3
ind3   g    h  NaN      2
ind4   j  NaN  NaN      1

mask:

array([[ True, False, False,  True],
       [ True,  True,  True,  True],
       [ True,  True, False,  True],
       [ True, False, False,  True]])

CodePudding user response:

Let us create a boolean mask by using numpy's broadcasting then use this mask with where to select the top N values in each row

cols = ['1', '2', '3']
mask = df1['top_n'].values[:, None] > range(len(cols))

df1.assign(**df1[cols].where(mask))

       1    2    3  top_n
index                    
ind1   a  NaN  NaN      1
ind2   d    e    f      3
ind3   g    h  NaN      2
ind4   j  NaN  NaN      1

CodePudding user response:

Assuming top_n value will always be equal or greater than the number of other columns, slicing will be enough:

df1.apply(lambda row: row[:row.top_n], axis=1)
  • Related