Check if a value is present in the row and extract the name of column - Pandas


I have a dataframe as follows:

df =

        A  col_1  col_45  col_9  col_10
1.0   4.0   45.0     NaN   34.9     NaN
2.0   4.0    NaN     NaN   23.4    45.6
3.0  49.2   10.8     NaN    NaN    23.8

For every row in the dataframe, I want to check whether a value is present in any of col_1, col_45, col_9 and col_10. For each column that has a value, I want to extract the number from the column name and put it in a list/array. For example:

In the first row, there are values present for col_1 and col_9, so I want to get the numbers 1 and 9 from those column names and put them in a list/array. This should be done for every row.

Is there an easy way to do this?
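For reference, the example frame can be reconstructed like this (assuming the leftmost numbers are the index and A is an ordinary column):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example frame from the question.
df = pd.DataFrame(
    {
        "A": [4.0, 4.0, 49.2],
        "col_1": [45.0, np.nan, 10.8],
        "col_45": [np.nan, np.nan, np.nan],
        "col_9": [34.9, 23.4, np.nan],
        "col_10": [np.nan, 45.6, 23.8],
    },
    index=[1.0, 2.0, 3.0],
)
print(df)
```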

CodePudding user response:

For column names, you could use stack (which drops all NaN automatically), then GroupBy.agg to form the lists:

out = (df.set_index('A', append=True)   # keep A out of the stacked values
         .stack()                       # long format, NaN dropped
         .reset_index(-1)               # column names -> 'level_2' column
         .groupby(level=0)['level_2'].agg(list)
       )

output:

1.0     [col_1, col_9]
2.0    [col_9, col_10]
3.0    [col_1, col_10]
Name: level_2, dtype: object

For digits:

out = (df.set_index('A', append=True)
         .stack().reset_index(-1)
         ['level_2'].str.extract(r'_(\d+)', expand=False).astype(int)
         .groupby(level=0).agg(list)
       )

output:

1.0     [1, 9]
2.0    [9, 10]
3.0    [1, 10]
Name: level_2, dtype: object
Alternative

As another approach, you can use a mask and a matrix multiplication:

cols = df.filter(like='col').columns
# Index(['col_1', 'col_45', 'col_9', 'col_10'], dtype='object')
int_cols = cols.str.extract(r'_(\d+)$', expand=False).astype(int)
# Int64Index([1, 45, 9, 10], dtype='int64')

m = df[cols].notna()
#      col_1  col_45  col_9  col_10
# 1.0   True   False   True   False
# 2.0  False   False   True    True
# 3.0   True   False  False    True

out = (m.astype(int).mul(int_cols).where(m).convert_dtypes()
        .stack().groupby(level=0).agg(list)
      )
# 1.0     [1, 9]
# 2.0    [9, 10]
# 3.0    [1, 10]
# dtype: object
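The matrix-multiplication idea can also be applied to the column names themselves: dotting the boolean mask with the (comma-suffixed) names concatenates, per row, the names of the non-NaN columns. A minimal sketch, assuming the same df as in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [4.0, 4.0, 49.2],
     "col_1": [45.0, np.nan, 10.8],
     "col_45": [np.nan] * 3,
     "col_9": [34.9, 23.4, np.nan],
     "col_10": [np.nan, 45.6, 23.8]},
    index=[1.0, 2.0, 3.0],
)

cols = df.filter(like='col').columns
m = df[cols].notna()

# True * 'name,' == 'name,' and False * 'name,' == '',
# so the dot product joins the matching names per row.
out = m.dot(cols + ',').str.rstrip(',').str.split(',')
print(out)
# 1.0     [col_1, col_9]
# 2.0    [col_9, col_10]
# 3.0    [col_1, col_10]
```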

CodePudding user response:

Use DataFrame.drop with DataFrame.melt to unpivot, remove missing values, extract the digits from the column names, convert them to integers, and aggregate lists:

df['new'] = (df.drop(columns='A')
               .melt(ignore_index=False, var_name='a')
               .dropna().a
               .str.replace(r'\D', '', regex=True)
               .astype(int)
               .groupby(level=0)
               .agg(list))

print(df)
        A  col_1  col_45  col_9  col_10      new
1.0   4.0   45.0     NaN   34.9     NaN   [1, 9]
2.0   4.0    NaN     NaN   23.4    45.6  [9, 10]
3.0  49.2   10.8     NaN    NaN    23.8  [1, 10]
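For completeness, a plain row-wise sketch (slower on large frames, but easy to read) produces the same lists; it assumes the same df as in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [4.0, 4.0, 49.2],
     "col_1": [45.0, np.nan, 10.8],
     "col_45": [np.nan] * 3,
     "col_9": [34.9, 23.4, np.nan],
     "col_10": [np.nan, 45.6, 23.8]},
    index=[1.0, 2.0, 3.0],
)

cols = df.columns.drop('A')
nums = cols.str.extract(r'_(\d+)$', expand=False).astype(int)

# Per row: keep the extracted numbers where the value is not NaN.
df['new'] = df[cols].notna().apply(lambda r: list(nums[r.to_numpy()]), axis=1)
print(df['new'])
# 1.0     [1, 9]
# 2.0    [9, 10]
# 3.0    [1, 10]
```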