I have a dataframe as follows:
df =
A col_1 col_45 col_9 col_10
1.0 4.0 45.0 NaN 34.9 NaN
2.0 4.0 NaN NaN 23.4 45.6
3.0 49.2 10.8 NaN NaN 23.8
For every row in the dataframe, I want to check if there is any value present for col_1
, col_45
, col_9
and col_10
. If there is a value present, I want to get the number for the name of the column and put in a list/array. For example.
In the first row,there are values present for col_1
and col_9
. So I want to get the numbers 1
and 9
from the column names and put in a list/array. This should be done for every row.
Is there an easy way to do this?
CodePudding user response:
For column names, you could use a stack
(to get rid of all NaN automatically), then a GroupBy.agg
to form the lists:
out = (df.set_index('A', append=True)
.stack().reset_index(-1)
.groupby(level=0)['level_2'].agg(list)
)
output:
1.0 [col_1, col_9]
2.0 [col_9, col_10]
3.0 [col_1, col_10]
Name: level_2, dtype: object
For digits:
out = (df.set_index('A', append=True)
.stack().reset_index(-1)
['level_2'].str.extract('_(\d )', expand=False).astype(int)
.groupby(level=0).agg(list)
)
output:
1.0 [1, 9]
2.0 [9, 10]
3.0 [1, 10]
Name: level_2, dtype: object
Alternative
Other approach, you can use a mask and a matrix multiplication:
cols = df.filter(like='col').columns
# Index(['col_1', 'col_45', 'col_9', 'col_10'], dtype='object')
int_cols = cols.str.extract('_(\d )$', expand=False).astype(int)
# Int64Index([1, 45, 9, 10], dtype='int64')
m = df[cols].notna()
# col_1 col_45 col_9 col_10
# 1.0 True False True False
# 2.0 False False True True
# 3.0 True False False True
out = (m.astype(int).mul(int_cols).where(m).convert_dtypes()
.stack().groupby(level=0).agg(list)
)
# 1.0 [1, 9]
# 2.0 [9, 10]
# 3.0 [1, 10]
# dtype: object
CodePudding user response:
Use DataFrame.drop
with DataFrame.melt
for unpivot, remove missing values, convert columns names to integers, and aggregate lists
:
df['new'] = (df.drop('A',1)
.melt(ignore_index=False, var_name='a')
.dropna().a
.str.replace('\D','', regex=True)
.astype(int)
.groupby(level=0)
.agg(list))
print (df)
A col_1 col_45 col_9 col_10 new
1.0 4.0 45.0 NaN 34.9 NaN [1, 9]
2.0 4.0 NaN NaN 23.4 45.6 [9, 10]
3.0 49.2 10.8 NaN NaN 23.8 [1, 10]