In pandas, how do I create a new column that captures a 0 or 1 value as a list of strings?

I've got this data frame. The columns look off but they are ['filename', 'diabetic retinopathy', 'glaucoma', 'others'].

           filename  diabetic retinopathy  glaucoma  others
0  c24a1b14d253.jpg                     0         1       1
1  9ee905a41651.jpg                     0         0       1
2  3f58d128caf6.jpg                     0         1       0
3  4ce6599e7b20.jpg                     0         0       1
4  0def470360e4.jpg                     0         0       1

I would like to create a new column called 'labels' that has lists of strings as its values. For example the entry associated with the first image file (row 0) would be: ['glaucoma', 'others'].

Ultimately, I'd like to have a new df that's just 'filename' and 'labels'. I'm new to pandas, so any help would be appreciated!

CodePudding user response：

Try with apply:

df["labels"] = df.apply(lambda row: df.columns[row.eq(1)].tolist(), axis=1)

>>> df[["filename", "labels"]]

           filename              labels
0  c24a1b14d253.jpg  [glaucoma, others]
1  9ee905a41651.jpg            [others]
2  3f58d128caf6.jpg          [glaucoma]
3  4ce6599e7b20.jpg            [others]
4  0def470360e4.jpg            [others]
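
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same idea. The `label_cols` variable is an assumption added here (it is not in the answer above) so that only the 0/1 columns are compared, rather than the whole row including `filename`.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "filename": ["c24a1b14d253.jpg", "9ee905a41651.jpg", "3f58d128caf6.jpg",
                 "4ce6599e7b20.jpg", "0def470360e4.jpg"],
    "diabetic retinopathy": [0, 0, 0, 0, 0],
    "glaucoma": [1, 0, 1, 0, 0],
    "others": [1, 1, 0, 1, 1],
})

# Hypothetical helper variable: the indicator columns only.
label_cols = df.columns[1:]

# For each row, keep the names of the columns whose value is 1.
df["labels"] = df[label_cols].apply(
    lambda row: label_cols[row.eq(1).to_numpy()].tolist(), axis=1
)

result = df[["filename", "labels"]]
print(result)
```

Restricting the comparison to the indicator columns should also avoid a false match if a non-label column ever happened to contain the value 1.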

CodePudding user response：

Here is a vectorized solution using melt, loc, and groupby/agg:

(df.melt(id_vars='filename', var_name='labels')
   .loc[lambda d: d['value'].eq(1)]
   .groupby('filename')
   .agg({'labels': ','.join})  # use list instead of ','.join to have lists
   .reset_index()
)

While this solution appears to involve more computation than the apply method, it is actually much faster (more than 100 times faster on 10k rows: 1.17 s ± 75 ms for apply vs. 8.51 ms ± 370 µs for the vectorized pipeline).

output:

           filename           labels
0  0def470360e4.jpg           others
1  3f58d128caf6.jpg         glaucoma
2  4ce6599e7b20.jpg           others
3  9ee905a41651.jpg           others
4  c24a1b14d253.jpg  glaucoma,others
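
Note that the output above is sorted by filename (groupby sorts its keys by default), and a file whose indicator columns are all 0 would be filtered out entirely. The sketch below is one possible way to get lists, keep the original row order, and keep such files. The left merge and the empty-list fallback are assumptions added here, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "filename": ["c24a1b14d253.jpg", "9ee905a41651.jpg", "3f58d128caf6.jpg",
                 "4ce6599e7b20.jpg", "0def470360e4.jpg"],
    "diabetic retinopathy": [0, 0, 0, 0, 0],
    "glaucoma": [1, 0, 1, 0, 0],
    "others": [1, 1, 0, 1, 1],
})

# Long format: one row per (filename, label column) pair.
long_df = df.melt(id_vars="filename", var_name="labels")

# Keep only the pairs flagged with 1 and collect the label names per file.
per_file = (
    long_df.loc[long_df["value"].eq(1)]
    .groupby("filename")["labels"]
    .agg(list)
    .reset_index()
)

# Left-merge back onto the original filenames to restore the original row
# order; a file with no 1s at all would come back as NaN here.
out = df[["filename"]].merge(per_file, on="filename", how="left")

# Assumption: represent "no labels" as an empty list instead of NaN.
out["labels"] = out["labels"].apply(lambda v: v if isinstance(v, list) else [])
print(out)
```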

CodePudding user response：

You can use the pandas matrix multiplication method .dot() on the second column onward to create the new labels column, as follows:

df['labels'] = df.iloc[:, 1:].dot(df.columns[1:] + ',').str[:-1].str.split(',')
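
To see why a dot product can build strings, it may help to work through one row by hand. The sketch below uses plain Python rather than pandas and only illustrates the idea; the variable names are made up for this example.

```python
# Indicator column names and the first row of the sample data.
columns = ["diabetic retinopathy", "glaucoma", "others"]
row = [0, 1, 1]

# Multiplying a string by 0 gives "", and by 1 gives the string itself,
# so summing the products concatenates the names of the columns set to 1.
joined = "".join(value * (name + ",") for value, name in zip(row, columns))

# Drop the trailing comma and split back into a list, mirroring the
# .str[:-1].str.split(',') step of the one-liner.
labels = joined[:-1].split(",")
print(labels)  # ['glaucoma', 'others']
```

Two caveats follow from this: a row with no 1s would end up as [''] rather than an empty list, and a column name that itself contains a comma would be split into pieces.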

If your dataset is large and performance is a major concern, you can use a list comprehension, as follows:

df['labels'] = [df.columns[1:][x].tolist() for x in df.iloc[:, 1:].eq(1).values]

Result:

print(df)

           filename  diabetic retinopathy  glaucoma  others              labels
0  c24a1b14d253.jpg                     0         1       1  [glaucoma, others]
1  9ee905a41651.jpg                     0         0       1            [others]
2  3f58d128caf6.jpg                     0         1       0          [glaucoma]
3  4ce6599e7b20.jpg                     0         0       1            [others]
4  0def470360e4.jpg                     0         0       1            [others]

To then create the new df that has just the filename and labels columns, you can use:

new_df = df[['filename', 'labels']]

Result:

print(new_df)

           filename              labels
0  c24a1b14d253.jpg  [glaucoma, others]
1  9ee905a41651.jpg            [others]
2  3f58d128caf6.jpg          [glaucoma]
3  4ce6599e7b20.jpg            [others]
4  0def470360e4.jpg            [others]