I have the DataFrame below. The alignment looks off in this post, but the columns are ['filename', 'diabetic retinopathy', 'glaucoma', 'others'].
filename diabetic retinopathy glaucoma others
0 c24a1b14d253.jpg 0 1 1
1 9ee905a41651.jpg 0 0 1
2 3f58d128caf6.jpg 0 1 0
3 4ce6599e7b20.jpg 0 0 1
4 0def470360e4.jpg 0 0 1
I would like to create a new column called 'labels' whose values are lists of strings. For example, the entry for the first image file (row 0) would be: ['glaucoma', 'others'].
Ultimately I'd like to have a new df that's just 'filename' and 'labels'. I'm new to pandas, so any help would be appreciated!
CodePudding user response:
Try with apply:
df["labels"] = df.apply(lambda row: df.columns[row.eq(1)].tolist(), axis=1)
>>> df[["filename", "labels"]]
filename labels
0 c24a1b14d253.jpg [glaucoma, others]
1 9ee905a41651.jpg [others]
2 3f58d128caf6.jpg [glaucoma]
3 4ce6599e7b20.jpg [others]
4 0def470360e4.jpg [others]
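For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same idea. The name label_cols is my own and not part of the answer above. I also restrict the comparison to the 0/1 columns, on the assumption that you don't want the filename strings compared against 1.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "filename": ["c24a1b14d253.jpg", "9ee905a41651.jpg", "3f58d128caf6.jpg",
                 "4ce6599e7b20.jpg", "0def470360e4.jpg"],
    "diabetic retinopathy": [0, 0, 0, 0, 0],
    "glaucoma": [1, 0, 1, 0, 0],
    "others": [1, 1, 0, 1, 1],
})

# Hypothetical helper name: every column except 'filename'.
label_cols = df.columns.drop("filename")

# For each row, keep the names of the columns whose value is 1.
df["labels"] = df[label_cols].apply(
    lambda row: label_cols[row.eq(1).to_numpy()].tolist(), axis=1
)

new_df = df[["filename", "labels"]]
print(new_df)
```

Computing label_cols before adding the new column also means the snippet can be re-run without the 'labels' column itself being treated as a label.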
CodePudding user response:
Here is a vectorized solution using melt, loc, and groupby + agg:
(df.melt(id_vars='filename', var_name='labels')
   .loc[lambda d: d['value'].eq(1)]
   .groupby('filename')
   .agg({'labels': ','.join})  # use list instead of ','.join to get lists
   .reset_index()
)
Although this solution appears to involve more computation than the apply method, it is actually much faster (more than 100 times faster on 10k rows: 1.17 s ± 75 ms for apply vs 8.51 ms ± 370 µs for the vectorized pipeline).
output:
filename labels
0 0def470360e4.jpg others
1 3f58d128caf6.jpg glaucoma
2 4ce6599e7b20.jpg others
3 9ee905a41651.jpg others
4 c24a1b14d253.jpg glaucoma,others
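Note that the output above is sorted by filename, because groupby sorts its keys, and any file with no positive label would be filtered out before grouping. If you want real lists and the original row order, one possible sketch is to aggregate with list and map the result back onto the original filenames. The variable names and the empty-list fallback are my own assumptions, not part of the pipeline above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "filename": ["c24a1b14d253.jpg", "9ee905a41651.jpg", "3f58d128caf6.jpg",
                 "4ce6599e7b20.jpg", "0def470360e4.jpg"],
    "diabetic retinopathy": [0, 0, 0, 0, 0],
    "glaucoma": [1, 0, 1, 0, 0],
    "others": [1, 1, 0, 1, 1],
})

# Series indexed by filename, holding one list of label names per file.
labels_by_file = (
    df.melt(id_vars="filename", var_name="labels")
      .loc[lambda d: d["value"].eq(1)]
      .groupby("filename")["labels"]
      .agg(list)
)

# Map the lists back onto the original filenames to keep the input order.
# Files that had no positive label come back as NaN, so swap in an empty list.
mapped = df["filename"].map(labels_by_file)
new_df = df[["filename"]].assign(
    labels=mapped.apply(lambda v: v if isinstance(v, list) else [])
)
print(new_df)
```

This assumes the filenames are unique; with duplicates, the groupby would merge their labels into one list.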
CodePudding user response:
You can use the pandas matrix multiplication function .dot() on the columns from the second one onwards to create the new column labels, as follows:
df['labels'] = df.iloc[:, 1:].dot(df.columns[1:] + ',').str[:-1].str.split(',')
If your dataset is large and performance is a major concern, you can use a list comprehension instead, as follows:
df['labels'] = [df.columns[1:][x].tolist() for x in df.iloc[:, 1:].eq(1).values]
Result:
print(df)
filename diabetic retinopathy glaucoma others labels
0 c24a1b14d253.jpg 0 1 1 [glaucoma, others]
1 9ee905a41651.jpg 0 0 1 [others]
2 3f58d128caf6.jpg 0 1 0 [glaucoma]
3 4ce6599e7b20.jpg 0 0 1 [others]
4 0def470360e4.jpg 0 0 1 [others]
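To make the list comprehension easier to follow, here is a self-contained sketch on the sample data that splits it into two steps: build a boolean mask, then use each mask row to pick column names. The names label_cols and mask are mine, and I'm assuming 'filename' is always the first column.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "filename": ["c24a1b14d253.jpg", "9ee905a41651.jpg", "3f58d128caf6.jpg",
                 "4ce6599e7b20.jpg", "0def470360e4.jpg"],
    "diabetic retinopathy": [0, 0, 0, 0, 0],
    "glaucoma": [1, 0, 1, 0, 0],
    "others": [1, 1, 0, 1, 1],
})

# Every column after 'filename' is treated as a 0/1 label column.
label_cols = df.columns[1:]

# 2-D boolean array: one row per file, one column per label.
mask = df[label_cols].eq(1).to_numpy()

# Each boolean row selects the matching column names.
df["labels"] = [label_cols[row].tolist() for row in mask]
print(df)
```

A row with no positive labels simply gets an empty list here.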
To then create the new df that has just the filename and labels columns, you can use:
new_df = df[['filename', 'labels']]
Result:
print(new_df)
filename labels
0 c24a1b14d253.jpg [glaucoma, others]
1 9ee905a41651.jpg [others]
2 3f58d128caf6.jpg [glaucoma]
3 4ce6599e7b20.jpg [others]
4 0def470360e4.jpg [others]