I have seen this question and I cannot understand what it means. It has something to do with flattening the multiple indexes into a single index after pivoting the dataframe.
I am currently working on image processing and I have a dataframe that has duplicate rows; however, each duplicate row has a different value for the 'subject'.
The Goal:
I want to remove these duplicate rows even though they have different values, and join all the different values from the duplicates into a single column, i.e. Aortic enlargement|Pulmonary fibrosis|Atelectasis.
The Question:
This is essentially a multipart question.
How could I achieve the goal stated above?
Could somebody explain the question I have linked in layman's terms, so that I can fully understand it?
Extra Information:
I have a mock csv file that you can access if you need to fully understand what I mean.
CodePudding user response:
Consider this dataframe as MRE:
>>> df
image_id class_name
0 47ed17dcb2cbeec15182ed335a8b5a9e Nodule/Mass # <- dup 1
1 47ed17dcb2cbeec15182ed335a8b5a9e Aortic enlargement # <- dup 1
2 47ed17dcb2cbeec15182ed335a8b5a9e Pulmonary fibrosis # <- dup 1
3 7c1add6833d5f0102b0d3619a1682a64 Lung Opacity # <- dup 2
4 7c1add6833d5f0102b0d3619a1682a64 Pulmonary fibrosis # <- dup 2
5 5550a493b1c4554da469a072fdfab974 No finding # <- dup 3
6 5550a493b1c4554da469a072fdfab974 No finding # <- dup 3
To get the expected outcome, group the rows by image_id and join all the class_name values of each group, separated by ' | ':
>>> df.groupby('image_id')['class_name'].apply(lambda x: ' | '.join(sorted(set(x))))
image_id
47ed17dcb2cbeec15182ed335a8b5a9e Aortic enlargement | Nodule/Mass | Pulmonary f...
5550a493b1c4554da469a072fdfab974 No finding
7c1add6833d5f0102b0d3619a1682a64 Lung Opacity | Pulmonary fibrosis
Use set to remove duplicate class_name values within the same image_id, and sorted to put the class_name values in lexicographical order.
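If you want to run this end to end, here is a minimal self-contained sketch that rebuilds the small example frame above and returns a regular two-column dataframe rather than a Series. The helper name join_unique is my own, and using agg followed by reset_index instead of apply is just one option, not the only way to do it:

```python
import pandas as pd

# Rebuild the small example frame shown above.
df = pd.DataFrame({
    "image_id": [
        "47ed17dcb2cbeec15182ed335a8b5a9e",
        "47ed17dcb2cbeec15182ed335a8b5a9e",
        "47ed17dcb2cbeec15182ed335a8b5a9e",
        "7c1add6833d5f0102b0d3619a1682a64",
        "7c1add6833d5f0102b0d3619a1682a64",
        "5550a493b1c4554da469a072fdfab974",
        "5550a493b1c4554da469a072fdfab974",
    ],
    "class_name": [
        "Nodule/Mass",
        "Aortic enlargement",
        "Pulmonary fibrosis",
        "Lung Opacity",
        "Pulmonary fibrosis",
        "No finding",
        "No finding",
    ],
})


def join_unique(values):
    """Drop repeated labels, sort them, and join them with ' | '."""
    return " | ".join(sorted(set(values)))


# One row per image_id; reset_index turns image_id back into a column.
result = (
    df.groupby("image_id")["class_name"]
      .agg(join_unique)
      .reset_index()
)
print(result)
```

If you prefer the exact separator from the question, replace " | " with "|" in join_unique.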
Update
You can use a MultiIndex to display your duplicated rows correctly. Try:
>>> df.set_index(['image_id', 'class_name']).sort_index()
class_id rad_id x_min y_min x_max y_max width height
image_id class_name
000434271f63a053c4128a0ba6352c7f No finding 14 R6 NaN NaN NaN NaN 2336 2836
No finding 14 R2 NaN NaN NaN NaN 2336 2836
No finding 14 R3 NaN NaN NaN NaN 2336 2836
00053190460d56c53cc3e57321387478 No finding 14 R11 NaN NaN NaN NaN 1994 2430
No finding 14 R2 NaN NaN NaN NaN 1994 2430
... ... ... ... ... ... ... ... ...
fff0f82159f9083f3dd1f8967fc54f6a No finding 14 R9 NaN NaN NaN NaN 2048 2500
No finding 14 R14 NaN NaN NaN NaN 2048 2500
fff2025e3c1d6970a8a6ee0404ac6940 No finding 14 R1 NaN NaN NaN NaN 1994 2150
No finding 14 R5 NaN NaN NaN NaN 1994 2150
No finding 14 R2 NaN NaN NaN NaN 1994 2150
[67914 rows x 8 columns]
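In case it helps with the "flattening" part of the question, here is a rough sketch on a tiny frame. It only reuses a few image_id and rad_id values from the output above and leaves out the other columns, so treat it as an illustration rather than your real data. set_index with two columns builds the two-level MultiIndex, and reset_index turns those levels back into ordinary columns, which is what I understand "flattening" to mean here:

```python
import pandas as pd

# A tiny frame using a few values from the output above (other columns omitted).
df = pd.DataFrame({
    "image_id": [
        "00053190460d56c53cc3e57321387478",
        "000434271f63a053c4128a0ba6352c7f",
        "000434271f63a053c4128a0ba6352c7f",
        "00053190460d56c53cc3e57321387478",
        "000434271f63a053c4128a0ba6352c7f",
    ],
    "class_name": ["No finding"] * 5,
    "rad_id": ["R11", "R6", "R2", "R2", "R3"],
})

# Two columns become a two-level (hierarchical) index; sorting groups
# the rows that share the same image_id next to each other.
indexed = df.set_index(["image_id", "class_name"]).sort_index()
print(indexed)

# "Flattening": move the index levels back into ordinary columns.
flat = indexed.reset_index()
print(flat)
```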