Pandas: How can I remove duplicates with single column different values, while retaining said differ-

I have seen this question, but I cannot understand what it means. It has something to do with flattening a MultiIndex into a single index after pivoting a dataframe.

I am currently working on image processing and I have a dataframe that contains duplicate rows; however, each duplicate row has a different value in the 'subject' column.

The Goal:

I want to remove these duplicate rows even though they have different values, and join the distinct values from all the duplicates into a single column, e.g. Aortic enlargement|Pulmonary fibrosis|Atelectasis.

The Question:

This is essentially a multipart question.

How could I achieve the goal stated above?
Could somebody explain, in layman's terms, the question I have linked to, so that I can fully understand it?

Extra Information:

I have a mock CSV file that you can access if you need it to fully understand what I mean.

CodePudding user response:

Consider this dataframe as a minimal reproducible example (MRE):

>>> df
                           image_id          class_name
0  47ed17dcb2cbeec15182ed335a8b5a9e         Nodule/Mass  # <- dup 1
1  47ed17dcb2cbeec15182ed335a8b5a9e  Aortic enlargement  # <- dup 1
2  47ed17dcb2cbeec15182ed335a8b5a9e  Pulmonary fibrosis  # <- dup 1
3  7c1add6833d5f0102b0d3619a1682a64        Lung Opacity  # <- dup 2
4  7c1add6833d5f0102b0d3619a1682a64  Pulmonary fibrosis  # <- dup 2
5  5550a493b1c4554da469a072fdfab974          No finding  # <- dup 3
6  5550a493b1c4554da469a072fdfab974          No finding  # <- dup 3

To get the expected outcome, group the rows by image_id and join all the class_name values of each group into one string, separated by ' | ':

>>> df.groupby('image_id')['class_name'].apply(lambda x: ' | '.join(sorted(set(x))))

image_id
47ed17dcb2cbeec15182ed335a8b5a9e    Aortic enlargement | Nodule/Mass | Pulmonary f...
5550a493b1c4554da469a072fdfab974                                           No finding
7c1add6833d5f0102b0d3619a1682a64                    Lung Opacity | Pulmonary fibrosis

set removes duplicate class_name values within the same image_id, and sorted puts the remaining class_name values in lexicographical order.
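For anyone who wants to run this end to end, here is a minimal, self-contained sketch that rebuilds the MRE above and applies the same groupby. The helper name join_unique and the variable merged are only illustrative, and the final reset_index() is an optional extra step (not part of the one-liner above) that turns the resulting Series back into a two-column dataframe.

```python
import pandas as pd

# Rebuild the MRE shown above.
df = pd.DataFrame({
    "image_id": [
        "47ed17dcb2cbeec15182ed335a8b5a9e",
        "47ed17dcb2cbeec15182ed335a8b5a9e",
        "47ed17dcb2cbeec15182ed335a8b5a9e",
        "7c1add6833d5f0102b0d3619a1682a64",
        "7c1add6833d5f0102b0d3619a1682a64",
        "5550a493b1c4554da469a072fdfab974",
        "5550a493b1c4554da469a072fdfab974",
    ],
    "class_name": [
        "Nodule/Mass",
        "Aortic enlargement",
        "Pulmonary fibrosis",
        "Lung Opacity",
        "Pulmonary fibrosis",
        "No finding",
        "No finding",
    ],
})


def join_unique(values):
    """Drop repeated labels, sort them, and join them with ' | '."""
    return " | ".join(sorted(set(values)))


# One row per image_id, with all of its distinct class_name values joined.
merged = df.groupby("image_id")["class_name"].agg(join_unique).reset_index()
print(merged.to_string(index=False))
```

If the original row order of the labels matters more than alphabetical order, replacing sorted(set(values)) with dict.fromkeys(values) should keep the first occurrence of each label in place.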

Update

You can use a MultiIndex to display your duplicated rows correctly. Try:

>>> df.set_index(['image_id', 'class_name']).sort_index()

                                             class_id rad_id  x_min  y_min  x_max  y_max  width  height
image_id                         class_name
000434271f63a053c4128a0ba6352c7f No finding        14     R6    NaN    NaN    NaN    NaN   2336    2836
                                 No finding        14     R2    NaN    NaN    NaN    NaN   2336    2836
                                 No finding        14     R3    NaN    NaN    NaN    NaN   2336    2836
00053190460d56c53cc3e57321387478 No finding        14    R11    NaN    NaN    NaN    NaN   1994    2430
                                 No finding        14     R2    NaN    NaN    NaN    NaN   1994    2430
...                                               ...    ...    ...    ...    ...    ...    ...     ...
fff0f82159f9083f3dd1f8967fc54f6a No finding        14     R9    NaN    NaN    NaN    NaN   2048    2500
                                 No finding        14    R14    NaN    NaN    NaN    NaN   2048    2500
fff2025e3c1d6970a8a6ee0404ac6940 No finding        14     R1    NaN    NaN    NaN    NaN   1994    2150
                                 No finding        14     R5    NaN    NaN    NaN    NaN   1994    2150
                                 No finding        14     R2    NaN    NaN    NaN    NaN   1994    2150

[67914 rows x 8 columns]
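Since the question also asks about flattening a MultiIndex, here is a small sketch of the round trip on a toy frame. It assumes only three columns (image_id, class_name, rad_id); the rad_id values below are made up for illustration and are not taken from the real data. set_index builds the two-level index shown above, and reset_index moves both levels back into ordinary columns, which is usually what "flattening" a row MultiIndex amounts to.

```python
import pandas as pd

# Toy frame; the rad_id values are hypothetical placeholders.
df = pd.DataFrame({
    "image_id": [
        "7c1add6833d5f0102b0d3619a1682a64",
        "47ed17dcb2cbeec15182ed335a8b5a9e",
        "47ed17dcb2cbeec15182ed335a8b5a9e",
        "7c1add6833d5f0102b0d3619a1682a64",
    ],
    "class_name": [
        "Pulmonary fibrosis",
        "Nodule/Mass",
        "Aortic enlargement",
        "Lung Opacity",
    ],
    "rad_id": ["R1", "R2", "R3", "R4"],
})

# Two-level (image_id, class_name) index, sorted so that rows for the
# same image sit next to each other.
indexed = df.set_index(["image_id", "class_name"]).sort_index()
print(indexed)

# Flatten: both index levels become regular columns again.
flat = indexed.reset_index()
print(flat)
```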