How to remove the duplicate string completely from the list using pandas python?-CodePudding

I have a DataFrame as below:

index text_column

   0   ,(Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
   1   ,Info_concern,Info_concern
   2   ,color_Concern,color_Concern,no_category
   3   ,reg_Concern,reg_Concern

I am trying to remove duplicates including the source value completely within each row.

I tried this:

df['result'] = [set(x) for x in df['text_column']]

This gives me a list of values without duplicates but with source value, I need the source value to be removed as well.

Desired output:

     result
    
  0  (concern_code),(concern_color)
  1  
  2  no_category
  3

Any suggestions or advice ?

CodePudding user response：

Version 1: Removing duplicates across all rows:

You can use .drop_duplicates() with parameter keep=False after splitting and expanding the substrings by .str.split() and .explode().

Then, regroup the entries into their original rows by .groupby() on the row index (level 0). Finally, aggregate and join back the substrings of the original same row with .agg() and ','.join

df['result'] = (df['text_column'].str.split(',')
                                 .explode()
                                 .drop_duplicates(keep=False)
                                 .groupby(level=0).agg(','.join)
               )

.drop_duplicates() with parameter keep=False ensures to remove duplicates including the source value.

Alternatively, you can also do it with .stack() in place of .explode(), as follows:

df['result'] = (df['text_column'].str.split(',', expand=True)
                                 .stack()
                                 .drop_duplicates(keep=False)
                                 .groupby(level=0).agg(','.join)
               )

Data Input:

(Added extra test cases from the sample data in question:)

                                                      text_column
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1                                       Info_concern,Info_concern
2                         color_Concern,color_Concern,no_category
3                                         reg_Concern,reg_Concern
4                                                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL

Result:

print(df)

                                                      text_column                          result
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)  (concern_code),(concern_color)
1                                       Info_concern,Info_concern                             NaN
2                         color_Concern,color_Concern,no_category                     no_category
3                                         reg_Concern,reg_Concern                             NaN
4                                                    ABCDEFGHIJKL                             NaN
5                                                    ABCDEFGHIJKL                             NaN

Note the last 2 rows with same strings are removed as duplicates even when they are in different rows.

Version 2: Removing duplicates within the same row only:

If the scope of removing duplicates is limited to only within the same row rather than across all rows, we can achieve this by the following code variation:

df['result'] = (df['text_column'].str.split(',', expand=True)
                                 .stack()
                                 .groupby(level=0)
                                 .agg(lambda x: ','.join(x.drop_duplicates(keep=False)))
               )

Data Input:

(Added extra test cases from the sample data in question:)

                                                      text_column
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1                                       Info_concern,Info_concern
2                         color_Concern,color_Concern,no_category
3                                         reg_Concern,reg_Concern
4                                                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL

Output:

print(df)

                                                      text_column                          result
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)  (concern_code),(concern_color)
1                                       Info_concern,Info_concern                                
2                         color_Concern,color_Concern,no_category                     no_category
3                                         reg_Concern,reg_Concern                                
4                                                    ABCDEFGHIJKL                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL                    ABCDEFGHIJKL

Note the last 2 rows with same strings are kept since they are in different rows.