I have a DataFrame as below:
index text_column
0 ,(Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1 ,Info_concern,Info_concern
2 ,color_Concern,color_Concern,no_category
3 ,reg_Concern,reg_Concern
Within each row, I am trying to remove duplicated values completely, including the original occurrence.
I tried this:
df['result'] = [set(x) for x in df['text_column']]
This gives me values without duplicates, but the original occurrence of each duplicated value is still kept; I need that removed as well.
Desired output:
result
0 (concern_code),(concern_color)
1
2 no_category
3
Any suggestions or advice?
CodePudding user response:
Version 1: Removing duplicates across all rows:
You can use .drop_duplicates() with parameter keep=False after splitting and expanding the substrings with .str.split() and .explode(). Then, regroup the entries into their original rows with .groupby() on the row index (level 0). Finally, aggregate and join the substrings of each original row back together with .agg() and ','.join:
df['result'] = (df['text_column'].str.split(',')
.explode()
.drop_duplicates(keep=False)
.groupby(level=0).agg(','.join)
)
.drop_duplicates() with parameter keep=False ensures that all occurrences of a duplicated value are removed, including the first (source) occurrence.
Alternatively, you can use .stack() in place of .explode(), as follows:
df['result'] = (df['text_column'].str.split(',', expand=True)
.stack()
.drop_duplicates(keep=False)
.groupby(level=0).agg(','.join)
)
Data Input:
(Added extra test cases from the sample data in question:)
text_column
0 (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1 Info_concern,Info_concern
2 color_Concern,color_Concern,no_category
3 reg_Concern,reg_Concern
4 ABCDEFGHIJKL
5 ABCDEFGHIJKL
Result:
print(df)
text_column result
0 (Unable_to_see),(concern_code),(concern_color),(Unable_to_see) (concern_code),(concern_color)
1 Info_concern,Info_concern NaN
2 color_Concern,color_Concern,no_category no_category
3 reg_Concern,reg_Concern NaN
4 ABCDEFGHIJKL NaN
5 ABCDEFGHIJKL NaN
Note that the last 2 rows, which contain the same string, are removed as duplicates even though they are in different rows.
Version 2: Removing duplicates within the same row only:
If the scope of duplicate removal is limited to within the same row only, rather than across all rows, we can achieve this with the following variation:
df['result'] = (df['text_column'].str.split(',', expand=True)
.stack()
.groupby(level=0)
.agg(lambda x: ','.join(x.drop_duplicates(keep=False)))
)
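As a self-contained sketch, Version 2 can be run end-to-end on the question's sample data (assuming the leading commas shown in the question's display are stripped beforehand):

```python
import pandas as pd

df = pd.DataFrame({"text_column": [
    "(Unable_to_see),(concern_code),(concern_color),(Unable_to_see)",
    "Info_concern,Info_concern",
    "color_Concern,color_Concern,no_category",
    "reg_Concern,reg_Concern",
]})

# Split each row into substrings, stack into a long Series,
# then drop duplicates per row (level 0) with keep=False.
df["result"] = (df["text_column"].str.split(",", expand=True)
                .stack()
                .groupby(level=0)
                .agg(lambda x: ",".join(x.drop_duplicates(keep=False))))

print(df["result"].tolist())
# ['(concern_code),(concern_color)', '', 'no_category', '']
```

Rows whose values were all duplicated end up as empty strings here (rather than NaN), because the lambda joins an empty sequence.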
Data Input: (same as for Version 1 above)
Output:
print(df)
text_column result
0 (Unable_to_see),(concern_code),(concern_color),(Unable_to_see) (concern_code),(concern_color)
1 Info_concern,Info_concern
2 color_Concern,color_Concern,no_category no_category
3 reg_Concern,reg_Concern
4 ABCDEFGHIJKL ABCDEFGHIJKL
5 ABCDEFGHIJKL ABCDEFGHIJKL
Note that the last 2 rows, which contain the same string, are kept since they are in different rows.
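For completeness, the within-row version can also be done without any pandas reshaping, using collections.Counter per row. This is a plain-Python alternative sketch (the helper name unique_only is my own, not from the question), which also preserves the original substring order:

```python
from collections import Counter

def unique_only(text: str) -> str:
    # Strip any leading/trailing commas, split into substrings,
    # and keep only those that appear exactly once in this row.
    parts = text.strip(",").split(",")
    counts = Counter(parts)
    return ",".join(p for p in parts if counts[p] == 1)

print(unique_only("color_Concern,color_Concern,no_category"))  # no_category
```

Applied with df['result'] = df['text_column'].apply(unique_only), this gives the same within-row result as Version 2.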