Is there a simple way to remove duplicate values in certain cells of a dataframe column?

Time:03-20

I have a dataframe column with city locations, and some of the cells contain the same value (city) twice. I was wondering how to get rid of one of the values, e.g. instead of a cell saying Dublin Dublin, it should say Dublin only once.

I have tried df['city'].apply(set) but it doesn't give me what I am looking for.

Any advice would be much appreciated.

CodePudding user response:

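Calling set directly on a string gives you its unique characters in arbitrary order, which is why df['city'].apply(set) doesn't help. Instead, you can split each item on whitespace, drop the duplicate words from each resulting list (drop_duplicates keeps the first occurrence of each word, so the original order is preserved, unlike with a set), and then re-join: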

df['city'] = df['city'].str.split().apply(lambda x: pd.Series(x).drop_duplicates().tolist()).str.join(' ')

Output:

>>> df
             city
0  Los Angeles CA
1            none
2          London
3          Dublin
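
If building a pd.Series for every row feels heavy, a plain-Python alternative is sketched below. It is only a sketch under assumptions: the sample data is made up (the original data was only posted as an image), and dedupe_words is a hypothetical helper name, not part of pandas. It relies on dict.fromkeys, which keeps the first occurrence of each key in insertion order.

```python
import pandas as pd

# Hypothetical sample data; the real column was only shown as an image.
df = pd.DataFrame({"city": ["Los Angeles CA", "none", "London London", "Dublin Dublin"]})


def dedupe_words(cell):
    """Drop repeated words from a string, keeping first occurrences in order."""
    if not isinstance(cell, str):
        return cell  # leave NaN/None and other non-string values untouched
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return " ".join(dict.fromkeys(cell.split()))


df["city"] = df["city"].apply(dedupe_words)
print(df)
```

Note that both approaches remove every repeated word in a cell, not just adjacent repeats, so a value like Paris Texas Paris would become Paris Texas.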