How to remove duplicate string from each row in a column

Time:02-03

I have a column that contains a bunch of 4 digit numbers separated by commas. Some contain duplicate sets of 4 digit numbers. For example, 1 row looks like this:

1400, 1400, 1400, 1455, 1455, 1455, 1670, 1670, 1670

I am trying to change that to this:

1400, 1455, 1670

I want to apply that to all rows within a column. I was able to get this from another question here.

df['ID'] = df['ID'].apply(lambda x: list(set(x)))

However, this just filters each string down to its unique characters, like this:

1,4,0,5,6,7

How do I adjust the code to make this work?
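The failure mode here is that a string is itself an iterable of characters, so `set(x)` collects single characters rather than the 4-digit numbers. A minimal sketch (using a hypothetical one-row `ID` column) that reproduces the behavior:

```python
import pandas as pd

# One sample row shaped like the question's data
df = pd.DataFrame({"ID": ["1400, 1400, 1455"]})

# set(x) iterates over the string character by character,
# so the result is a set of characters, not of numbers
result = df["ID"].apply(lambda x: set(x))
print(result[0])  # unique characters such as '1', '4', '0', not '1400'
```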

CodePudding user response:

To keep the order of 'ID', you can use np.unique after extracting the numbers from the string (note the raw-string regex, which avoids an invalid escape-sequence warning):

import numpy as np

df['ID'] = df['ID'].str.findall(r'\d{4}').map(np.unique).str.join(', ')
print(df)

# Output
0    1400, 1455, 1670
Name: ID, dtype: object
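One caveat worth noting: np.unique returns its result sorted, so it only appears to "keep order" here because the sample row is already in ascending order. If first-appearance order matters for unsorted data, pd.unique is a drop-in alternative. A quick sketch:

```python
import numpy as np
import pandas as pd

# Unsorted input makes the difference visible
vals = ['1670', '1400', '1670', '1455']

print(np.unique(vals))   # ['1400' '1455' '1670'] -- sorted
print(pd.unique(vals))   # ['1670' '1400' '1455'] -- first-appearance order
```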

CodePudding user response:

One option is to use map with join. But before that, make sure to split the string:

df["ID"] = df["ID"].str.split(r"\s*,\s*").map(set).str.join(", ")

Alternatively, you can modify your code by adding an str.split right before you call apply:

df["ID"] = df["ID"].str.split(r"\s*,\s*").apply(lambda x: ", ".join(set(x)))

NB: Neither approach guarantees/keeps the order of the numbers. So if the order is important, @Corralien's answer is what you're looking for.

Output:

print(df)
                 ID
0  1455, 1400, 1670
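If you want the split-based approach but with order preserved, one sketch is to replace set with dict.fromkeys, which deduplicates while keeping insertion order (guaranteed since Python 3.7):

```python
import pandas as pd

# Unsorted sample row to show that first-appearance order survives
df = pd.DataFrame({"ID": ["1670, 1400, 1400, 1455, 1670"]})

# dict.fromkeys drops duplicates while preserving insertion order
df["ID"] = (
    df["ID"]
    .str.split(r"\s*,\s*")
    .apply(lambda x: ", ".join(dict.fromkeys(x)))
)
print(df)  # the ID column is now "1670, 1400, 1455"
```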