I have a column that contains a bunch of 4 digit numbers separated by commas. Some contain duplicate sets of 4 digit numbers. For example, 1 row looks like this:
1400, 1400, 1400, 1455, 1455, 1455, 1670, 1670, 1670
I am trying to change that to this:
1400, 1455, 1670
I want to apply that to all rows within a column. I was able to get this from another question here.
df['ID'] = df['ID'].apply(lambda x: list(set(x)))
However, this is just filtering it down to unique single digits, like this:
1,4,0,5,6,7
How do I adjust the code to make this work?
CodePudding user response:
To keep order of 'ID', you can use np.unique
after extracting numbers from string:
df['ID'] = df['ID'].str.findall('\d{4}').map(np.unique).str.join(', ')
print(df)
# Output
0 1400, 1455, 1670
Name: ID, dtype: object
CodePudding user response:
One option is to use map
with join
. But before that, make sure to split
:
df["ID"] = df["ID"].str.split("\s*,\s*").map(set).str.join(", ")
You can modify you code by making an str.split
right before you call apply
:
df["ID"] = df["ID"].str.split("\s*,\s*").apply(lambda x : ", ".join(set(x)))
NB : Both approaches does not garantee/keep the order of numbers. So if the order is important, @Corralien's answer is what you're looking for.
Output :
print(df)
ID
0 1455, 1400, 1670