find duplicates in a string in data frame

Time:10-05

I have a sample df

id        name
 1        John Walter walter
 2        Adam Smith Smith
 3        Steve Rogers rogers

How can I check, for every row, whether a word in name is repeated (ignoring case), and pop the repeated word out? Expected result:

id        name                    is_duplicated    popped_out_string    corrected_name
 1        John Walter walter             1               walter             John Walter
 2        Adam Smith Smith               1               Smith              Adam Smith
 3        Steve Rogers rogers            1               rogers             Steve Rogers

CodePudding user response:

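# Split each name into words and keep only the first occurrence of each word (case-insensitive).
df['dup'] = df['name'].apply(lambda s: ' '.join(pd.Series(s.split()).loc[lambda w: ~w.str.lower().duplicated()]))

Assuming the data frame is named df and pandas is imported as pd, this splits each name into words and drops every word that already appeared earlier in the same row, ignoring case, so only the first occurrence is kept.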

CodePudding user response:

One way is to use more_itertools.unique_everseen, which keeps the first occurrence of each element and preserves order:

from more_itertools import unique_everseen

def unique(arr, key):
    return " ".join(unique_everseen(arr, key=key))

df["name"].str.split().apply(unique, key=str.lower)

Output:

0     John Walter
1      Adam Smith
2    Steve Rogers
Name: name, dtype: object

If you'd rather not depend on more_itertools, you can copy the unique_everseen recipe from the itertools documentation and use it the same way:

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
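
To see why key=str.lower matters for the sample names, here is a minimal pure-Python sketch of the same "seen set" idea. The helper name unique_words is made up for illustration and is not part of itertools or more_itertools.

```python
def unique_words(text, key=None):
    """Join the words of `text`, skipping words whose key was already seen.

    `unique_words` is a hypothetical helper that mirrors the recipe's logic;
    with key=None the words themselves are compared (case-sensitive).
    """
    if key is None:
        key = lambda word: word  # identity: compare words exactly as written
    seen = set()
    kept = []
    for word in text.split():
        k = key(word)
        if k not in seen:
            seen.add(k)
            kept.append(word)  # first occurrence keeps its original spelling
    return " ".join(kept)


# Without a key, "Walter" and "walter" count as different words.
print(unique_words("John Walter walter"))                 # John Walter walter
# With key=str.lower they collapse onto the first spelling.
print(unique_words("John Walter walter", key=str.lower))  # John Walter
```

Without the key, only exact repeats such as "Smith Smith" would be removed, which is why the answer passes key=str.lower.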

CodePudding user response:

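Another way is to use a set to de-duplicate and a collections.Counter to collect the duplicated values. A set does not preserve order, so the words in corrected_name can come back in a different order than in name:

from collections import Counter
# set() drops repeated words (but not their order); Counter keeps the words seen more than once.
df['corrected_name'] = df['name'].str.split().apply(lambda x: ' '.join(set(map(str.lower, x)))).str.title()
df['popped_out_string'] = df['name'].str.split().apply(lambda x: ' '.join(k for k, v in Counter(map(str.lower, x)).items() if v > 1))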

Output:

   id                 name corrected_name popped_out_string
0   1   John Walter walter    John Walter            walter
1   2     Adam Smith Smith     Smith Adam             smith
2   3  Steve Rogers rogers   Rogers Steve            rogers
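
If word order matters and you also want the is_duplicated flag from the question, a single pass over the words can produce all three values at once. This is only a sketch: the function name split_duplicates is hypothetical, and it assumes "duplicate" means a case-insensitive repeat of an earlier word in the same name.

```python
def split_duplicates(name):
    """Return (is_duplicated, popped_out_string, corrected_name) for one name.

    Hypothetical helper: words are compared case-insensitively, the first
    occurrence of each word is kept and later repeats are popped out.
    """
    seen = set()
    kept, popped = [], []
    for word in name.split():
        key = word.lower()
        if key in seen:
            popped.append(word)  # repeat of an earlier word: pop it out
        else:
            seen.add(key)
            kept.append(word)    # first occurrence: keep it, in original order
    return int(bool(popped)), " ".join(popped), " ".join(kept)


for name in ["John Walter walter", "Adam Smith Smith", "Steve Rogers rogers"]:
    print(split_duplicates(name))
# (1, 'walter', 'John Walter')
# (1, 'Smith', 'Adam Smith')
# (1, 'rogers', 'Steve Rogers')
```

With pandas imported as pd, the results could then be attached to the frame with something like df.join(pd.DataFrame(df["name"].map(split_duplicates).tolist(), index=df.index, columns=["is_duplicated", "popped_out_string", "corrected_name"])).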