I have a sample df:
id  name
1   John Walter walter
2   Adam Smith Smith
3   Steve Rogers rogers
How can I find, for each row, whether a word in name is duplicated, and pop the duplicate out? The desired output is:
id  name                 is_duplicated  popped_out_string  corrected_name
1   John Walter walter   1              walter             John Walter
2   Adam Smith Smith     1              Smith              Adam Smith
3   Steve Rogers rogers  1              rogers             Steve Rogers
CodePudding user response:
df['dup'] = df.apply(lambda x: x.drop_duplicates().to_string(index=False), axis=1)
Assuming your data frame is named df, this drops duplicated values within each row and renders what remains as a single string.
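Note that drop_duplicates here compares whole cell values, case-sensitively, so it will not catch "Walter" vs "walter" inside the name string. A minimal word-level, case-insensitive sketch (the helper drop_repeated_words and the column corrected_name are just illustrative names):
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "name": ["John Walter walter", "Adam Smith Smith", "Steve Rogers rogers"]})

def drop_repeated_words(name):
    seen = set()
    kept = []
    for word in name.split():
        if word.lower() not in seen:   # compare case-insensitively
            seen.add(word.lower())
            kept.append(word)          # keep the first spelling encountered
    return " ".join(kept)

df["corrected_name"] = df["name"].apply(drop_repeated_words)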
CodePudding user response:
One way is to use more_itertools.unique_everseen:
from more_itertools import unique_everseen

def unique(arr, key):
    return " ".join(unique_everseen(arr, key=key))

df["name"].str.split().apply(unique, key=str.lower)
Output:
0 John Walter
1 Adam Smith
2 Steve Rogers
Name: name, dtype: object
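To write this back to the frame and also get the is_duplicated flag asked for in the question, a minimal sketch reusing unique from above:
df["corrected_name"] = df["name"].str.split().apply(unique, key=str.lower)
df["is_duplicated"] = (df["corrected_name"] != df["name"]).astype(int)  # 1 where a word was dropped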
If you don't want more_itertools, you can still use unique_everseen from the itertools recipes:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
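Since the recipe is a generator, it drops into the same pattern, e.g. (a sketch against the df from the question):
df["name"].str.split().apply(lambda words: " ".join(unique_everseen(words, key=str.lower)))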
CodePudding user response:
Another way is to use set to de-duplicate and a collections.Counter to get the duplicated values:
from collections import Counter

df['corrected_name'] = df['name'].str.split().apply(lambda x: ' '.join(set(map(str.lower, x)))).str.title()
df['popped_out_string'] = df['name'].str.split().apply(lambda x: ''.join(k for k, v in Counter(map(str.lower, x)).items() if v > 1))
Output:
   id                 name corrected_name popped_out_string
0   1   John Walter walter    John Walter            walter
1   2     Adam Smith Smith     Smith Adam             smith
2   3  Steve Rogers rogers   Rogers Steve            rogers
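Note that set does not preserve word order (hence "Smith Adam" above). If the original order matters, one option is to keep the first spelling of each lowercased word in an insertion-ordered dict; a sketch (the helper first_seen and the is_duplicated column are illustrative):
from collections import Counter

def first_seen(words):
    seen = {}
    for w in words:
        seen.setdefault(w.lower(), w)   # remember the first spelling of each word
    return " ".join(seen.values())      # dicts preserve insertion order

words = df["name"].str.split()
df["corrected_name"] = words.apply(first_seen)
df["popped_out_string"] = words.apply(
    lambda x: " ".join(k for k, v in Counter(map(str.lower, x)).items() if v > 1))
df["is_duplicated"] = (df["popped_out_string"] != "").astype(int)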