Duplicates in a cell str in data frame

Time:10-17

I have a sample df:

id  Name
1   JohnWalter walter
2   AdamSmith Smith
3   Steve Rogers rogers

How can I check, for every row, whether part of the name is duplicated, and pop the duplicate out into its own column? Expected output:

id  Name                 poped_out_string  corrected_name
1   JohnWalter walter    walter            John walter
2   AdamSmith Smith      Smith             Adam Smith
3   Steve Rogers rogers  rogers            Steve Rogers

CodePudding user response:

You could try something like below:

import re  # Used to split CamelCase words on capital letters

# Return the unique items (capitalized, in order) and the duplicated items
def uniqueItems(items):
    seen = set()
    uniq = []
    dups = []
    for x in items:
        if x in seen:
            # x was seen before, so record it once as a duplicate
            if x not in dups:
                dups.append(x)
        else:
            seen.add(x)
            uniq.append(x.capitalize())
    return {"unique": uniq, "duplicates": dups}

# Split a string into words, also breaking before each capital letter
def splitString(inputString):
    stringProcess = re.sub(r"([A-Z])", r" \1", inputString).split()
    # Convert all words to lower case for easy comparison
    convertToLower = [x.lower() for x in stringProcess]
    return uniqueItems(convertToLower)

# Iterate over rows in the data frame
for i, row in df.iterrows():
    split_result = splitString(row['Name'])
    if split_result["duplicates"]:  # If there are duplicates
        df.at[i, "poped_out_string"] = " ".join(split_result["duplicates"])
    if len(split_result["unique"]) > 1:
        df.at[i, "corrected_name"] = " ".join(split_result["unique"])

The general idea is to iterate over each row, split the string in the "Name" column into words, check for duplicates, and then write the duplicated words and the de-duplicated name back into the data frame.
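As a minimal, self-contained sketch of those steps on the sample data from the question: the helper name pop_duplicates is hypothetical, and the choice to report each repeated word in the casing it has in the cell is an assumption made to match the expected output in the question.

```python
import re
import pandas as pd

def pop_duplicates(name):
    """Return (repeated words, de-duplicated name) for one cell (hypothetical helper)."""
    # Break before each capital letter, then split on whitespace
    words = re.sub(r"([A-Z])", r" \1", name).split()
    seen, unique, dups = set(), [], []
    for word in words:
        key = word.lower()  # compare case-insensitively
        if key in seen:
            dups.append(word)  # keep the repeated word as written in the cell
        else:
            seen.add(key)
            unique.append(word.capitalize())
    return " ".join(dups), " ".join(unique)

# Sample data frame from the question
df = pd.DataFrame({
    "id": [1, 2, 3],
    "Name": ["JohnWalter walter", "AdamSmith Smith", "Steve Rogers rogers"],
})

results = [pop_duplicates(s) for s in df["Name"]]
df["poped_out_string"] = [popped for popped, _ in results]
df["corrected_name"] = [corrected for _, corrected in results]
print(df)
```

Building both columns from one list of results avoids row-by-row writes, which is usually simpler than assigning inside iterrows.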

CodePudding user response:

import re
import pandas as pd
df = pd.DataFrame(['JohnWalter walter brown', 'winter AdamSmith Smith', 'Steve Rogers rogers'], columns=['Name'])
df
    Name
0   JohnWalter walter brown
1   winter AdamSmith Smith
2   Steve Rogers rogers


def remove_dups(string):
  # First find the names: a lowercase or capital letter followed by any characters other than spaces and capital letters
  names = re.findall(r'[a-zA-Z][^A-Z ]*', string)
  # Collect the non-duplicates in a list (a set can't be used because it doesn't preserve the order of the names)
  new_names = []
  # Capitalize each name and append it if it has not already been added
  for name in names:
    if name.capitalize() not in new_names:
      new_names.append(name.capitalize())
  # Finally, construct the full name and return it
  return ' '.join(new_names)

df.Name.apply(remove_dups)

0    John Walter Brown
1    Winter Adam Smith
2         Steve Rogers
Name: Name, dtype: object
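If an order-preserving de-duplication without the explicit list is preferred, one possible variant relies on dict.fromkeys, which keeps the first occurrence of each key in insertion order (guaranteed for dicts since Python 3.7). This is only a sketch: the function name remove_dups_ordered and the corrected_name column are illustrative choices, not part of the answer above.

```python
import re
import pandas as pd

def remove_dups_ordered(string):
    # Same token pattern as above: a letter followed by non-capital, non-space characters
    names = re.findall(r'[a-zA-Z][^A-Z ]*', string)
    # dict.fromkeys drops later repeats while keeping first-seen order
    return ' '.join(dict.fromkeys(name.capitalize() for name in names))

df = pd.DataFrame(
    ['JohnWalter walter brown', 'winter AdamSmith Smith', 'Steve Rogers rogers'],
    columns=['Name'],
)
# Store the cleaned names in a new column instead of only displaying them
df['corrected_name'] = df['Name'].apply(remove_dups_ordered)
print(df)
```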