I have a sample df:
id | Name |
---|---|
1 | JohnWalter walter |
2 | AdamSmith Smith |
3 | Steve Rogers rogers |
How can I find whether a word is duplicated within each row and pop it out? Expected output:
id | Name | poped_out_string | corrected_name |
---|---|---|---|
1 | JohnWalter walter | walter | John walter |
2 | AdamSmith Smith | Smith | Adam Smith |
3 | Steve Rogers rogers | rogers | Steve Rogers |
CodePudding user response:
You could try something like the following:
import re  # Used to split CamelCase strings into separate words
# Get unique items from list, and duplicates
def uniqueItems(input):
    seen = set()
    uniq = []
    dups = []
    for x in input:
        if x in seen:
            # x has appeared before: record it once as a duplicate
            if x not in dups:
                dups.append(x)
        else:
            uniq.append(x.capitalize())
            seen.add(x)
    return {"unique": uniq, "duplicates": dups}
# Split our strings
def splitString(inputString):
    # Insert a space before each capital letter, then split on whitespace
    stringProcess = re.sub(r"([A-Z])", r" \1", inputString).split()
    # Convert all strings to lower case for easy comparison
    convertToLower = [x.lower() for x in stringProcess]
    # Always return the dict from uniqueItems so callers can index "unique" and "duplicates"
    return uniqueItems(convertToLower)
# Iterate over rows in data frame
for i, row in df.iterrows():
    split_result = splitString(row['Name'])
    if len(split_result["duplicates"]) > 0:  # If there are duplicates
        # Join the list into one string; the column name keeps the question's spelling
        df.at[i, "poped_out_string"] = " ".join(split_result["duplicates"])
    if len(split_result["unique"]) > 1:
        df.at[i, "corrected_name"] = " ".join(split_result["unique"])
The general idea is to iterate over each row, split the string in the "Name" column, check for duplicates, and then write those values into the data frame.
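The steps described above can be sketched end to end on the sample data from the question. This is only a minimal sketch, not the answer's exact code: the helper name `split_duplicates` is hypothetical, words are assumed to be compared case-insensitively, and each word keeps the casing in which it was written.

```python
import re
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3],
    "Name": ["JohnWalter walter", "AdamSmith Smith", "Steve Rogers rogers"],
})

def split_duplicates(name):
    """Return (duplicates, unique words) as two space-joined strings."""
    # Insert a space before each capital letter, then split on whitespace
    words = re.sub(r"([A-Z])", r" \1", name).split()
    seen, unique, dups = set(), [], []
    for word in words:
        key = word.lower()  # assumption: comparison ignores case
        if key in seen:
            dups.append(word)    # repeated word, kept as written
        else:
            seen.add(key)
            unique.append(word)  # first occurrence, kept as written
    return " ".join(dups), " ".join(unique)

results = [split_duplicates(name) for name in df["Name"]]
df["poped_out_string"] = [dups for dups, _ in results]
df["corrected_name"] = [unique for _, unique in results]
print(df)
```

Building both columns from a list of results avoids writing cell by cell with `df.at`, which tends to be slower on large frames.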
CodePudding user response:
import re
import pandas as pd
df = pd.DataFrame(['JohnWalter walter brown', 'winter AdamSmith Smith', 'Steve Rogers rogers'], columns=['Name'])
df
Name
0 JohnWalter walter brown
1 winter AdamSmith Smith
2 Steve Rogers rogers
def remove_dups(string):
    # first find names that start with a lowercase or capital letter, followed by any characters other than spaces and capital letters
    names = re.findall('[a-zA-Z][^A-Z ]*', string)
    # collect non-duplicates in a list (a set can't be used because it doesn't preserve the order of the names)
    new_names = []
    # capitalize each name and append it if it has not already been added
    for name in names:
        if name.capitalize() not in new_names:
            new_names.append(name.capitalize())
    # finally construct the full name and return it
    return ' '.join(new_names)
df.Name.apply(remove_dups)
0 John Walter Brown
1 Winter Adam Smith
2 Steve Rogers
Name: Name, dtype: object
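The order-preserving de-duplication in `remove_dups` could also be written with `dict.fromkeys`, since dicts keep insertion order (guaranteed from Python 3.7) and drop repeated keys. A minimal sketch, assuming the same regex as above; `remove_dups_ordered` is a hypothetical name used only for illustration.

```python
import re

def remove_dups_ordered(string):
    # Same tokenization as remove_dups: one letter, then anything except capitals and spaces
    names = re.findall(r"[a-zA-Z][^A-Z ]*", string)
    # dict.fromkeys keeps only the first occurrence of each capitalized name, in order
    return " ".join(dict.fromkeys(name.capitalize() for name in names))

print(remove_dups_ordered("JohnWalter walter brown"))
print(remove_dups_ordered("winter AdamSmith Smith"))
print(remove_dups_ordered("Steve Rogers rogers"))
```

It can be applied to the column in the same way, for example `df.Name.apply(remove_dups_ordered)`.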