Consider this sample data:
str_lst = ['abcdefg','abcdefghi']
I am trying to write a function that compares the two strings in this list and returns the difference, in this case 'hi'.
This attempt failed and simply returned both strings.
def difference(string1, string2):
    # Split both strings into list items
    string1 = string1.split()
    string2 = string2.split()
    A = set(string1)  # Store all string1 list items in set A
    B = set(string2)  # Store all string2 list items in set B
    str_diff = A.symmetric_difference(B)
    # isEmpty = (len(str_diff) == 0)
    return str_diff
There are several SO questions that appear to ask for this, but their answers simply return a list of the letters that differ between the two strings. In my case, the strings share many identical characters at the start, and I only want the characters near the end that differ between the two.
Any ideas on how to do this reliably? My exact situation is a list of very similar strings, say 10 of them: I want to compare the first item in the list against each of the others in turn and collect the differences (i.e. small substrings) in a list.
I appreciate you taking the time to check out my question.
A hypothetical example:
The strings in my dataset would all start with identical characters; think of directory paths:
sample_lst = ['c:/universal/bin/library/file_choice1.zip',
              'c:/universal/bin/library/file_zebra1.doc',
              'c:/universal/bin/library/file_alpha1.xlsx']
Running the ideal function on this list would yield a list with the following strings:
result = ['choice1.zip', 'zebra1.doc', 'alpha1.xlsx']
In other words, these are the strings that remain when you remove the characters shared at the start of all three list items in sample_lst.
CodePudding user response:
OK, now that you have provided some additional clarification, I think I understand what you are looking for. Thanks. I would break the problem into two steps:
- Find the longest common initial substring of your input strings
- Find the "remainder" of each string after removing the longest common initial substring.
def updateCommonInitialSubstring(s, t):
    # Return the longest prefix shared by s and t.
    nchars = min(len(s), len(t))
    for i in range(nchars):
        if s[i] != t[i]:
            return s[:i]
    return s[:nchars]


def findLongestCommonInitialSubstring(strings):
    # Narrow the candidate prefix against each remaining string in turn.
    common = strings[0]
    for s in strings[1:]:
        common = updateCommonInitialSubstring(common, s)
    return common


def findDistinctSuffixes(strings):
    # Find the longest common initial substring of the strings.
    common = findLongestCommonInitialSubstring(strings)
    # Collect what is left of each string after removing that prefix.
    suffixes = []
    for s in strings:
        suffixes.append(s[len(common):])
    return suffixes
With this implementation the following test case passes:
def test_findDistinctSuffixes2():
    # Arrange
    sample_lst = ['c:/universal/bin/library/file_choice1.zip',
                  'c:/universal/bin/library/file_zebra1.doc',
                  'c:/universal/bin/library/file_alpha1.xlsx']
    # Act
    result = findDistinctSuffixes(sample_lst)
    # Assert
    assert result == ['choice1.zip', 'zebra1.doc', 'alpha1.xlsx']
which I think is what you were looking for.
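As an alternative sketch, the standard library's os.path.commonprefix performs the same character-by-character prefix comparison, so the two helper functions above could be collapsed into one call. The name distinct_suffixes is only an illustrative choice and does not come from the question.

```python
import os.path


def distinct_suffixes(strings):
    """Return each string with the shared leading characters removed."""
    # commonprefix compares character by character (not path component by
    # path component) and returns '' when given an empty list.
    common = os.path.commonprefix(strings)
    return [s[len(common):] for s in strings]


sample_lst = ['c:/universal/bin/library/file_choice1.zip',
              'c:/universal/bin/library/file_zebra1.doc',
              'c:/universal/bin/library/file_alpha1.xlsx']
print(distinct_suffixes(sample_lst))
# ['choice1.zip', 'zebra1.doc', 'alpha1.xlsx']
```

For the two-string sample from the question, distinct_suffixes(['abcdefg', 'abcdefghi']) gives ['', 'hi'], because the first string is itself the whole common prefix.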
CodePudding user response:
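Simple solution:
def simple_difference(s, t):
    # simple_difference is an illustrative name; s and t are the two strings
    # Make s the longer of the two strings
    if len(s) < len(t):
        s, t = t, s
    diff = ""
    for ch in s:
        if ch not in t:
            diff += ch  # accumulate every character of s that is missing from t
    return diff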
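Complex solution:
def first_count_mismatch(s, t):
    # first_count_mismatch is an illustrative name; s and t are the two strings
    # Count how often each character occurs in s and in t
    s_hash = {}
    t_hash = {}
    for i in s:
        if i not in s_hash:
            s_hash[i] = 1
        else:
            s_hash[i] += 1
    for i in t:
        if i not in t_hash:
            t_hash[i] = 1
        else:
            t_hash[i] += 1
    # Return the first character of t that is missing from s
    # or that occurs a different number of times in s
    for i in t_hash:
        if i not in s_hash or s_hash[i] != t_hash[i]:
            return i
    return None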
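The counting idea in the complex solution can also be sketched with collections.Counter, which does the bookkeeping for you. This is only an illustration: extra_chars is a made-up name, and the function reports surplus characters regardless of where they occur, so it is not a substitute for a prefix-based approach when the strings differ in the middle.

```python
from collections import Counter


def extra_chars(s, t):
    """Return the characters that the longer string has in surplus."""
    # Make s the longer of the two strings
    if len(s) < len(t):
        s, t = t, s
    # Counter subtraction keeps only characters with a positive remaining count
    surplus = Counter(s) - Counter(t)
    # elements() repeats each character as many times as its count
    return ''.join(surplus.elements())


print(extra_chars('abcdefg', 'abcdefghi'))  # hi
```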