As input, I have:
- A master list of strings (in this case, the names of API calls) that contains every possible valid string in the input list.
- An input list of strings, each of which is guaranteed to exist in the master list, but may not contain every element of the master list.
As output, I want:
- Process the input list, using
*
as a wildcard, with as few elements in the output list as possible, to list every element of the input without "incorrectly" pulling in additional items. - Note: A wildcard can match zero or more characters.
What approach can I take to implementing this somewhat efficiently on the CPU in Python? I'm going to be handling master_list
and input_list
sizes of O(50), so even if there is a lot of computational complexity, Python can probably chew through it given the smallish sizes of the input lists (I'm hoping).
Note: before invoking the function that would be the answer to this question, I'm going to call .lower()
on all input strings, both in the master list and the input list, so the solution doesn't need to think about casing.
Note(2): Before running the answer to this question, I have already sanitized the input_list
to verify that every element of it exists within master_list
. It is an error to provide input in input_list
that does not exist in master_list
.
Example
master_list = [ "GetFoo", "GetBar", "GetBaz",
"PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
"ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]
input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]
# Do processing here ...
#This is what I expect as output.
correct_output_list = [ "GetB*", "PutFooO*", "ViewBar*" ]
wrong_output_list = [ "Get*", "PutFooOtherThing", "ViewB*" ]
correct_output_list
is correct because:
"GetB*"
refers toGetBar
andGetBaz
frommaster_list
, and both of these are specified ininput_list
."PutFooO*"
only refers toPutFooOtherThing
inmaster_list
, but that's OK -- we still use a wildcard because this is shorter, and therefore saves some bytes (saving bytes is the goal of this function)."ViewBar*"
refers toViewBar
andViewBarThing
, but, crucially, excludesViewBaz
, which is part ofmaster_list
but notinput_list
.
wrong_output_list
is wrong because:
"Get*"
incorrectly pulls inGetFoo
, which is part ofmaster_list
but not ininput_list
."PutFooOtherThing"
doesn't pull in any extra elements ofmaster_list
that are absent ininput_list
, which is "fine" in a sense, but it doesn't take advantage of the opportunistic "compression" of shorteningPutFooOtherThing
toPutFooO*
to save bytes."ViewB*"
is too broad, because it pulls inViewBaz
, which is not specified ininput_list
, only inmaster_list
.
CodePudding user response:
I had to significantly modify the OpenAI Davinci guess, but I managed to get it to pass the input.
#!/usr/bin/env python3
#Test Data
master_list = [ "GetFoo", "GetBar", "GetBaz",
"PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
"ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]
input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]
#Example of correct output given the input_list and master_list
correct_output_list = [ "GetB*", "PutFooO*", "ViewBar*" ]
#Example of wrong output with various problems - for illustration only
wrong_output_list = [ "Get*", "PutFooOtherThing", "ViewB*" ]
#Return whether `stir` is in the list `lst`, ignoring case.
def iic(stir: str, lst: list[str]):
return stir.casefold() in (x.casefold() for x in lst)
#Unit test. Absence of `FAILED` in output indicates the test passed.
def test_result(actual: list[str], expected: list[str]):
actual = cf_list(actual)
expected = cf_list(expected)
for e in actual:
if not iic(e, expected):
print(f"FAILED[EXTRA]: {e}")
for e in expected:
if e not in actual:
print(f"FAILED[MISSING]: {e}")
#Make a prefix dictionary for a given list of string.
def make_pfx_dict(lst: list[str]):
prefix_dict = {}
for string in lst:
for i in range(1,len(string) 1):
prefix = string[:i]
if prefix not in prefix_dict:
prefix_dict[prefix] = []
prefix_dict[prefix].append(string)
return prefix_dict
#Casefold a list
def cf_list(lst: list[str]):
retval = []
for i in lst:
retval.append(i.casefold())
return retval
#Main method that I needed. See requirements.
def process_input(ml, il):
ml = cf_list(ml)
il = cf_list(il)
ml_pfx_dict: dict[str, list[str]] = make_pfx_dict(ml)
il_pfx_dict: dict[str, list[str]] = make_pfx_dict(il)
il = sorted(il, key=len, reverse=True)
output_list = []
for string in il:
#Check if it's already in the output
skip = False
for o in output_list:
if string.startswith(o.replace("*", "")):
skip = True
break
if skip == True:
continue
for i in range(1,len(string) 1):
prefix = string[:i]
if i == len(string):
output_list.append(string)
del ml_pfx_dict[string]
break
if set(il_pfx_dict[prefix]) == set(ml_pfx_dict[prefix]):
output_list.append(prefix '*')
for j in range(i, len(string) 1):
pfx = string[:j]
del ml_pfx_dict[pfx]
break
print(f"output_list = {output_list}")
return output_list
rslt = process_input(master_list, input_list)
print(f"correct_output_list = {cf_list(correct_output_list)}")
test_result(rslt, correct_output_list)
CodePudding user response:
This is not a full answer, but is what OpenAI's Davinci coding system was able to come up with in a few tries. (It's free to use right now!) Some modification to this approach should actually be correct.
'''
As input, I have:
A master list of strings (in this case, the names of API calls) that contains every possible valid string in the input list.
An input list of strings, each of which is guaranteed to exist in the master list, but may not contain every element of the master list.
As output, I want:
Process the input list, using * as a wildcard, with as few elements in the output list as possible, to list every element of the input without "incorrectly" pulling in additional items.
Note: A wildcard can match zero or more characters.
What approach can I take to implementing this somewhat efficiently on the CPU in Python? I'm going to be handling master_list and input_list sizes of O(50), so even if there is a lot of computational complexity, Python can probably chew through it given the smallish sizes of the input lists (I'm hoping).
Note: before invoking the function that would be the answer to this question, I'm going to call .lower() on all input strings, both in the master list and the input list, so the solution doesn't need to think about casing.
Example
master_list = [ "GetFoo", "GetBar", "GetBaz",
"PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
"ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]
input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]
# Do processing here ...
#This is what I expect as output.
correct_output_list = [ "GetB*", "PutFooO*", "ViewBar*" ]
wrong_output_list = [ "Get*", "PutFooOtherThing", "ViewB*" ]
'''
# Solution
def process_input(master_list, input_list):
# Create a dictionary of all the possible prefixes, and the
# list of strings in the master list that begin with that prefix.
prefix_dict = {}
for string in master_list:
for i in range(len(string)):
prefix = string[:i]
if prefix not in prefix_dict:
prefix_dict[prefix] = []
prefix_dict[prefix].append(string)
# Sort the input list by length, longest first.
input_list = sorted(input_list, key=len, reverse=True)
# Initialize our output list.
output_list = []
# Iterate over the input list.
for string in input_list:
# Iterate over the range of possible prefixes.
for i in range(len(string)):
prefix = string[:i]
# If the prefix is a key in the dictionary, and if the
# current string is the only string in the list of strings
# that begin with that prefix, then add the prefix to the
# output list, and remove the prefix from the dictionary.
if prefix in prefix_dict and len(prefix_dict[prefix]) == 1:
output_list.append(prefix '*')
del prefix_dict[prefix]
break
# Return the output list.
return output_list
# Test
master_list = [ "GetFoo", "GetBar", "GetBaz",
"PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
"ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]
input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]
print process_input(master_list, input_list)
Completion started after the def process_input(master_list, input_list):
line, and was done with a temperature of 0.46 (no particular reason for that, just dragged the slider up a bit).
If you fix the final print line to be Python3-compliant, the output from this code is ['PutFooO*', 'ViewBar*']
which is not perfect, but should at least provide a basis for something that works.
EDIT: After actually looking at the above code, it seems to be really bad for doing what you actually want. (As you note, you had to significantly modify the code.)
I also have not looked at the following code, but the explanation of how to solve the problem may be helpful. I followed a practice from this paper, and also used the new insert function instead of completion.
However, it is worth noting that the following code produces the output ['', '', '', '', '']
, which is very much not what you're looking for. So this code likely also needs a lot of fixing.
'''
As input, I have:
A master list of strings (in this case, the names of API calls) that contains every possible valid string in the input list.
An input list of strings, each of which is guaranteed to exist in the master list, but may not contain every element of the master list.
As output, I want:
Process the input list, using * as a wildcard, with as few elements in the output list as possible, to list every element of the input without "incorrectly" pulling in additional items.
Note: A wildcard can match zero or more characters.
What approach can I take to implementing this somewhat efficiently on the CPU in Python? I'm going to be handling master_list and input_list sizes of O(50), so even if there is a lot of computational complexity, Python can probably chew through it given the smallish sizes of the input lists (I'm hoping).
Note: before invoking the function that would be the answer to this question, I'm going to call .lower() on all input strings, both in the master list and the input list, so the solution doesn't need to think about casing.
Example
master_list = [ "GetFoo", "GetBar", "GetBaz",
"PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
"ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]
input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]
# Do processing here ...
#This is what I expect as output.
correct_output_list = [ "GetB*", "PutFooO*", "ViewBar*" ]
wrong_output_list = [ "Get*", "PutFooOtherThing", "ViewB*" ]
Solution
Let's think step by step. How do you find the longest matching prefix for two strings? You could iterate over the characters of both strings, comparing the characters at each index. If they match, compare the next character. If they don't match, you've found your longest matching prefix.
Let's write a function to do this.
def longest_common_prefix(a, b):
'''
a, b: two strings
returns: the longest prefix of a, up to the first character where they differ
'''
longest_prefix = ''
for i, (char_a, char_b) in enumerate(zip(a, b)):
if char_a == char_b:
longest_prefix = char_a
else:
return longest_prefix
return longest_prefix
Now, let's apply this to our problem.
We can work through the master list and the input list, comparing each element of the input list to every element of the master list, and keeping track of the longest matching prefix for each input element.
def longest_common_prefixes(master_list, input_list):
'''
master_list: a master list of strings
input_list: a list of strings
returns: a list of the longest matching prefix for each element of the input list
'''
longest_prefixes = []
for i, input_value in enumerate(input_list):
longest_prefix = input_value
for master_value in master_list:
longest_prefix = longest_common_prefix(longest_prefix, master_value)
longest_prefixes.append(longest_prefix)
return longest_prefixes
This gives us a list of the longest matching prefixes for each element of the input list. But how do we get our final output, which is the shortest list of strings that correctly matches all of the input strings?
In this case, we can use a simple greedy approach. We can iterate through the longest_prefixes list, and for each element, we can see if we can eliminate that element without causing any of the other elements to be incorrectly matched. If so, we can safely remove that element from the list, and continue iterating.
def shortest_matching_prefixes(longest_prefixes):
'''
longest_prefixes: a list of the longest matching prefix for each element of the input list
returns: a list of the shortest matching prefixes for all of the input list
'''
shortest_prefixes = longest_prefixes
while True:
# Iterate over all of the longest prefixes
for longest_prefix in longest_prefixes:
# See if there is a longest_prefix that is a suffix of any other longest prefix
# If so, it can be safely removed
if any(longest_prefix.startswith(other_longest_prefix) and longest_prefix != other_longest_prefix
for other_longest_prefix in longest_prefixes):
shortest_prefixes.remove(longest_prefix)
break
else:
# No longest prefix can be safely removed
return shortest_prefixes
Putting these functions together, we have our solution.
def wildcard_matching(master_list, input_list):
'''
master_list: a master list of strings
input_list: a list of strings
returns: a list of the shortest matching prefixes for all of the input list
'''
longest_prefixes = longest_common_prefixes(master_list, input_list)
shortest_prefixes = shortest_matching_prefixes(longest_prefixes)
return shortest_prefixes
'''
Notes on Time Complexity
The longest_common_prefixes function is linear in the number of elements in the master list, and quadratic with respect to the number of elements in the input list. We can follow the same pattern in the longest_prefixes and shortest_prefixes functions, for a total running time of O(m * n^2), where m is the number of elements in the master list, and n is the number of elements in the input list.
The wildcard_matching function should be replaced with a function that can control the size of the input list. This function would be linear in the number of elements in the master list, and would take some constant amount of time to do processing per element of the input list, giving a total running time of O(m * n).
This solution can be further optimized by using a more complex algorithm to find the longest common prefix of two strings, but the time complexity will remain O(m * n). In this case, it might be better to use a suffix tree, which can be constructed in linear time.
'''