Home > Back-end >  Find shortest wildcards for list of strings against a master list by prefix in Python
Find shortest wildcards for list of strings against a master list by prefix in Python

Time:06-09

As input, I have:

  • A master list of strings (in this case, the names of API calls) that contains every possible valid string in the input list.
  • An input list of strings, each of which is guaranteed to exist in the master list, but may not contain every element of the master list.

As output, I want:

  • Process the input list, using * as a wildcard, with as few elements in the output list as possible, to list every element of the input without "incorrectly" pulling in additional items.
  • Note: A wildcard can match zero or more characters.

What approach can I take to implementing this somewhat efficiently on the CPU in Python? I'm going to be handling master_list and input_list sizes of O(50), so even if there is a lot of computational complexity, Python can probably chew through it given the smallish sizes of the input lists (I'm hoping).

Note: before invoking the function that would be the answer to this question, I'm going to call .lower() on all input strings, both in the master list and the input list, so the solution doesn't need to think about casing.

Note(2): Before running the answer to this question, I have already sanitized the input_list to verify that every element of it exists within master_list. It is an error to provide input in input_list that does not exist in master_list.


Example

master_list = [ "GetFoo", "GetBar", "GetBaz", 
  "PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
  "ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]

input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]

# Do processing here ...

#This is what I expect as output.

correct_output_list = [ "GetB*", "PutFooO*", "ViewBar*" ]

wrong_output_list = [ "Get*", "PutFooOtherThing", "ViewB*" ]

correct_output_list is correct because:

  • "GetB*" refers to GetBar and GetBaz from master_list, and both of these are specified in input_list.
  • "PutFooO*" only refers to PutFooOtherThing in master_list, but that's OK -- we still use a wildcard because this is shorter, and therefore saves some bytes (saving bytes is the goal of this function).
  • "ViewBar*" refers to ViewBar and ViewBarThing, but, crucially, excludes ViewBaz, which is part of master_list but not input_list.

wrong_output_list is wrong because:

  • "Get*" incorrectly pulls in GetFoo, which is part of master_list but not in input_list.
  • "PutFooOtherThing" doesn't pull in any extra elements of master_list that are absent in input_list, which is "fine" in a sense, but it doesn't take advantage of the opportunistic "compression" of shortening PutFooOtherThing to PutFooO* to save bytes.
  • "ViewB*" is too broad, because it pulls in ViewBaz, which is not specified in input_list, only in master_list.

CodePudding user response:

I had to significantly modify the OpenAI Davinci guess, but I managed to get it to pass the input.

#!/usr/bin/env python3

#Test Data
master_list = [ "GetFoo", "GetBar", "GetBaz",
    "PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
    "ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]
input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]
#Example of correct output given the input_list and master_list
correct_output_list = [ "GetB*", "PutFooO*", "ViewBar*" ]
#Example of wrong output with various problems - for illustration only
wrong_output_list = [ "Get*", "PutFooOtherThing", "ViewB*" ]

#Return whether `stir` is in the list `lst`, ignoring case.
def iic(stir: str, lst: list[str]):
    return stir.casefold() in (x.casefold() for x in lst)

#Unit test. Absence of `FAILED` in output indicates the test passed.
def test_result(actual: list[str], expected: list[str]):
    actual = cf_list(actual)
    expected = cf_list(expected)
    for e in actual:
        if not iic(e, expected):
            print(f"FAILED[EXTRA]: {e}")
    for e in expected:
        if e not in actual:
            print(f"FAILED[MISSING]: {e}")

#Make a prefix dictionary for a given list of string.
def make_pfx_dict(lst: list[str]):
    prefix_dict = {}
    for string in lst:
        for i in range(1,len(string) 1):
            prefix = string[:i]
            if prefix not in prefix_dict:
                prefix_dict[prefix] = []
            prefix_dict[prefix].append(string)
    return prefix_dict

#Casefold a list
def cf_list(lst: list[str]):
    retval = []
    for i in lst:
        retval.append(i.casefold())
    return retval

#Main method that I needed. See requirements.
def process_input(ml, il):
    ml = cf_list(ml)
    il = cf_list(il)
    ml_pfx_dict: dict[str, list[str]] = make_pfx_dict(ml)
    il_pfx_dict: dict[str, list[str]] = make_pfx_dict(il)
    
    il = sorted(il, key=len, reverse=True)
    output_list = []
    
    for string in il:
        #Check if it's already in the output
        skip = False
        for o in output_list:
            if string.startswith(o.replace("*", "")):
                skip = True
                break
        if skip == True:
            continue

        for i in range(1,len(string) 1):
            prefix = string[:i]
            if i == len(string):
                output_list.append(string)
                del ml_pfx_dict[string]
                break
            if set(il_pfx_dict[prefix]) == set(ml_pfx_dict[prefix]):
                output_list.append(prefix   '*')
                for j in range(i, len(string) 1):
                    pfx = string[:j]
                    del ml_pfx_dict[pfx]
                break
    
    print(f"output_list = {output_list}")
    return output_list

rslt = process_input(master_list, input_list)
print(f"correct_output_list = {cf_list(correct_output_list)}")
test_result(rslt, correct_output_list)

CodePudding user response:

This is not a full answer, but is what OpenAI's Davinci coding system was able to come up with in a few tries. (It's free to use right now!) Some modification to this approach should actually be correct.

'''
As input, I have:

A master list of strings (in this case, the names of API calls) that contains every possible valid string in the input list.
An input list of strings, each of which is guaranteed to exist in the master list, but may not contain every element of the master list.
As output, I want:

Process the input list, using * as a wildcard, with as few elements in the output list as possible, to list every element of the input without "incorrectly" pulling in additional items.
Note: A wildcard can match zero or more characters.
What approach can I take to implementing this somewhat efficiently on the CPU in Python? I'm going to be handling master_list and input_list sizes of O(50), so even if there is a lot of computational complexity, Python can probably chew through it given the smallish sizes of the input lists (I'm hoping).

Note: before invoking the function that would be the answer to this question, I'm going to call .lower() on all input strings, both in the master list and the input list, so the solution doesn't need to think about casing.

Example
master_list = [ "GetFoo", "GetBar", "GetBaz", 
  "PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
  "ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]

input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]

# Do processing here ...

#This is what I expect as output.

correct_output_list = [ "GetB*", "PutFooO*", "ViewBar*" ]

wrong_output_list = [ "Get*", "PutFooOtherThing", "ViewB*" ]
'''

# Solution
def process_input(master_list, input_list):
    # Create a dictionary of all the possible prefixes, and the
    # list of strings in the master list that begin with that prefix.
    prefix_dict = {}
    for string in master_list:
        for i in range(len(string)):
            prefix = string[:i]
            if prefix not in prefix_dict:
                prefix_dict[prefix] = []
            prefix_dict[prefix].append(string)
    
    # Sort the input list by length, longest first.
    input_list = sorted(input_list, key=len, reverse=True)
    
    # Initialize our output list.
    output_list = []
    
    # Iterate over the input list.
    for string in input_list:
        # Iterate over the range of possible prefixes.
        for i in range(len(string)):
            prefix = string[:i]
            # If the prefix is a key in the dictionary, and if the
            # current string is the only string in the list of strings
            # that begin with that prefix, then add the prefix to the
            # output list, and remove the prefix from the dictionary.
            if prefix in prefix_dict and len(prefix_dict[prefix]) == 1:
                output_list.append(prefix   '*')
                del prefix_dict[prefix]
                break
    
    # Return the output list.
    return output_list

# Test
master_list = [ "GetFoo", "GetBar", "GetBaz", 
  "PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
  "ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]

input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]

print process_input(master_list, input_list)

Completion started after the def process_input(master_list, input_list): line, and was done with a temperature of 0.46 (no particular reason for that, just dragged the slider up a bit).

If you fix the final print line to be Python3-compliant, the output from this code is ['PutFooO*', 'ViewBar*'] which is not perfect, but should at least provide a basis for something that works.

EDIT: After actually looking at the above code, it seems to be really bad for doing what you actually want. (As you note, you had to significantly modify the code.)

I also have not looked at the following code, but the explanation of how to solve the problem may be helpful. I followed a practice from this paper, and also used the new insert function instead of completion.

However, it is worth noting that the following code produces the output ['', '', '', '', ''], which is very much not what you're looking for. So this code likely also needs a lot of fixing.

'''
As input, I have:

A master list of strings (in this case, the names of API calls) that contains every possible valid string in the input list.
An input list of strings, each of which is guaranteed to exist in the master list, but may not contain every element of the master list.
As output, I want:

Process the input list, using * as a wildcard, with as few elements in the output list as possible, to list every element of the input without "incorrectly" pulling in additional items.
Note: A wildcard can match zero or more characters.
What approach can I take to implementing this somewhat efficiently on the CPU in Python? I'm going to be handling master_list and input_list sizes of O(50), so even if there is a lot of computational complexity, Python can probably chew through it given the smallish sizes of the input lists (I'm hoping).

Note: before invoking the function that would be the answer to this question, I'm going to call .lower() on all input strings, both in the master list and the input list, so the solution doesn't need to think about casing.

Example
master_list = [ "GetFoo", "GetBar", "GetBaz", 
  "PutFoo", "PutBar", "PutBaz", "PutFooThing", "PutFooOtherThing",
  "ViewFoo", "ViewBar", "ViewBarThing", "ViewBaz" ]

input_list = [ "GetBar", "GetBaz", "PutFooOtherThing", "ViewBar", "ViewBarThing" ]

# Do processing here ...

#This is what I expect as output.

correct_output_list = [ "GetB*", "PutFooO*", "ViewBar*" ]

wrong_output_list = [ "Get*", "PutFooOtherThing", "ViewB*" ]


Solution
Let's think step by step. How do you find the longest matching prefix for two strings? You could iterate over the characters of both strings, comparing the characters at each index. If they match, compare the next character. If they don't match, you've found your longest matching prefix.

Let's write a function to do this.

def longest_common_prefix(a, b):
    '''
    a, b: two strings
    returns: the longest prefix of a, up to the first character where they differ
    '''
    longest_prefix = ''
    for i, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a == char_b:
            longest_prefix  = char_a
        else:
            return longest_prefix
    return longest_prefix

Now, let's apply this to our problem.

We can work through the master list and the input list, comparing each element of the input list to every element of the master list, and keeping track of the longest matching prefix for each input element.

def longest_common_prefixes(master_list, input_list):
    '''
    master_list: a master list of strings
    input_list: a list of strings
    returns: a list of the longest matching prefix for each element of the input list
    '''
    longest_prefixes = []
    for i, input_value in enumerate(input_list):
        longest_prefix = input_value
        for master_value in master_list:
            longest_prefix = longest_common_prefix(longest_prefix, master_value)
        longest_prefixes.append(longest_prefix)
    return longest_prefixes

This gives us a list of the longest matching prefixes for each element of the input list. But how do we get our final output, which is the shortest list of strings that correctly matches all of the input strings?

In this case, we can use a simple greedy approach. We can iterate through the longest_prefixes list, and for each element, we can see if we can eliminate that element without causing any of the other elements to be incorrectly matched. If so, we can safely remove that element from the list, and continue iterating.

def shortest_matching_prefixes(longest_prefixes):
    '''
    longest_prefixes: a list of the longest matching prefix for each element of the input list
    returns: a list of the shortest matching prefixes for all of the input list
    '''
    shortest_prefixes = longest_prefixes
    while True:
        # Iterate over all of the longest prefixes
        for longest_prefix in longest_prefixes:
            # See if there is a longest_prefix that is a suffix of any other longest prefix
            # If so, it can be safely removed
            if any(longest_prefix.startswith(other_longest_prefix) and longest_prefix != other_longest_prefix
                   for other_longest_prefix in longest_prefixes):
                shortest_prefixes.remove(longest_prefix)
                break
        else:
            # No longest prefix can be safely removed
            return shortest_prefixes

Putting these functions together, we have our solution.

def wildcard_matching(master_list, input_list):
    '''
    master_list: a master list of strings
    input_list: a list of strings
    returns: a list of the shortest matching prefixes for all of the input list
    '''
    longest_prefixes = longest_common_prefixes(master_list, input_list)
    shortest_prefixes = shortest_matching_prefixes(longest_prefixes)
    return shortest_prefixes


'''
Notes on Time Complexity
The longest_common_prefixes function is linear in the number of elements in the master list, and quadratic with respect to the number of elements in the input list. We can follow the same pattern in the longest_prefixes and shortest_prefixes functions, for a total running time of O(m * n^2), where m is the number of elements in the master list, and n is the number of elements in the input list.

The wildcard_matching function should be replaced with a function that can control the size of the input list. This function would be linear in the number of elements in the master list, and would take some constant amount of time to do processing per element of the input list, giving a total running time of O(m * n).

This solution can be further optimized by using a more complex algorithm to find the longest common prefix of two strings, but the time complexity will remain O(m * n). In this case, it might be better to use a suffix tree, which can be constructed in linear time.

'''
  • Related