How do you find whether a list is a sublist of another in Python while accounting for duplicates-CodePudding

I'm currently trying to figure out whether a list of strings is a valid sublist of another list of strings. This question has been asked many times before but I haven't seen a solution for when either the list or sublist includes duplicates.

Say the list of letters is ['A', 'B', 'C', 'D'] and the word is ['A', 'B', 'C']. There are many ways to check that the word is a valid sublist. But if the word is ['A', 'B', 'C', 'C'], I need the output to return that the word is an invalid sublist, because there aren't two available "C's" in the list of letters. This means that using the all() function as well as sets doesn't work because they do not check for duplicates properly.

I have also tried using some sort of tracker and string.rfind() to see if all of the letters in the sublist are letters in the list, but this fails as well. Here is the code that I tried:

def firstFilter(dictionary, board):
    initFilter = []
    for word in dictionary:
        tracker = 0
        if 3 <= word <= 16:
            for j in range(len(board)):
                for k in range(len(board[j])):
                    letter = board[j][k]
                    check = word.rfind(letter)
                    if check != -1:
                        tracker  = 1
            if tracker >= len(word):
                initFilter.append(word)
    initFilter = sorted(initFilter, key=len, reverse=True)
    print('Words available:', len(initFilter))
    return initFilter

This code checks each letter in a list of letters to see if the letter is found in the word, where the full list of letters is the list, and each word is the sublist. But this method also has an issue. If there are duplicates in the full letter list, then the tracker is longer than the length of the word. For instance, if the letter list is ['A', 'B', 'C', 'C'] and word is ['A', 'B', 'C'], then the tracker takes value 4 and the length of the word is 3. This is why I use >= and not ==.

But this runs into another problem if the letter list is ['A', 'B', 'C', 'C'] and the word list is and word is ['A', 'B', 'C', 'D']. The duplicate letters in the letter list cause the tracker to take value 4, and the word length is 4, so the code returns that the word is a valid sublist of the letter list, even though there is no 'D' in the letter list.

Is there any way I can avoid both of these problems? I am a novice Python programmer, but I couldn't find any help from past questions on this site because none of them I could find addressed both of these duplicate issues when the list and/or the sublist contains duplicates.

CodePudding user response：

You can use collections.Counter, by subtracting the large Counter from the small one. If the output is not empty, then the small is not a subset:

def issubset_replicate(A, B):
    '''check if A is a subset of B'''
    from collections import Counter
    return Counter(A) - Counter(B) == Counter()
    
issubset_replicate('ABC', 'ABCCD')
# True

issubset_replicate('ABCCC', 'ABCCD')
# False

issubset_replicate('ABE', 'ABCCD')
# False

how does it work?

When subtracting two counters, if a value becomes null or negative, the key is deleted. Thus is A is a subset of B, taking into account replicates, Counter(A) - Counter(B) should be null.

CodePudding user response：

You need to remove the words letters from the available letters, since there can be multiple time the same letter in a word.

Give this function the word and the available letters, and it will return false if the word is invalid, or the remaining letters if the word is valid :

def check_word(word, available_letters):
  for w in word:
    if not w in available_letters:
      return False
    available_letters.remove(w)
  return available_letters

examples :

>>> print(check_word(['a', 'b', 'c', 'd'], ['a', 'b', 'c']))
False
>>> print(check_word(['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']))
[]
>>> print(check_word(['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'c', 'd']))
['c']

CodePudding user response：

Make dictionaries of each list where the keys are the letters and the values are the letter counts.

  reference - ['A', 'B', 'C', 'D'] --> {'A':1,'B':1,'C':1,'D':1}
  test - ['A', 'B', 'C', 'C'] --> {'A':1,'B':1,'C':2}

To be a subset,
- all the keys in test must be in the reference AND
- the values in test must be <= the values in the reference.

CodePudding user response：

You could use itertools to check if the sequence of interest (b) is in any of the n-length combinations of a:

import itertools

def func(a, b):
    if len(b) > len(a):
        return False
    a, b = sorted(a), sorted(b)
    for n in range(1, len(a)   1):
        if tuple(b) in itertools.combinations(a, n):
            return True
    return False

>>> func(["A", "B", "C", "D"], ["A", "B", "C"])
True
>>> func(["A", "B", "C", "D"], ["A", "B", "C", "C"])
False