How do I find strings that contain a word or group of words from a list?

I have a long list of strings (or a column in a pandas DataFrame) that I want to split into two groups, depending on whether each string contains any value from a separate reference list. I would like to do this in a Pythonic way, not just iterate and match.

Input:
my_list_or_column = ["this is a test", "blank text", "another test", "do not select this" ]
ref_list = ["test", "conduct"]

Now I should be able to separate the sentences that contain a word from ref_list.

Output:
match = ["this is a test" .... ]
did_not_match = ["do not select this"]

Any help?

CodePudding user response:

How about:

my_list_or_column = ["this is a test", "blank text", "another test", "do not select this" ]
ref_list = ["test", "conduct"]

def is_contain(col):
  # True if any reference string occurs as a substring of col
  return any(ref in col for ref in ref_list)

# Pass the function directly; wrapping it in a lambda is unnecessary
print(list(filter(is_contain, my_list_or_column)))
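
Since the question asks for both groups, here is a minimal sketch of the same substring check producing two lists. The helper name contains_any is just an illustrative choice, not something from the question:

```python
my_list_or_column = ["this is a test", "blank text", "another test", "do not select this"]
ref_list = ["test", "conduct"]

def contains_any(text, refs):
    # Hypothetical helper: True if any reference string is a substring of text
    return any(ref in text for ref in refs)

# Two comprehensions keep the original order within each group
match = [s for s in my_list_or_column if contains_any(s, ref_list)]
did_not_match = [s for s in my_list_or_column if not contains_any(s, ref_list)]

print(match)
print(did_not_match)
```

This evaluates the check twice per string, which is usually fine for a list; for very long inputs a single loop that appends to one list or the other avoids the repeated work.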

CodePudding user response:

You can convert ref_list to a set and check each string's words against it instead of iterating over the list for every string. This may be especially useful if ref_list is large.

did_not_match = []
match = []
my_set = set(ref_list)
for string in my_list_or_column:
    set_string = set(string.split())
    # Match if the string shares at least one whole word with ref_list
    if not set_string.isdisjoint(my_set):
        match.append(string)
    else:
        did_not_match.append(string)
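
One thing to be aware of: because the loop above splits on whitespace, it matches whole words only, whereas a check like `ref in string` matches substrings. A small sketch of the difference (the sample strings here are made up for illustration):

```python
ref_set = {"test", "conduct"}
# Hypothetical samples chosen to show where the two checks disagree
samples = ["this is a test", "testing in progress", "end of test."]

results = {}
for text in samples:
    words = set(text.split())
    # Whole-word check: at least one word is exactly a reference word
    word_hit = not words.isdisjoint(ref_set)
    # Substring check: a reference word appears anywhere in the text
    substring_hit = any(ref in text for ref in ref_set)
    results[text] = (word_hit, substring_hit)
    print(f"{text!r}: whole word={word_hit}, substring={substring_hit}")
```

Only the first sample passes both checks. "testing" and "test." are not equal to "test" as words, so the set version skips them even though they contain it.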

Since you mentioned that my_list_or_column could be a pandas DataFrame column, you can also build a boolean mask and use it to filter the text:

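import re
import pandas as pd

my_Series = pd.Series(my_list_or_column)
# str.contains treats the pattern as a regex, so escape each reference string
mask = my_Series.str.contains('|'.join(map(re.escape, ref_list)))
match = my_Series[mask].tolist()
did_not_match = my_Series[~mask].tolist()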

Output:

>>> print(match)
['this is a test', 'another test']

>>> print(did_not_match)
['blank text', 'do not select this']
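
If you want whole-word matching with a single pattern, one option is to wrap the alternation in word boundaries. Below is a rough sketch using only the standard re module; the extra "contest entry" string is a hypothetical addition to show the boundary at work. I would expect the same pattern string to work with str.contains too, though I have not tested that here.

```python
import re

# Original data plus one hypothetical string that contains "test" inside a word
my_list_or_column = ["this is a test", "blank text", "another test",
                     "do not select this", "contest entry"]
ref_list = ["test", "conduct"]

# Escape each reference string, join with "|", and require word boundaries
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ref_list)) + r")\b")

match = [s for s in my_list_or_column if pattern.search(s)]
did_not_match = [s for s in my_list_or_column if not pattern.search(s)]

print(match)
print(did_not_match)
```

The non-capturing group (?:...) keeps the boundaries applied to every alternative rather than only the first and last.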