I have a long list of strings (or a column in a pandas DataFrame) that I want to split into two groups based on the values in a separate reference list. I'd like to do this in a Pythonic way, not just iterate and match.
Input:
my_list_or_column = ["this is a test", "blank text", "another test", "do not select this" ]
ref_list = ["test", "conduct"]
Now I should be able to separate out the sentences that contain a word from ref_list.
Output:
match = ["this is a test" .... ]
did_not_match = ["do not select this"]
Any help?
CodePudding user response:
How about:
my_list_or_column = ["this is a test", "blank text", "another test", "do not select this" ]
ref_list = ["test", "conduct"]
def is_contain(col):
    for ref in ref_list:
        if ref in col:
            return True
    return False

print(list(filter(is_contain, my_list_or_column)))
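The snippet above only prints the matching strings, while the question asks for both groups. A minimal sketch of one way to collect both in a single pass, assuming plain substring matching is what is wanted (the helper name partition_by_refs is hypothetical, not from the original answer):

```python
def partition_by_refs(strings, refs):
    """Split strings into (match, did_not_match) using substring containment."""
    match, did_not_match = [], []
    for s in strings:
        # any() stops at the first reference value found inside s
        if any(ref in s for ref in refs):
            match.append(s)
        else:
            did_not_match.append(s)
    return match, did_not_match


my_list_or_column = ["this is a test", "blank text", "another test", "do not select this"]
ref_list = ["test", "conduct"]

match, did_not_match = partition_by_refs(my_list_or_column, ref_list)
print(match)          # ['this is a test', 'another test']
print(did_not_match)  # ['blank text', 'do not select this']
```

This still iterates under the hood, but any() with a generator keeps the membership check short and exits early.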
CodePudding user response:
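You can convert ref_list to a set and compare each string's words against it with set operations instead of iterating over a list. This may be especially useful if ref_list is large. Note that this compares whole whitespace-separated words, not substrings.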
did_not_match = []
match = []
my_set = set(ref_list)
for string in my_list_or_column:
    set_string = set(string.split())
    if set_string - my_set != set_string:
        match.append(string)
    else:
        did_not_match.append(string)
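Because the loop above splits each string into words, it behaves differently from a substring check when a reference value appears inside a longer word. A small sketch of that difference, using set.isdisjoint as an equivalent way to test for a shared word (the sample sentences and helper names here are made up for illustration):

```python
ref_set = {"test", "conduct"}
# Hypothetical samples: "testing" contains "test" only as a substring
samples = ["this is a test", "testing in progress", "blank text"]


def has_ref_word(sentence, refs):
    # split() breaks on whitespace only, so punctuation stays attached to words
    return not refs.isdisjoint(sentence.split())


def has_ref_substring(sentence, refs):
    return any(ref in sentence for ref in refs)


word_hits = [s for s in samples if has_ref_word(s, ref_set)]
substring_hits = [s for s in samples if has_ref_substring(s, ref_set)]

print(word_hits)       # ['this is a test']
print(substring_hits)  # ['this is a test', 'testing in progress']
```

Which behavior is right depends on whether "testing" should count as a hit for "test".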
Since you mentioned that my_list_or_column could be a pandas DataFrame column, you can also create a boolean mask and filter for the relevant text:
import pandas as pd

my_Series = pd.Series(my_list_or_column)
# str.contains treats the joined string as a regex: "test|conduct"
mask = my_Series.str.contains('|'.join(ref_list))
match = my_Series[mask].tolist()
did_not_match = my_Series[~mask].tolist()
Output:
>>> print(match)
['this is a test', 'another test']
>>> print(did_not_match)
['blank text', 'do not select this']
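One caveat with the '|'.join(ref_list) pattern: str.contains interprets it as a regular expression by default, so reference values containing characters such as "." or "+" may match more than intended. A sketch of the effect using only the standard re module, assuming a hypothetical extra reference value "3.5" and made-up sample strings; the escaped pattern string could be passed to str.contains in the same way:

```python
import re

ref_list = ["test", "conduct", "3.5"]  # "3.5" is a hypothetical extra entry
samples = ["version 3.5 released", "room 345", "another test"]

# Unescaped: "." matches any character, so "3.5" also matches "345"
raw_pattern = "|".join(ref_list)
# Escaped: every reference value is matched literally
safe_pattern = "|".join(re.escape(ref) for ref in ref_list)

raw_hits = [s for s in samples if re.search(raw_pattern, s)]
safe_hits = [s for s in samples if re.search(safe_pattern, s)]

print(raw_hits)   # ['version 3.5 released', 'room 345', 'another test']
print(safe_hits)  # ['version 3.5 released', 'another test']
```

For the reference list in the question, which contains only plain letters, both patterns give the same result.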