There are some similar phrases and words in a list of lists, and I wish to remove them.
Original list:
lst1 = [['daniel philips', 'philips daniel', 'daniel philips william'],['cherry', 'mary', 'cherry mary']]
Looking for output:
lst2 = [['daniel philips william'],['cherry mary']]
My code:
keyword_list = []
for list_a in lst1:
    temp_list = []
    while len(list_a)>0:
        popped_item = list_a.pop(0)
        popped_item = str(popped_item)
        is_subset = False
        for item in list_a:
            if popped_item in item:
                is_subset = True
                break
        if is_subset == False:
            temp_list.append(popped_item)
    keyword_list.append(temp_list)
However, it is not giving me the desired results.
CodePudding user response:
For your particular inputs (where there aren't any disjoint names in either list), all you really need is the longest element of each list:
>>> lst1 = [['daniel philips', 'philips daniel', 'daniel philips william'],['cherry', 'mary', 'cherry mary']]
>>> [max(s, key=len) for s in lst1]
['daniel philips william', 'cherry mary']
To actually determine whether any of the names are subsets of the others, though, turn them into actual sets so you can simply compare them and filter out the sets that are subsets of the longest set (or any other set in the list):
>>> [[n for n in s if not any(set(n.split()) < set(m.split()) for m in s)] for s in lst1]
[['daniel philips william'], ['cherry mary']]
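The same idea can be written as a small reusable function. This is only a sketch: drop_subsumed is a made-up name, and it assumes names should be compared as sets of whitespace-separated words, so word order is ignored. That assumption is also why the substring test in the question (popped_item in item) misses 'philips daniel': it is not a substring of 'daniel philips william', even though its words are a subset of that name's words.

```python
def drop_subsumed(group):
    """Keep only names whose word set is not a strict subset of another name's.

    Hypothetical helper for illustration; names are split on whitespace.
    """
    # Split each name once instead of re-splitting inside the nested loop.
    word_sets = [set(name.split()) for name in group]
    return [
        name
        for name, words in zip(group, word_sets)
        # `<` on sets is a strict-subset test, so a name never removes itself.
        if not any(words < other for other in word_sets)
    ]


lst1 = [
    ["daniel philips", "philips daniel", "daniel philips william"],
    ["cherry", "mary", "cherry mary"],
]
result = [drop_subsumed(group) for group in lst1]
print(result)
```

Because the comparison is a strict subset, two names with exactly the same words (for example only 'daniel philips' and 'philips daniel') would both be kept; if one of them should go, that case needs separate handling.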
CodePudding user response:
A fairly explicit solution would be something like this:
lst1 = [['daniel philips', 'philips daniel', 'daniel philips william'], ['cherry', 'mary', 'cherry mary']]
result = [[
    name for name in group
    if not any(all(part in other_name.split() for part in name.split())
               for other_name in group
               if other_name != name)
] for group in lst1]
print(result)
I quite liked the solution @DYZ posted and then removed, though:
result = [[max(zip(map(set, map(str.split, group)), group))[1]] for group in lst1]
In both cases the result is:
[['daniel philips william'], ['cherry mary']]
However, which one you prefer depends on what you need to happen in this case:
lst1 = [['john', 'john doe', 'mary white']]
If you need [['john doe', 'mary white']], my solution works. Otherwise, if you just want [['john doe']] and want to ignore the later 'mary white', you might prefer @DYZ's.
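To see the difference concretely, here is a sketch that runs both approaches on that one group (the variable names kept and single are only for illustration):

```python
group = ["john", "john doe", "mary white"]

# Explicit filter: drop a name only when all of its words occur in some
# other name of the same group.
kept = [
    name for name in group
    if not any(
        all(part in other.split() for part in name.split())
        for other in group
        if other != name
    )
]

# max-based one-liner: pairs each name with its word set and keeps one name.
# Sets are only partially ordered (`>` means strict superset), so unrelated
# names never displace the current maximum.
single = [max(zip(map(set, map(str.split, group)), group))[1]]

print(kept)
print(single)
```

Since unrelated word sets do not compare as greater or smaller, the name picked by the max-based version can depend on the order of the group, so it seems safest when every name in a group is expected to be a subset of one longest name.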