Remove reversed similar phrases in list of list python

Time:11-17

Each sublist in my list of lists contains phrases that are reordered or shorter versions of other phrases in the same sublist, and I want to remove them.

Original list:

lst1 = [['daniel philips', 'philips daniel', 'daniel philips william'],['cherry', 'mary', 'cherry mary']]

Looking for output:

lst2 = [['daniel philips william'],['cherry mary']]

My code:

keyword_list = []
for list_a in lst1:
    temp_list = []
    while len(list_a)>0:
        popped_item = list_a.pop(0)
        popped_item = str(popped_item)

        is_subset = False
        for item in list_a:
            if popped_item in item:
                is_subset = True
                break

        if is_subset == False:
            temp_list.append(popped_item)
    keyword_list.append(temp_list)

However, it is not giving me the desired results.

CodePudding user response:

For your particular inputs (where there aren't any disjoint names in either list), all you really need is the longest element of each list:

>>> lst1 = [['daniel philips', 'philips daniel', 'daniel philips william'],['cherry', 'mary', 'cherry mary']]
>>> [max(s, key=len) for s in lst1]
['daniel philips william', 'cherry mary']

To actually check whether one name's words are a subset of another's, though, split each name into a set of words and drop every name whose set is a strict subset of any other set in the same list:

>>> [[n for n in s if not any(set(n.split()) < set(m.split()) for m in s)] for s in lst1]
[['daniel philips william'], ['cherry mary']]
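The same idea can be written as a small function, which may be easier to read and test. This is only a sketch: the name drop_covered is made up for illustration, and it assumes that names are compared word by word after splitting on whitespace.

```python
def drop_covered(group):
    """Keep names whose word set is not a strict subset of another name's word set."""
    # Split each name once instead of re-splitting inside the nested loop.
    word_sets = [set(name.split()) for name in group]
    return [
        name
        for name, words in zip(group, word_sets)
        if not any(words < other for other in word_sets)
    ]


lst1 = [['daniel philips', 'philips daniel', 'daniel philips william'],
        ['cherry', 'mary', 'cherry mary']]
print([drop_covered(group) for group in lst1])
# [['daniel philips william'], ['cherry mary']]

# Reordered duplicates that no longer name covers are both kept,
# because neither word set is a strict subset of the other.
print(drop_covered(['daniel philips', 'philips daniel']))
# ['daniel philips', 'philips daniel']
```

Note that the strict comparison (<) is what keeps the longest name itself in the result: a set is never a strict subset of itself.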

CodePudding user response:

A fairly explicit solution would be something like this:

lst1 = [['daniel philips', 'philips daniel', 'daniel philips william'], ['cherry', 'mary', 'cherry mary']]

result = [[
    name for name in group
    if not any(
        all(part in other_name.split() for part in name.split())
        for other_name in group
        if other_name != name
    )
] for group in lst1]

print(result)

I also quite liked the solution that @DYZ posted and then removed:

result = [[max(zip(map(set, map(str.split, group)), group))[1]] for group in lst1]

In both cases the result is:

[['daniel philips william'], ['cherry mary']]

However, which one you prefer depends on what you need to happen in this case:

lst1 = [['john', 'john doe', 'mary white']]

If you need [['john doe', 'mary white']], my solution works. If you only want [['john doe']] and are happy to ignore the unrelated 'mary white', you might prefer @DYZ's.
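To make the difference concrete, here is a sketch that wraps both approaches in functions and runs them on that input. The function names explicit_filter and max_by_word_set are invented for this illustration; the bodies are the two expressions shown above.

```python
def explicit_filter(group):
    # Drop a name when all of its words appear in some other, different name.
    return [
        name for name in group
        if not any(
            all(part in other.split() for part in name.split())
            for other in group
            if other != name
        )
    ]


def max_by_word_set(group):
    # Pair each name with its word set and take the max. Sets compare by
    # subset, so an unrelated name never displaces the current maximum.
    return [max(zip(map(set, map(str.split, group)), group))[1]]


group = ['john', 'john doe', 'mary white']
print(explicit_filter(group))   # ['john doe', 'mary white']
print(max_by_word_set(group))   # ['john doe']
```

Two caveats follow from how these work. Because sets are only partially ordered, max_by_word_set depends on input order when names are unrelated: for ['mary white', 'john', 'john doe'] it returns ['mary white']. And explicit_filter removes both members of a reordered pair when nothing longer covers them, so ['daniel philips', 'philips daniel'] yields an empty list.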
