Find words in an array in a specific order in Dataframe Pandas

Time:06-29

I have dataframe:

import pandas as pd
data = {'token_1': [['cat', 'run','today'],['dog', 'eat', 'meat']],
        'token_2': [['cat', 'in', 'the' , 'morning','cat', 'run', 'today',
                      'very', 'quick', 'cat','today', 'jump', 'and', 'run', 'run', 'cat', 'today'],['dog', 'eat', 'meat', 'chicken', 'from', 'bowl','dog','see','meat','eat']]}


df = pd.DataFrame(data)

To find the words from the token_1 column in the token_2 column arrays, I use this:

lst_index = [[i for i, x in enumerate(b) if x in a] for a, b in zip(df['token_1'], df['token_2'])]
print(lst_index)

This gives me every index where any of the words occurs:

[[0, 4, 5, 6, 9, 10, 13, 14, 15, 16], [0, 1, 2, 6, 8, 9]]

But I need the indexes at which the words appear, preferably in the same order as in the token_1 array, so the result should be only:

[[4,5,6], [0,1,2]]

CodePudding user response:

You can use a custom function to find the position of the first matching sublist (if any) in the other list:

def sublist(l, l_ref):
    # for each word in the list
    for pos, word in enumerate(l):
        # if we have enough words left to compare
        # and if it matches the first word of the reference
        if pos <= len(l)-len(l_ref) and word == l_ref[0]:
            # if all the next N words match (N being the length of the ref)
            if all(a==b for a,b in zip(l[pos:pos + len(l_ref)], l_ref)):
                return list(range(pos, pos + len(l_ref)))

[sublist(l2, l1) for l1, l2 in zip(df['token_1'], df['token_2'])]

output:

[[4, 5, 6], [0, 1, 2]]
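
Note that the function above falls through and returns None when no match is found. If you would rather get an empty list in that case, here is a minimal sketch of a variant that compares slices directly. The name find_sublist and the third row of sample data are hypothetical, added only to show the no-match case; plain lists are used here, but the same comprehension should work with zip(df['token_1'], df['token_2']).

```python
def find_sublist(tokens, ref):
    """Return the indexes of the first contiguous occurrence of ref in tokens, or [] if there is none."""
    n = len(ref)
    if n == 0:
        return []
    # Only start positions that leave room for the whole reference list
    for pos in range(len(tokens) - n + 1):
        # Comparing a slice to the reference checks all n words at once
        if tokens[pos:pos + n] == ref:
            return list(range(pos, pos + n))
    return []

# Sample data; the third row is a made-up example with no match
token_1 = [['cat', 'run', 'today'], ['dog', 'eat', 'meat'], ['bird', 'fly']]
token_2 = [['cat', 'in', 'the', 'morning', 'cat', 'run', 'today'],
           ['dog', 'eat', 'meat', 'chicken'],
           ['dog', 'see', 'meat']]

result = [find_sublist(l2, l1) for l1, l2 in zip(token_1, token_2)]
print(result)
```

This only finds the first exact, contiguous match per row; it does not handle words that appear in order but with other words in between.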