I have a dataframe:
import pandas as pd
data = {'token_1': [['cat', 'run', 'today'], ['dog', 'eat', 'meat']],
        'token_2': [['cat', 'in', 'the', 'morning', 'cat', 'run', 'today',
                     'very', 'quick', 'cat', 'today', 'jump', 'and', 'run', 'run', 'cat', 'today'],
                    ['dog', 'eat', 'meat', 'chicken', 'from', 'bowl', 'dog', 'see', 'meat', 'eat']]}
df = pd.DataFrame(data)
To find the words from the token_1 column in the arrays of the token_2 column, I use this:
lst_index = [[i for i, x in enumerate(b) if x in a] for a, b in zip(df['token_1'], df['token_2'])]
print(lst_index)
This gives me every index at which any of the words occurs:
[[0, 4, 5, 6, 9, 10, 13, 14, 15, 16], [0, 1, 2, 6, 8, 9]]
But I need only the indexes where the words appear in the same order as in the token_1 array, so the result should be:
[[4,5,6], [0,1,2]]
CodePudding user response:
You can use a custom function to find the position of the first matching sublist (if any) in the other list:
def sublist(l, l_ref):
    # for each word in the list
    for pos, word in enumerate(l):
        # if we have enough words left to compare
        # and if it matches the first word of the reference
        if pos <= len(l) - len(l_ref) and word == l_ref[0]:
            # if all the next N words match (N being the length of the ref)
            if all(a == b for a, b in zip(l[pos:pos + len(l_ref)], l_ref)):
                return list(range(pos, pos + len(l_ref)))
    # no match found: the function implicitly returns None
[sublist(l2, l1) for l1, l2 in zip(df['token_1'], df['token_2'])]
output:
[[4, 5, 6], [0, 1, 2]]
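If some rows might contain no match at all, a variant that returns an empty list instead of None may be easier to work with. The following is only a sketch of the same idea, not part of the answer above: find_sublist is a hypothetical name, the third sample row is made up to show the no-match case, and it assumes each cell holds a plain Python list so that slices can be compared directly with ==.

```python
def find_sublist(tokens, ref):
    """Return the indexes of the first run in `tokens` equal to `ref`, or []."""
    n = len(ref)
    if n == 0:
        # an empty reference has nothing to locate
        return []
    # only start positions that leave room for n words are worth checking
    for pos in range(len(tokens) - n + 1):
        # list slices compare element by element
        if tokens[pos:pos + n] == ref:
            return list(range(pos, pos + n))
    return []

# plain lists standing in for the two dataframe columns (third row is invented)
token_1 = [['cat', 'run', 'today'], ['dog', 'eat', 'meat'], ['bird', 'fly']]
token_2 = [['cat', 'in', 'the', 'morning', 'cat', 'run', 'today'],
           ['dog', 'eat', 'meat', 'chicken'],
           ['dog', 'see', 'meat']]

result = [find_sublist(l2, l1) for l1, l2 in zip(token_1, token_2)]
print(result)  # [[4, 5, 6], [0, 1, 2], []]
```

The same list comprehension should work with zip(df['token_1'], df['token_2']) in place of the plain lists.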