I have a data frame with two columns: sentence, containing text, and selector, containing lists of tuples of varying lengths.
Consider the following data frame as an example:
import pandas as pd

df = pd.DataFrame({'sentence': ['KEEP some of the words from this sentence.',
                                'Keep SOME of THE words from this sentence.',
                                'KEEP some OF the WORDS from this sentence.',
                                'Keep SOME of THE words FROM this SENTENCE.'],
                   'selector': [[(10, 0, 1)],
                                [(10, 1, 2), (10, 3, 4)],
                                [(10, 0, 1), (10, 2, 3), (10, 4, 5)],
                                [(10, 1, 2), (10, 3, 4), (10, 5, 6), (10, 7, 8)]]})
I now want to select the words from sentence at the positions indicated by the second element of each tuple (ignoring the 10 in each tuple). E.g., for the first row, I want the token in column sentence at the position given by the second element of each tuple (there is only one tuple: (10, 0, 1)), i.e. the token at position 0: KEEP. (For clarity, I have spelled all words to be selected in ALL CAPS.)
I would like to get a dataframe looking like this:
sentence selector selected_tokens
KEEP some of the words from this sentence. [(10, 0, 1)], ['KEEP']
Keep SOME of THE words from this sentence. [(10, 1, 2), (10, 3, 4)], ['SOME', 'THE']
KEEP some OF the WORDS from this sentence. [(10, 0, 1), (10, 2, 3), (10, 4, 5)], ['KEEP', 'OF', 'WORDS']
Keep SOME of THE words FROM this SENTENCE. [(10, 1, 2), (10, 3, 4), (10, 5, 6), (10, 7, 8)], ['SOME', 'THE', 'FROM', 'SENTENCE']
Accessing the first token works well, using
df['tok0_pos'] = df['selector'].str[0].str[1]
for the positions and
df['words0'] = [txt.split()[loc] for txt, loc in zip(df['sentence'], df['tok0_pos'])]
for the tokens.
However, because the number of tuples varies (the real data set contains 0-25 tuples per row in the column selector), this approach quickly crashes or becomes tedious.
Can someone point out how best to obtain the column selected_tokens in the sample dataset?
CodePudding user response:
One solution:
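# For each row, split the sentence into words and pick the word at the
# position given by the second element of every tuple in the selector.
df["selected_tokens"] = [[sent[s] for _, s, _ in select]
                         for sent, select in zip(df["sentence"].str.split(), df["selector"])]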
print(df["selected_tokens"])
Output
0 [KEEP]
1 [SOME, THE]
2 [KEEP, OF, WORDS]
3 [SOME, THE, FROM, SENTENCE.]
Name: selected_tokens, dtype: object
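Note that the last token comes out as SENTENCE. (with the trailing period), because the sentence is split on whitespace only, whereas the desired output in the question shows SENTENCE. A minimal sketch of one way to drop surrounding punctuation, assuming that stripping the characters in string.punctuation from both ends of each token is acceptable (the helper name select_tokens is made up for illustration):

```python
import string

import pandas as pd

# One-row sample taken from the question's data frame.
df = pd.DataFrame({
    "sentence": ["Keep SOME of THE words FROM this SENTENCE."],
    "selector": [[(10, 1, 2), (10, 3, 4), (10, 5, 6), (10, 7, 8)]],
})


def select_tokens(sentence, selector):
    """Hypothetical helper: pick words by the second tuple element."""
    words = sentence.split()
    # Assumption: leading/trailing punctuation is not part of the token.
    return [words[pos].strip(string.punctuation) for _, pos, _ in selector]


df["selected_tokens"] = [
    select_tokens(sent, sel) for sent, sel in zip(df["sentence"], df["selector"])
]
print(df["selected_tokens"].tolist())
# [['SOME', 'THE', 'FROM', 'SENTENCE']]
```

Punctuation inside a token (e.g. an apostrophe in "don't") is left untouched by strip, which only trims the ends.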
An alternative solution is to use NumPy and take advantage of its advanced indexing:
import numpy as np
sentences = df["sentence"].str.split().apply(np.array)
indices = [[s[1] for s in select] for select in df["selector"]]
df["selected_tokens"] = [sentence[i] for sentence, i in zip(sentences, indices)]
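With the NumPy variant, each cell of selected_tokens holds a NumPy array rather than a list. A self-contained sketch of the same idea that converts the result back to plain lists and also covers rows with an empty selector (the question mentions 0-25 tuples per row); the helper name take_tokens and the second sample row are assumptions added for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sentence": ["KEEP some OF the WORDS from this sentence.",
                 "No tokens selected here."],
    "selector": [[(10, 0, 1), (10, 2, 3), (10, 4, 5)],
                 []],  # empty selector: nothing should be selected
})


def take_tokens(sentence, selector):
    """Hypothetical helper: index the word array with an integer array."""
    words = np.array(sentence.split())
    # An explicit integer dtype keeps the index array valid even when it is empty.
    positions = np.array([pos for _, pos, _ in selector], dtype=int)
    return words[positions].tolist()


df["selected_tokens"] = [
    take_tokens(sent, sel) for sent, sel in zip(df["sentence"], df["selector"])
]
print(df["selected_tokens"].tolist())
# [['KEEP', 'OF', 'WORDS'], []]
```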
CodePudding user response:
This should work:
def get_word_at_index(sentence, index):
    # index is the list of tuples; i[1] is the word position in the sentence
    return [sentence.split()[i[1]] for i in index]
df.loc[:, 'selected_tokens'] = df.apply(lambda x: get_word_at_index(x["sentence"], x["selector"]), axis=1)
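If the real data can contain positions that lie beyond the end of a sentence, any of the approaches above would raise an IndexError. A possible sketch that silently skips such positions; whether skipping is the right behaviour is an assumption, and the helper name get_words_safely and the sample rows are made up for illustration:

```python
import pandas as pd


def get_words_safely(sentence, selector):
    """Hypothetical helper: like get_word_at_index, but ignores bad positions."""
    words = sentence.split()
    # Assumption: positions outside the sentence are dropped rather than reported.
    return [words[pos] for _, pos, _ in selector if 0 <= pos < len(words)]


df = pd.DataFrame({
    "sentence": ["KEEP some of the words.", "Short one."],
    "selector": [[(10, 0, 1)],
                 [(10, 1, 2), (10, 7, 8)]],  # position 7 does not exist
})

df["selected_tokens"] = [
    get_words_safely(sent, sel) for sent, sel in zip(df["sentence"], df["selector"])
]
print(df["selected_tokens"].tolist())
# [['KEEP'], ['one.']]
```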