After doing some text processing, I've got a list of tokens and a list of sentence indices, one for each token. Now I'd like to reassemble the tokens into sentences. I've used NumPy, but I feel like there's a better/faster/more-NumPy-ish way to do this... without a for loop. There could be a lot more than two sentences in the future.
import numpy as np
all_tokens = np.array(['I', 'spent', 'a', 'lot', 'of', 'time', ',', 'money', ',', 'and', 'effort', 'childproofing', 'my', 'house', '.', 'However', ',', 'the', 'kids', 'still', 'get', 'in', '.'])
sent_ids = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
new_sents = []
for unique_sent_id in np.unique(sent_ids):
    sent_tokens = all_tokens[sent_ids == unique_sent_id].tolist()
    new_sents.append(' '.join(sent_tokens))
Result: ["I spent a lot of time , money , and effort childproofing my house .", "However , the kids still get in ."]
CodePudding user response:
Assuming sent_ids is ordered, you can find the positions where sent_id changes and then split the tokens at those points:
list(map(" ".join, np.split(all_tokens, np.flatnonzero(np.diff(sent_ids) != 0) + 1)))
# ['I spent a lot of time , money , and effort childproofing my house .', 'However , the kids still get in .']
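For clarity, here is what the pieces evaluate to on the example data (a quick breakdown, assuming the arrays defined above):
np.diff(sent_ids)                            # 1 only at the boundary between sentences
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
np.flatnonzero(np.diff(sent_ids) != 0)       # array([14]): last index of sentence 0
np.flatnonzero(np.diff(sent_ids) != 0) + 1   # array([15]): split point passed to np.split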
CodePudding user response:
Note that evaluating sent_ids == unique_sent_id in every iteration means your program has to scan the entire, potentially very long, sent_ids array once per unique sentence id.
Not sure if this is what you had in mind, but here is how I would do it so that both arrays are iterated exactly once, from left to right:
import numpy as np
all_tokens = np.array(
    ['I', 'spent', 'a', 'lot', 'of', 'time', ',', 'money', ',', 'and', 'effort', 'childproofing', 'my', 'house', '.',
     'However', ',', 'the', 'kids', 'still', 'get', 'in', '.'])
sent_ids = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
new_sents = []
current_sent_id = -1
for token, sent_id in zip(all_tokens, sent_ids):
    if sent_id != current_sent_id:
        # first token of a new sentence: start a fresh string
        new_sents.append(token)
        current_sent_id = sent_id
    else:
        # same sentence: append the token to the current string
        new_sents[-1] += " " + token
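For comparison, the same single left-to-right pass can also be written with itertools.groupby from the standard library (a sketch using the same example arrays, not the only way to do it):
from itertools import groupby

# Group consecutive (token, sent_id) pairs by sentence id and join each group.
new_sents = [
    " ".join(token for token, _ in group)
    for _, group in groupby(zip(all_tokens, sent_ids), key=lambda pair: pair[1])
]
Since groupby only merges consecutive pairs with equal keys, this also relies on sent_ids being ordered, just like the np.split answer above.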