After doing some text processing, I've got a list of tokens and a list of sentence indices, one for each token. Now I'd like to reassemble the tokens into sentences. I've used NumPy, but I feel like there's a better/faster/more-NumPy-ish way to do this... without a for loop. There could be a lot more than two sentences in the future.
import numpy as np
all_tokens = np.array(['I', 'spent', 'a', 'lot', 'of', 'time', ',', 'money', ',', 'and', 'effort', 'childproofing', 'my', 'house', '.', 'However', ',', 'the', 'kids', 'still', 'get', 'in', '.'])
sent_ids = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
new_sents = []
for unique_sent_id in np.unique(sent_ids):
    sent_tokens = all_tokens[sent_ids == unique_sent_id].tolist()
    new_sents.append(' '.join(sent_tokens))
Result: ["I spent a lot of time , money , and effort childproofing my house .", "However , the kids still get in ."]
CodePudding user response:
Assuming sent_ids is ordered, you can find the positions where sent_id changes and then split the tokens at those points:
list(map(" ".join, np.split(all_tokens, np.flatnonzero(np.diff(sent_ids) != 0) + 1)))
# ['I spent a lot of time , money , and effort childproofing my house .', 'However , the kids still get in .']
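For clarity, here is what the pieces evaluate to on the example data (a quick breakdown, assuming the arrays defined above):
np.diff(sent_ids)                            # 1 only at the boundary between sentences
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
np.flatnonzero(np.diff(sent_ids) != 0)       # array([14]): last index of sentence 0
np.flatnonzero(np.diff(sent_ids) != 0) + 1   # array([15]): split point passed to np.split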
CodePudding user response:
Note that evaluating sent_ids == unique_sent_id in every iteration means your program has to scan the entire, potentially very long, sent_ids array once per unique sentence id.
Not sure if this is what you had in mind, but here is how I would do it so that both arrays are iterated exactly once, from left to right:
import numpy as np
all_tokens = np.array(
    ['I', 'spent', 'a', 'lot', 'of', 'time', ',', 'money', ',', 'and', 'effort', 'childproofing', 'my', 'house', '.',
     'However', ',', 'the', 'kids', 'still', 'get', 'in', '.'])
sent_ids = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
new_sents = []
current_sent_id = -1
for token, sent_id in zip(all_tokens, sent_ids):
    if sent_id != current_sent_id:
        # first token of a new sentence: start a fresh string
        new_sents.append(token)
        current_sent_id = sent_id
    else:
        # same sentence: append the token to the current string
        new_sents[-1] += " " + token
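For comparison, the same single left-to-right pass can also be written with itertools.groupby from the standard library (a sketch using the same example arrays, not the only way to do it):
from itertools import groupby

# Group consecutive (token, sent_id) pairs by sentence id and join each group.
new_sents = [
    " ".join(token for token, _ in group)
    for _, group in groupby(zip(all_tokens, sent_ids), key=lambda pair: pair[1])
]
Since groupby only merges consecutive pairs with equal keys, this also relies on sent_ids being ordered, just like the np.split answer above.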