I am looking for the most computationally efficient way to filter out arrays from one array, based on segment overlap in a second array, and that array has a different segmentation type.
This is my first array
iob = np.array(
[0, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 2, 0, 0, 1, 2, 1, 2, 2, 0]
)
The number 1 is the start of each segment, the number 2 is the rest of the segment, and the number 0 indicates no segment. So for this array, the segments are 1, 2, 2, 2, 2, 2, 2, 2
, 1, 2, 2, 2, 2
, 1, 2
, and 1, 2, 2
.
This is my second array
output = np.array(
[0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
)
The segments are only defined by the number 1, and 0 indicates no segment. So the segments for this array are 1, 1, 1, 1
, 1, 1
, and 1, 1, 1
.
In the first array, I want filter out segments whose contents to not overlap by at least 50% by a segment in the 2nd array. In other words, if at least half of the contents in a segment in the first array overlaps with a segment in the 2nd array, I want to keep that segment.
So this is the desired result
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0])
I am looking for the most computationally efficient method to calculate this solution.
Current solution
I can get the segment indices using the technique described here https://stackoverflow.com/a/71297663/3259896
output_zero = np.concatenate((output, [0]))
zero_output = np.concatenate(([0], output))
iob_zero = np.concatenate((iob, [0]))
iob_starts = np.where(iob_zero == 1)[0]
iob_ends = np.where((iob_zero[:-1] != 0) & (iob_zero[1:] != 2))[0]
iob_pairs = np.column_stack((iob_starts, iob_ends))
output_starts = np.where((zero_output[:-1] == 0) & (zero_output[1:] == 1))[0]
output_ends = np.where((output_zero[:-1] == 1) & (output_zero[1:] == 0))[0]
output_pairs = np.column_stack((output_starts, output_ends))
Next, I directly compare all possible combination of segments to see which ones have at least a 50% overlap, and only keep those segments.
valid_pairs = []
for o_p in output_pairs:
for i_p in iob_pairs:
overlap = 1 min(o_p[1], i_p[1]) - max(o_p[0], i_p[0])
if overlap > np.diff(i_p)[0]/2:
valid_pairs.append(i_p)
valid_pairs = np.array(valid_pairs)
Finally, I used the filtered indices to create the desired array
final = np.zeros_like(output)
for block in valid_pairs:
final[block[0]:block[1] 1] = 1
final
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0])
I suspect that this may not be the most computationally efficient solution. It uses quite a few lines of code, and also used a nested for loop to do all possible comparisons, and uses another loop to create the desired array. I don't have a mastery on use of all of numpy's functions, and I am wondering if there is a more computationally efficient way to calculate this.
CodePudding user response:
Here's a 4-liner:
output = np.concatenate((output, [0]))
iob = np.concatenate((iob, [0]))
idx = np.dstack(np.where(iob == 1) tuple(np.array(np.where(np.diff(iob) < 0)) 1)).flatten()
desired_array = np.concatenate([np.full(len(x), 1 if ((np.sum(x==2) >= np.sum(x==1)) and x.sum()>0) else 0) for x in np.split(np.where(iob>0, 1, 0) output, idx)])[0:-1]
Output:
>>> desired_array
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0])
And here is a more verbose version
output_0 = np.concatenate((output, [0]))
iob_0 = np.concatenate((iob, [0]))
overlap = np.where(iob_0>0, 1, 0) output_0
iob_segment_starts = np.where(iob_0 == 1)
iob_segment_ends = tuple(np.array(np.where(np.diff(iob_0) < 0)) 1)
iob_segment_indxs = np.dstack(iob_segment_starts iob_segment_ends).flatten()
overlap_segments = np.split(overlap, iob_segment_indxs)
filtered_segment_params = (
(
len(x),
np.sum(x == 2) >= np.sum(x == 1) and x.sum() > 0,
)
for x in overlap_segments
)
filtered_segment_pieces = [
np.full(length, value, dtype=int)
for length, value in filtered_segment_params
]
filtered_array = np.concatenate(filtered_segment_pieces)[:-1]
filtered_array
Pytorch Version
import torch
output = torch.tensor(
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
)
iob = torch.tensor(
[0, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 2, 0, 0, 1, 2, 1, 2, 2, 1]
)
output_0 = torch.cat((output, torch.tensor([0])))
iob_0 = torch.cat((iob, torch.tensor([0])))
overlap = torch.where(iob_0>0, 1, 0) output_0
iob_segment_starts = torch.where(iob_0 == 1)[0]
iob_segment_ends = torch.where(torch.diff(iob_0) < 0)[0] 1
iob_segment_cuts = torch.cat([
iob_segment_starts,
iob_segment_ends,
torch.tensor([0, overlap.shape[0]])
]).sort()[0]
iob_segment_sizes = torch.diff(iob_segment_cuts)
overlap_segments = torch.split(overlap, iob_segment_sizes.tolist(), dim=0)
filtered_segment_params = (
(
len(x),
torch.sum(x == 2) >= torch.sum(x == 1) and x.sum() > 0,
)
for x in overlap_segments
)
filtered_segment_pieces = [
torch.full((1, length), value, dtype=int).flatten(0)
for length, value in filtered_segment_params
]
filtered_array = torch.cat(filtered_segment_pieces)[:-1]
filtered_array