I have a ragged array represented as a contiguous block of memory along with its "shape", corresponding to the length of each row, and its "offsets", corresponding to the index of the first element in each row. To illustrate, I have an array conceptually like this:
[[0, 0, 0],
[1, 1, 1, 1],
[2],
[3, 3],
[4, 4]]
Represented in-memory as:
values: [0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4]
shape: [3, 4, 1, 2, 2]
offsets: [0, 3, 7, 8, 10]
I may have in the hundreds of millions of rows with typically, say, 3-20 four-byte floats per row, though with no hard upper bound on the row length.
I wish to shuffle the rows of this array randomly. Since the array is ragged, I can't see how the Fisher-Yates algorithm can be applied in a straightforward manner. I see how I can carry out a shuffle by randomly permuting the array shape, pre-allocating a new array, and then copying over rows according to the permutation generating the new shape with some book-keeping on the indexes. However, I do not necessarily have the RAM required to duplicate the array for the purposes of this operation.
Therefore, my question is whether there is a good way to perform this shuffle in-place, or using only limited extra memory? Run-time is also a concern, but shuffling is unlikely to be the main bottleneck.
For illustration purposes, I wrote a quick toy-version in Rust here, which attempts to implement the shuffle sketched above with allocation of a new array.
CodePudding user response:
shape
is redundant since shape[i]
is offset[i 1]-offset[i]
(if you extend offset by one element containing the length of the values
array). But since your data structure has both these fields, you could shuffle your array by just in-place shuffling the two descriptor vectors (in parallel), using F-Y. This would be slightly easier if shape
and offset
were combined into an array of pairs (offset, length)
, which also might improve locality of reference, but it's certainly not critical if you have some need for the separate arrays.
That doesn't preserve the contiguity of the rows in the values
list, but if all your array accesses are through offset
, it will not require any other code modification.
It is possible to implement an in-place swap of two variable-length subsequences using a variant of the classic rotate-with-three-reversals algorithm. Given P V Q
, a sequence conceptually divided into three variable length parts, we first reverse P
, V
, and Q
in-place independently producing PR VR QR
. Then we reverse the entire sequence in place, yielding Q V P
. (Afterwards, you'd need to fixup the offsets
array.)
That's linear time in the length of the span from P to Q, but as a shuffle algorithm it will add up to quadratic time, which is impractical for "hundreds of millions" of rows.
CodePudding user response:
Here is an approach that we can calculate in time O(n log(n))
. I don't have a proof, but believe that the maximum space is O(k)
where k
is the maximum block size.
First note, shape
and offsets
are largely redundant. Because shape[i] = offset[i 1] - offset[i]
for all i
but the last. So with O(1)
extra data (which we already have in values.len()
) we can make shape
redundant, then (ab)use it, however we want, and then calculate it at the end.
So let's start by picking a random permutation of 0..(shape.len()-1)
and placing it in shape
. This will be where each element will go, and can be found in time O(n)
using Fisher-Yates.
Our idea now is to use quicksort to actually get them to the right places.
First, our pivot. For O(n)
work in a single pass we can add up the lengths of all blocks which will come before the median block, and also find the length of said median block.
Now quicksort is dependent upon being able to swap things. But we can't do that directly (your whole problem). However the idea is that we'll partition from the middle out. And so the values
, shape
and offsets
arrays will have beginning and ending sections that we haven't gotten to, and a piece in the middle that we've rewritten. Where those sections meet we'll also need to have queues of blocks copied off of the left and right and not yet written back. And, of course, we'll need to have a record of where the boundaries are.
So the idea is this.
set up the data structures.
copy a few blocks in the middle to one of the queues - enough to have a place for the median block to go.
record where the median will go
while have not finished partitioning:
pick the larger queue (break ties by farthest from its end, then randomly)
if it is empty:
read a block into it
figure out where its next block needs to be written
copy blocks in its way to the appropriate queue
write the block out
Where writing the block out means writing its elements to the right place, setting its offset, and setting its shape to the still final location for that block.
This operation will partition around the median block. Recursively repeat to sort each side into blocks being in their final position.