Home > database >  How to generate a large shuffled set of unqiue numbers one by one?
How to generate a large shuffled set of unqiue numbers one by one?

Time:08-29

Let's say I want to generate a shuffled list of 1000 unique numbers (the numbers can be just a range from 1 to 1000). I can do that by expanding a range from 1 to 1000 to an actual list and then randomly shuffling elements. Language doesn't matter, but in python it might be implemented like this: random.shuffle(list(range(1000))).

But what if I need to generate, let's say, 10 billion numbers? Expanding the range to a list would require a lot of memory (about 74 GB of memory if a number is stored in 8 bytes). Shuffling would also require a lot of memory and time. But it's not required to generate all the numbers at a time, but instead it would be better to generate them one by one, saving the state, and not filling up the memory by storing all the numbers, but only one instead.

It's still required for numbers to be unique. Is there any algorithms for this purpose?

It would be also great if there was a way to fast restore the "state" of the generator at a given generation step. For example, if I have already generated 1 million numbers and the next number must be 10, it would be great if I could efficiently (without regenerating that million numbers again) restore the state at a step of 1 million and generate the next number - 10.

Are there such algorithms?

CodePudding user response:

10 billion numbers? Expanding the range to a list would require a lot of memory (about 74 GB of memory if a number is stored in 8 bytes)

One option is to track the numbers you've already returned using one bit each, in which case 10 billion numbers need 10 billion bits, so 1.25 GB. If you need a generator for a set of shuffled numbers that big, but will only end up retrieving e.g. 50% or 80%, the get-next-number generator could loop generating random indices in the bit array until it finds one not previously set. Generating lots of random numbers will be pretty slow though. If the generator must actually return all the 10 billion numbers eventually, then a loop and random probing will get exponentially less efficient, so you start having to loop for compromises.

One mitigation is to probe at a random bit, but if it's already set... scan forwards to find the next unset bit... that will compromise some of the shuffling randomness properties, which you may or may not care about, but it'll let you get much closer to 100% before the performance tanks. You could do something where when you cross a certain threshold - say 97%, or the first time you search 100 bits and find them all already set, you then create a smaller container with the numbers that haven't been returned yet (not just bits representing them), shuffle that and start returning the remaining elements from there.

It would be also great if there was a way to fast restore the "state" of the generator at a given generation step. For example, if I have already generated 1 million numbers and the next number must be 10, it would be great if I could efficiently (without regenerating that million numbers again) restore the state at a step of 1 million and generate the next number - 10.

If you were using the above approach with a big bit-set, I can't see a fast way to restore to an arbitrary point. If the restore point was never too far back, you could obviously keep a container of intervening returned numbers and clear them from the bitset, then restore the PRNG to the earlier state, but that gets slower and more memory-consuming as the number of intervening results increases.

CodePudding user response:

Are there such algorithms?

Yes. You can acchive exaclty what you are looking for using a invertible hash function.

Say you want to generate random non-repeating numbers between 0 and n: You need a seeded invertible hash function for m=log2(next_power_of_two(n)) bits and simply iterate through all numbers from 0 to n hash those and reject every number greater than n.

Here is a great write-up about it: https://andrew-helmer.github.io/permute/

There is one small caveat: The range must be sufficiently big to be able to write good hash functions for it.

It would be also great if there was a way to fast restore the "state" of the generator at a given generation step.

This is possible by just storing your last iteration index and the seed.

CodePudding user response:

Use an encryption with a large enough block size so your upper limit will fit into a single block. AES uses 128 bit blocks while DES uses 64 bit blocks. Other block sizes are possible.

Encryption is a one-to-one mapping, so given unique inputs you are guaranteed unique outputs, provided you keep the same key. Just encrypt the numbers 0, 1, 2, 3... keeping track of where you have got to, so you don't repeat a number.

Depending on your upper limit, you may need to use some Format Preserving Encryption technique if the initial encryption gives a number that is out of range. For numbers, cycle walking should be enough to get a number within range.

  • Related