Home > Mobile >  Always Generate the Same Numbers to be Discarded Given a Percentage
Always Generate the Same Numbers to be Discarded Given a Percentage

Time:11-10

I'm working with a very long list of numbers, say 1.5 billion. I need a way to specify a percentage of numbers that I want to keep, and the rest discard. Now I know I can use a Random Number Generator to randomly decide if I should keep it or not, but the problem is that I need the numbers to keep/discard to always be the same. Meaning, if I run the program and it decides to discard indexes 2, 5, and 10, the next time I run the program, it must discard 2, 5, and 10 as well. This is very important.

I'm also facing an issue with memory. To generate a huge list of bools to determine which numbers are discarded and which are not (if we decided to go that way, for example), the profiler says the program uses around 15gb of memory, which is already too much considering I have yet another list of 1.5 billion numbers. Here's my code for that if that matters:

        static bool[] GenerateShouldAddList(int totalCombos, decimal percentToAdd)
        {
            Random RNG = new Random();
            bool[] bools = new bool[totalCombos];
            int percent = (int)(percentToAdd * 100);

            for (int i = 0; i < totalCombos; i  )
            {
                int randNum = RNG.Next(0, 101);
                bools[i] = randNum < percent;
            }

            return bools;
        }

So I'm thinking, to avoid making a huge list, is there a way to make a function that will take in the index number (say index 5364), the total numbers (1.5 billion) and the percentage that you want to keep, and then return to me whether I should add that specific index or not? And if I run each index one at a time through that function, I should only be left with the percentage of numbers I specified. And most importantly, this function should always return the same result for the same index (if the totalNumbers and the percentage don't change). I'm thinking this isn't possible, but I also have hope there's people on here that are much smarter than me. Any help is appreciated!

CodePudding user response:

Since you have a lot of items, I suggest working with enumerations, IEnumerable<T> instead of arrays (which can well be not fit memory). Another requirement - repeatable selection - can be solved with seed: we create new Random(seed) and have the same sequence again and again:

static IEnumerable<bool> GenerateShouldAddList(int totalCombos, decimal percentToAdd) {
  Random RNG = new Random(789); // Whatever seed you like here 

  double threshold = percentToAdd / 100.0;

  for (int i = 0; i < totalCombos;   i) 
    yield return RNG.NextDouble() < threshold;
}

If you insist on having an array, you can add .ToArray(), e.g.

using System.Linq;

...

bool[] array = GenerateShouldAddList(10000, 5.0m).ToArray();

But I really doubt if you should do this.

CodePudding user response:

It sounds like you are wanting a function that takes in an index and a percentage and always gives the same result of whether or not that index should be kept, but in a sufficiently random way. This can be achieved by using a hashing algorithm so that the input is always hashed to the same output but in a random way. I tested the below with 10,000 indexes and at 10% it kept 1006 at 50% it kept 4998 and at 90% it kept 9006, so the percentages kept are fairly close to what was requested while still being random.

using System.Security.Cryptography;

public static class ToKeepOrNotToKeep
{
    private static readonly MD5 _md5 = MD5.Create();

    public static bool AtIndex(int index, double percentToKeep)
    {
        var byteArray = BitConverter.GetBytes(index);
        var hash = _md5.ComputeHash(byteArray);
        //I know that the hash is 16 bytes, and here we are converting
        //only the first 8 bytes to a ulong, but it's still random and
        //should work just as well as if we used all 16 bytes for our
        //threshold test
        var number = BitConverter.ToUInt64(hash, 0);
        var threshold = ulong.MaxValue * percentToKeep;

        if (number <= threshold)
            return true;
        else
            return false;
    }
}

CodePudding user response:

You seem to want a function which takes an index and a percent value, and returns a boolean value, the result of which would be same whether you call is_kept(index=2, percent=50) before or after is_kept(index=50000, percent=50) and don't want to store anything.

For this, I'd generate a hash of the index, and then treat that as a number, divide by the maximum hash and compare with the percent. The would given random-like behaviour without having to allocate any memory for state or a long list of flags.

  • Related