I have implemented a function Generating numbers in [0,x] by given string :
import org.apache.commons.codec.digest.DigestUtils;
public class BaseUtil {
public static int getIntValue(String stringText, String x) {
var modNum = new BigInteger(x);
String sha256hex = DigestUtils.sha256Hex(stringText);
var result = new BigInteger(sha256hex, 16);
int intValue = result.mod(modNum).intValue();
return intValue;
}
}
CodePudding user response:
Does the method Generating int number in
[0,x]
by given string follow uniform distribution?
Almost. It is difficult to entirely eliminate non-uniformity ... unless 2256 happens to be evenly divisible by x
.
Note that you are actually generating a number in [0,x)
... since the result cannot be x
.
You also asked about more efficient implementations than the one in your question.
public class BaseUtil {
public static int getIntValueV1(String stringText, int x) {
if (x <= 0) {
throw new InvalidArgumentException("x must be strictly positive");
}
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest(
stringText.getBytes(StandardCharsets.UTF_8));
return new BigInteger(hash).mod(new BigInteger(x)).intValue()
}
public static int getIntValueV2(String stringText, int x) {
if (x <= 0) {
throw new InvalidArgumentException("x must be strictly positive");
}
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest(
stringText.getBytes(StandardCharsets.UTF_8));
ByteBuffer buff = new ByteBuffer(hash);
long number = 0;
for (int i = 0; i < 8; i ) {
number = number ^ buff.getLong();
}
return (int)(number % x);
}
}
Preconditions:
Since the result of your method is an
int
, that implies thatx
must also be anint
.Since you want numbers in the range
[0,x)
, that implies thatx
must be greater than zero.
Implementations:
I am using the standard MessageDigest
class since it has been in Java SE since Java 5 (if not earlier).
The first version uses BigInteger
to minimize the non-uniformity when we reduce the bytes into a number in the range [0, x)
The second version uses long
arithmetic to compute the remainder. I think that means that the distribution might be a bit more non-uniform than the first version. (My math skills are too rusty ...) It would also be possible to eliminate the use of ByteBuffer
to convert the bytes to a sequence of longs.
I have not benchmarked these versions to see which is faster. But both should be faster than producing an intermediate hexadecimal string and parsing it.
Note that you could probably get away with using a less expensive hashing algorithm, depending on the actual use-case for this generator.