Home > Back-end >  Make GetHashCode method behave the same for strings for different processes
Make GetHashCode method behave the same for strings for different processes

Time:03-17

If I run this:

Console.WriteLine("Foo".GetHashCode());
Console.WriteLine("Foo".GetHashCode());

it will print the same number twice but if I run the program again it will print a different number.

According to Microsoft and other places on the internet we cannot rely on GetHashCode function to return the same value. But if I plan on using it on strings only how can I make use of it and expect to always return the same value for the same string? I love how fast it is. It will be great if I could get the source code of it and use it on my application.

  • Reason why I need it (you may skip this part)

    I have a lot of complex objects that I need to serialize and send them between inter process communication. As you know BinaryFormatter is now obsolete so I then tried System.Text.Json to serialize my objects. That was very fast but because I have a lot of complex objects deserialization did not work well because I am making heavy use of polymorphism. Then I tried Newtonsoft (json.net) and that work great with this example: https://stackoverflow.com/a/71398251/637142. But it was very slow. I then decided I will use the best option and that is ProtoBuffers. So I was using protobuf-net and that worked great but the problem is that I have some objects that are very complex and it was a pain to place thousands of attributes. For example I have a base class that was being used by 70 other classes I had to place an attribute of inheritance for every single one it was not practical. So lastly I decided to implement my own algorithm it was not that complicated. I just have to traverse the properties of each object and if one property was not a value type then traverse them again recursively. But in order for this custom serialization that I build to be fast I needed to store all reflection objects in memory. So I have a dictionary with the types and propertyInfos. So the first time I serialize it will be slow but then it is even faster than ProtoBuf! So yes this approach is fast but every process must have the same exact object otherwise it will not work. Another tradeoff is that it's size is larger than protobuf because every time I serialize a property I include the full name of that property before. As a result I want to hash the full name of the property into an integer (4 bytes) and the GetHashCode() function does exactly that!

A lot of people may suggest that I should use MD5 or a different alternative but take a look at the performance difference:

// generate 1 million random GUIDS
List<string> randomGuids = new List<string>();
for (int i = 0; i < 1_000_000; i  )
    randomGuids.Add(Guid.NewGuid().ToString());

// needed to measure time
var sw = new Stopwatch();
sw.Start();


// using md5 (takes aprox 260 ms)
using (var md5 = MD5.Create())
{
    sw.Restart();
    foreach (var guid in randomGuids)
    {
        byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(guid);
        byte[] hashBytes = md5.ComputeHash(inputBytes);
        // make use of hashBytes to make sure code is compiled
        if (hashBytes.Length == 44)
            throw new Exception();
    }
    var elapsed = sw.Elapsed.TotalMilliseconds;
    Console.WriteLine($"md5: {elapsed}");
}

// using .net framework 4.7 source code (takes aprox 65 ms)
{
    [System.Security.SecuritySafeCritical]  // auto-generated
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
    static int GetHashCodeDotNetFramework4_7(string str)
    {

#if FEATURE_RANDOMIZED_STRING_HASHING
if(HashHelpers.s_UseRandomizedStringHashing)
{
    return InternalMarvin32HashString(this, this.Length, 0);
}
#endif // FEATURE_RANDOMIZED_STRING_HASHING

        unsafe
        {
            fixed (char* src = str)
            {

#if WIN32
        int hash1 = (5381<<16)   5381;
#else
                int hash1 = 5381;
#endif
                int hash2 = hash1;

#if WIN32
        // 32 bit machines.
        int* pint = (int *)src;
        int len = this.Length;
        while (len > 2)
        {
            hash1 = ((hash1 << 5)   hash1   (hash1 >> 27)) ^ pint[0];
            hash2 = ((hash2 << 5)   hash2   (hash2 >> 27)) ^ pint[1];
            pint  = 2;
            len  -= 4;
        }
 
        if (len > 0)
        {
            hash1 = ((hash1 << 5)   hash1   (hash1 >> 27)) ^ pint[0];
        }
#else
                int c;
                char* s = src;
                while ((c = s[0]) != 0)
                {
                    hash1 = ((hash1 << 5)   hash1) ^ c;
                    c = s[1];
                    if (c == 0)
                        break;
                    hash2 = ((hash2 << 5)   hash2) ^ c;
                    s  = 2;
                }
#endif
#if DEBUG
                // We want to ensure we can change our hash function daily.
                // This is perfectly fine as long as you don't persist the
                // value from GetHashCode to disk or count on String A 
                // hashing before string B.  Those are bugs in your code.
                hash1 ^= -484733382;
#endif
                return hash1   (hash2 * 1566083941);
            }
        }
    }

    sw.Restart();
    foreach (var guid in randomGuids)
        if (GetHashCodeDotNetFramework4_7(guid) == 1234567)
            throw new Exception("this will probably never happen");

    var elapsed = sw.Elapsed.TotalMilliseconds;
    Console.WriteLine($".NetFramework4.7SourceCode: {elapsed}");
}

// using .net 6 built in GetHashCode function (takes aprox: 22 ms)
{
    sw.Restart();
    foreach (var guid in randomGuids)
        if (guid.GetHashCode() == 1234567)
            throw new Exception("this will probably never happen");

    var elapsed = sw.Elapsed.TotalMilliseconds;
    Console.WriteLine($".net6: {elapsed}");
}

Running this in release mode these where my results:

md5: 254.7139
.NetFramework4.7SourceCode: 74.2588
.net6: 23.274

I got the source code from .NET Framework 4.8 from this link: https://referencesource.microsoft.com/#mscorlib/system/string.cs,8281103e6f23cb5c

Anyways searching on the internet I have found this helpful article: https://andrewlock.net/why-is-string-gethashcode-different-each-time-i-run-my-program-in-net-core/

and I have done exactly what it tells you to do and I have added:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <runtime>
        <UseRandomizedStringHashAlgorithm enabled="1" />
    </runtime>
</configuration>


to my app.config file and still I get different values for "foo".GetHashCode() every time I run my application.

How can I make the GetHashcode() method return always the same value for the string "foo" in .net 6?


Edit

I will just use the solution of .net framework 4.8 source code that took 73ms to execute and move on. I was just curios to understand why was the build in hashcode so much faster.

At least I understand now why the hash is different every time. By looking at the source code of .net 6 the reason why it has a different hash every time is because of this:


namespace System
{
    internal static partial class Marvin
    {

        ... .net source code
        ....

        public static ulong DefaultSeed { get; } = GenerateSeed();

        private static unsafe ulong GenerateSeed()
        {
            ulong seed;
            Interop.GetRandomBytes((byte*)&seed, sizeof(ulong));
            return seed;
        }
    }
}

As a result I have tried this just for fun and still did not work:

    var ass = typeof(string).Assembly;
    var marvin = ass.GetType("System.Marvin");
    var defaultSeed = marvin.GetProperty("DefaultSeed");
    var value = defaultSeed.GetValue(null); // returns 3644491462759144438

    var field = marvin.GetField("<DefaultSeed>k__BackingField", BindingFlags.NonPublic | BindingFlags.Static);
    ulong v = 3644491462759144438;
    field.SetValue(null, v);

but on the last line I get the exception: System.FieldAccessException: 'Cannot set initonly static field '<DefaultSeed>k__BackingField' after type 'System.Marvin' is initialized.'

But still even if this worked it would be very unsfafe. I rader have something execute 3 times slower and move on.

CodePudding user response:

Why not to use the implementation suggested on the article you shared?

I'm copying it for reference:

static int GetDeterministicHashCode(this string str)
{
    unchecked
    {
        int hash1 = (5381 << 16)   5381;
        int hash2 = hash1;

        for (int i = 0; i < str.Length; i  = 2)
        {
            hash1 = ((hash1 << 5)   hash1) ^ str[i];
            if (i == str.Length - 1)
                break;
            hash2 = ((hash2 << 5)   hash2) ^ str[i   1];
        }

        return hash1   (hash2 * 1566083941);
    }
}
  •  Tags:  
  • c#
  • Related