I'm trying to import data from a CSV file, but unfortunately there is no primary key that would allow me to uniquely identify a given row. So I created a dictionary whose key is the value returned by GetHashCode. I use a dictionary because a key lookup is much faster than searching with LINQ's Where and conditions on several properties.
My GetHashCode override looks like this:
public override int GetHashCode()
{
    unchecked
    {
        int hash = 17;
        hash = hash * 23 + this.Id.GetHashCode();
        hash = hash * 23 + (this.Author?.GetHashCode() ?? 0);
        hash = hash * 23 + (this.Activity?.GetHashCode() ?? 0);
        hash = hash * 23 + (this.DateTime?.GetHashCode() ?? 0);
        return hash;
    }
}
After fetching data from DB I do:
.ToDictionary(d => d.GetHashCode());
And here comes the problem: I checked the database and there are no duplicates across these four properties. But when running the import I often get an error that the given key already exists in the dictionary, yet if I rerun the import for the same data, the next time everything runs fine.
How can I fix this error? The import application is written in .NET 5.
Id - long
Author, Activity - string
DateTime - DateTime?
Unfortunately, this Id is more like a foreign key and is not unique; there may be many rows with the same Id, Author, and Activity but, e.g., a different DateTime.
CodePudding user response:
GetHashCode() does NOT produce unique values, so using it as a key in a dictionary can give you the errors that you have observed.
You should implement GetHashCode() AND IEquatable<T> for your key type. Then you will be able to safely put instances of it into a hashing container, so long as there are no duplicate entries. (Items x and y will only be considered duplicates if the GetHashCode() values are the same AND x.Equals(y) returns true.)
So for example, your data key class could look like this:
public sealed class DataKey : IEquatable<DataKey>
{
    public long Id { get; }
    public string? Author { get; }
    public string? Activity { get; }
    public DateTime? DateTime { get; }

    public DataKey(long id, string? author, string? activity, DateTime? dateTime)
    {
        Id = id;
        Author = author;
        Activity = activity;
        DateTime = dateTime;
    }

    public bool Equals(DataKey? other)
    {
        if (other is null)
            return false;
        if (ReferenceEquals(this, other))
            return true;
        return Id == other.Id && Author == other.Author && Activity == other.Activity && Nullable.Equals(DateTime, other.DateTime);
    }

    public override bool Equals(object? obj)
    {
        return ReferenceEquals(this, obj) || (obj is DataKey other && Equals(other));
    }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = Id.GetHashCode();
            hashCode = (hashCode * 397) ^ (Author?.GetHashCode() ?? 0);
            hashCode = (hashCode * 397) ^ (Activity?.GetHashCode() ?? 0);
            hashCode = (hashCode * 397) ^ (DateTime?.GetHashCode() ?? 0);
            return hashCode;
        }
    }
}
That's a lot of boilerplate code. Fortunately, if you are using a fairly recent version of C#/.NET, you can use the record type to simplify this to just:
public sealed record DataKey(
    long Id,
    string? Author,
    string? Activity,
    DateTime? DateTime);
The record type implements IEquatable<T> and GetHashCode() correctly for you (for the specific types long, string? and DateTime?).
Note that both the example types above are immutable. It's very important when using hashing containers that the properties of a key that contribute to GetHashCode() and Equals() are immutable. If you put an item in a hashing container and then change any of those properties, nasty things happen.
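For illustration, here is a minimal sketch of building the dictionary with such a key (the Row type and the rows sequence are hypothetical stand-ins for your actual data):
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type standing in for the entity loaded from the DB.
public sealed record Row(long Id, string? Author, string? Activity, DateTime? DateTime);

public static class Example
{
    // Key by value equality on all four properties instead of the raw
    // GetHashCode() integer, so hash collisions no longer cause failures.
    public static Dictionary<DataKey, Row> Index(IEnumerable<Row> rows) =>
        rows.ToDictionary(r => new DataKey(r.Id, r.Author, r.Activity, r.DateTime));
}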
CodePudding user response:
A hash by definition contains less information than the original, so collisions are possible. Using a raw hash as a dictionary key guarantees you will eventually hit duplicate-key errors even for distinct rows.
From the comments, it appears the real need is a composite key. You can use any type that uses value equality for this. Two options are ValueTuples and records, e.g.:
.ToDictionary(d=>(d.Id,d.Author,d.Activity,d.DateTime));
A possible problem is that ValueTuples are mutable. You can use a record or record struct to create a predefined key type that uses value equality:
public record ActivityKey(
    long Id,
    string Author,
    string Activity,
    DateTime? DateTime);
...
.ToDictionary(d=>new ActivityKey(d.Id,d.Author,d.Activity,d.DateTime));
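To see why this works: record instances with the same field values compare equal and produce the same hash code. A quick sketch with made-up values:
var a = new ActivityKey(1, "alice", "login", new DateTime(2022, 1, 1));
var b = new ActivityKey(1, "alice", "login", new DateTime(2022, 1, 1));

Console.WriteLine(a == b);                             // True: records use value equality
Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // True: equal values hash equally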
CodePudding user response:
You lose information when hashing.
You go from having multiple properties of various types (strings, DateTime, numbers, etc.) and reduce them to a single integer. It's very possible for the hash function to return the same result for two different sets of property values.
GetHashCode is not intended to produce a unique key. Instead, it might be a better idea to actually generate a unique key for each line (using something like Guid). Or, perhaps, use the Id property that you seem to already have?
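As a minimal sketch of the Guid approach (csvLines is a hypothetical IEnumerable<string> of raw CSV lines):
using System;
using System.Collections.Generic;
using System.Linq;

// Give every imported line its own surrogate key. Guid.NewGuid() is
// unique for all practical purposes, so ToDictionary cannot collide.
Dictionary<Guid, string> byGuid = csvLines.ToDictionary(_ => Guid.NewGuid());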
CodePudding user response:
It seems that you may be using a different sort of hashing scheme than you need to.
If you are hashing to represent your row data as a unique value, you'll probably want something longer than an int.
Your GetHashCode() implementation looks good. However, it is for use in hash tables, not as a representative ID hash, which is probably what you want.
Try something like:
public class Record {
    public long ID;
    public string Author;
    public string Activity;
    public DateTime? DateTime;

    // Concatenates the four fields and hashes them with MD5, giving a
    // 128-bit digest that is far less collision-prone than a 32-bit int.
    public string GetRowHash() {
        var builder = new System.Text.StringBuilder();
        // A separator prevents different field splits (e.g. "ab"+"c" vs
        // "a"+"bc") from producing the same concatenated string.
        builder.Append(this.ID.ToString()).Append('|');
        builder.Append(this.Author ?? "").Append('|');
        builder.Append(this.Activity ?? "").Append('|');
        builder.Append(this.DateTime?.ToString() ?? "");

        using (var md5 = System.Security.Cryptography.MD5.Create()) {
            // UTF-8 rather than ASCII, so non-ASCII author/activity text
            // isn't mangled into '?' and doesn't cause spurious matches.
            byte[] buffer = System.Text.Encoding.UTF8.GetBytes(builder.ToString());
            byte[] hash = md5.ComputeHash(buffer);
            return Convert.ToBase64String(hash);
        }
    }
}
Then use GetRowHash() as your ID. If you get duplicates, it will be because the row information is duplicated, not because you've overrun an int hash. You may need to change Convert.ToBase64String(...) to something else depending on how you're storing these values in the database.
Incidentally, if you only have 4 data fields, it is far easier (and faster) to compare values in the database (using SQL) than in code. One good query to hunt out duplicates will work much more efficiently. You may even find loading the CSV directly into a table to be a good option, if your data is reasonably clean.
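If you'd rather do the duplicate check in code instead of SQL, here is a sketch using LINQ over the loaded rows (the rows variable is hypothetical):
using System.Linq;

// Group on the composite key and keep groups with more than one member;
// each such group is a set of rows duplicated across all four fields.
var duplicates = rows
    .GroupBy(r => (r.ID, r.Author, r.Activity, r.DateTime))
    .Where(g => g.Count() > 1)
    .ToList();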