Join multiple columns where one column is nullable in one table but not the other-CodePudding

I have a query that needs to join multiple columns, where one of the columns is now nullable in one of the tables.

    var personAchievement = await _context.Achievements
        .Join(_context.Persons, a => new { a.Region, a.PersonNumber }, p => new { p.Region, p.PersonNumber }, (achievement, person) => new { Achievement = achievement, Person = person })
        .Where(_ => _.Achievement.PersonNumber != null)

The Persons table uses Region (varchar in the db and string in my model) and PersonNumber (int in both db and model) as a compound unique id. Those two fields are used in a lot of tables that join to Person for various things. One table, Achievements, needs to make that PersonNumber field nullable in order to add other types of entities that will have Achievements. This query has been working for a long time exactly as above, until today when I make that one field a nullable int on the Achievements model.

One place it breaks is here in the anonymous constructor a => new { a.Region, a.PersonNumber }. The join fails saying the two anonymous types 'cannot be inferred from the usage' because it can't convert int to int? implicitly. And if I try to box or convert PersonNumber to a non-nullable int, or use .Value on it, the anonymous constructor fails since you can only use standard simple types to construct anonymous objects.

Is there a way to do this with EF Core 3.1? Maybe by tricking EF to convert both values to string somehow (which I've been able to do in the DB queries)?

CodePudding user response：

The common scenario I see where people encounter this problem is due to improper normalization. The requirement will be something like we want an achievement that could be associated to a Person or a Company as an example, so we either get something like:

[Achievement]
AchievementId (PK)
PersonId (FK, Nulable)
CompanyId (FK, Nullable)

[Achievement]
AchievementId (PK)
LinkId 
LinkType /* I.e. PERSON or COMPANY */

Both of these designs have potentially serious issues, especially as a system grows.

The first case can be managed using navigation properties in EF then querying achievements for a specific person:

var achievements = _context.Achievements
    .Include(a => a.Person)
    .Where(a => a.Person.PersonNumber == personNumber && a.Person.Region == region)
    .ToList();

From there you can access the Person using achievement.Person. Alternatively if you needed that anonymous type:

var data = _context.Achievements
    .Where(a => a.Person.PersonNumber == personNumber && a.Person.Region == region)
    .Select new {Achievement = a, Person = a.Person})
    .ToList();

EF should be able to work these relationships out without too much trouble, except there won't be anything in the schema or entities to ensure that the same Achievement couldn't be assigned to both a Person and a Company. Any restriction like that would need to be enforced in code.

The second option is fairly common, but honestly the worse of the two scenarios. This is where something like a LinkId or ParentId holds a value that might be a Person ID or a Company ID. While this does help ensure that an Achievement might only be associated with one or the other (and would solve your Type mismatch), the bigger issue is that you cannot set up relational FKs for that relationship, so nothing in the database actually guarantees that the LinkId actually does point to a record. If a LinkType happens to get tampered with due to a bug or other malicious bit of code/SQL you could see an Achievement pointed at a completely unrelated entity. Performance wise as well, without the FK this is going to start crawling as a system grows and all achievements end up in the same table.

The other consideration is guaranteeing that achievements will not have conditional values depending on whether they are associated with a Person or a Company or a ...(whatever might be added in the future) It makes sense for an achievement to be a single table where all achievements are equal in terms of the data they require. If an achievement associated to a Person is expected to have a certain set of fields populated, but those fields are not required for an achievement associated to a Company, then these entities/tables should be separated into something like PersonalAchievement and CorporateAchievement, even if 95% of the columns are identical.

The attraction of consolidating data into a single table comes from a time when storage space was expensive. But even then you were trading between different evils in design. Holding 100,000 rows in one table vs. 20,000 rows and 80,000 rows in two tables are effectively indistinguishable from a storage capacity, but considerably different from an indexing and constraint perspective. Using separate tables means you can enforce applicable constraints and use matching Keys. Consolidating tables is like polymorphism, it makes sense where multiple instances can be treated identically the same, not just similarly the same.