Inner join with multiple conditions-CodePudding

I am trying to join the following tables in EF Core (SQL Server)

Table: Data

Id	Name	ParentId	ChildId
1	Dina	100	101
2	Lucy	100	`NULL`
3	Sam	100	102
4	Ben	200	201
5	Max	200	202
6	Lily	300	`NULL`

Table: Security

Id	UserId	ParentId	ChildId
1	xyz	100	`NULL`
2	xyz	200	`NULL`
3	abc	100	101
4	abc	200	202

Foreign keys are set up. In addition to the mentioned tables, there is a "Parents" and a "Childs" table.

Business logic

A user can either have access on parent level (ChildId = null) or on Child level (ChildId not null). Example

User xyz has access to all Data records except to Lily (Data.Id = 6).
User abc has only access to Data records with ChildIds 101 and 202, i.e.:
- Dina (Id = 1)
- Max (Id = 5)

T-SQL query

SELECT
    *
FROM
    Data AS d
    INNER JOIN Security AS s ON
        (
            ( s.ParentId = d.ParentId AND s.ChildId IS NULL ) 
            OR
            ( s.ChildId = d.ChildId AND s.ChildId IS NOT NULL )
        )
        AND
        s.UserId = 'xyz'

What's the best way to translate this into an EFCore lambda expression? The examples I found so far are all doing a simple join with one column only.

CodePudding user response：

I don't think using EF itself is an appropriate choice here, because EF still only works well with relationships between tables based on key-tuples (i.e.: foreign keys and primary keys), but your ParentId and ChildId columns are not keys, and Linq (even when used outside of EF) does not make it easy to perform any kind of JOIN besides tuple-equality.

I think a better overall solution would be to define a FUNCTION in your database that implements the Data-Security match logic, and then you can use that UDF from within EF (don't use a PROCEDURE: table-valued FUNCTIONS are much simpler and most important: they can be composed inside other queries, even in Linq EF queries, which you can't do with PROCEDURES).

I don't know if you need a function that does a single access check (in which case use a scalar FUNCTION) or if you want a table-valued function that returns the set of permitted UserId values for a given Data.Id - or the set of permitted Data.Id values for a given UserId.

Nevertheless, without using a FUNCTION, and keeping things entirely in EF, means compromising your query to fit within the limitations of Linq's expressiveness - and EF's ability to generate SQL from that Linq expression (remember that EF and Linq are not the same thing).

So I think the best way is to make a UNION of two inner-queries (aka derived-tables): the first half is the Security.ChildId IS NULL results, the second half is the Security.ChildId IS NOT NULL results.

In SQL this would be:

SELECT
    d.*
FROM
    dbo.Data AS d
    INNER JOIN Security s ON
        s.ParentId = d.ParentId
WHERE
    s.ChildId IS NULL
    AND
    s.UserId = @userId
    
UNION ALL

SELECT
    d.*
FROM
    dbo.Data AS d
    INNER JOIN Security s ON
        s.ParentId = d.ParentId
        AND
        s.ChildId = d.ChildId
WHERE
    s.ChildId IS NOT NULL
    AND
    s.UserId = @userId

Note how in the above query the INNER JOIN ON expression is now an exact tuple equality and the other criteria (like s.ChildId IS NOT NULL and s.UserId = @userId) is moved to the WHERE part. This shouldn't make a difference to the query's execution plan, but I think it makes the query conceptually easier to understand.

And so because the JOIN criteria are now expressed only as tuple-equalities, it can now be converted to a Linq query.

public static async Task< List<Data> > GetDataThatUserCanAccessAsync( MyDbContext db, String userId, CancellationToken ct = default )
{
    IQueryable<Data> firstHalf = db.Data
        .Join(
            inner: db.Security,
            innerKeySelector: s => new { s.ParentId },
            outerKeySelector: d => new { d.ParentId },
            resultSelector: ( d, s ) => new { d, s }
        )
        .Where( t => t.s.ChildId == null )
        .Where( t => t.s.UserId == userId )
        .Select( t => t.d );

    IQueryable<Data> secondHalf = db.Data
        .Join(
            inner: db.Security,
            innerKeySelector: s => new { s.ParentId, s.ChildId },
            outerKeySelector: d => new { d.ParentId, d.ChildId },
            resultSelector: ( d, s ) => new { d, s }
        )
        .Where( t => t.s.ChildId != null )
        .Where( t => t.s.UserId == userId )
        .Select( t => t.d );

    IQueryable<Data> query = firstHalf.Concat( secondHalf );

    List<Data> results = await query.ToListAsync( ct ).ConfigureAwait(false);
    return results;
}

When Linq is translated to SQL, there are some subtleties to be aware-of:

SQL's UNION = Linq's .Concat.
SQL's UNION ALL = Linq's .Union.

I prefer to explicit parameter labels whenever I use .Join in Linq because I keep on forgetting the exact order of the parameters, but it can be shorted down to just this:

(I strongly recommend you look at the generated SQL in both cases: it should be identical for both, but might also be different... and because this is a JOIN on non-key columns it's important that you ensure your database's INDEX objects for the ParentId and ChildId are correct and the STATISTICS objects are maintained (you are rebuilding them every week, right?), otherwise these queries (even in pure SQL without EF) will likely run very slowly if either table has more than a few hundred rows.

public static async Task< List<Data> > GetDataThatUserCanAccessAsync( MyDbContext db, String userId, CancellationToken ct = default )
{
    List<Data> results = await db.Data
        .Join( db.Security, s => new { s.ParentId }, d => new { d.ParentId }, ( d, s ) => new { d, s } )
        .Where( t => t.s.ChildId == null && t.s.UserId == userId )
        .Select( t => t.d )
        .Concat(
            db.Data
            .Join( db.Security, s => new { s.ParentId, s.ChildId }, d => new { d.ParentId, d.ChildId }, ( d, s ) => new { d, s } )
            .Where( t => t.s.ChildId != null && t.s.UserId == userId )
            .Select( t => t.d )
        )
        .ToListAsync( ct )
        .ConfigureAwait(false);

    return results;

I'm really not a fan of how Linq's .Concat and .Union methods are extensions on IQueryable<T> whereas in SQL they're top-level operators - which results in aesthetically unpleasing asymmetrical queries.

CodePudding user response：

I was writing something like @Dai's LINQ answer, but I used query syntax instead (which I prefer over fluent syntax for joins). Adapting the query portion of @Dai's solution, it would look like this (note my comments):

    var firstHalf = 
       from d in db.Data
       join s in db.Security
       on s.ParentId equals d.ParentId
       where s.ChildId == null && s.UserId == userId
       select d; // or new { d, s } <-- if you want to capture both data and security parts

    var secondHalf = 
        from d in db.Data
        join s in db.Security
        on new { d.ParentId, d.ChildId }
        equals new { s.Parentid, s.ChildId }
        where s.ChildId != null // <-- this where may not be necessary if d.ChildId  must be nonnull
        select d; // or new { d, s } if you want both Data and Security parts

    var query = firstHalf.Concat(secondHalf);

Lastly, I will strongly suggest checking the generated SQL and its execution plan. I've been surprised by both. If you do not check, you may be surprised by the complexity of the generated SQL and performance differences.