Efficiently query nodes and edges using SQL-CodePudding

I have 2 SQL Server tables that store network information, the EF schemas are:

public partial class edge
{
    public long edge_id { get; set; }
    public string source { get; set; }
    public string target { get; set; }
    public Nullable<System.DateTime> edgedate { get; set; }
}

public partial class node
{
    public string node_id { get; set; }
    public string name { get; set; }
    public string address { get; set; }
}

I am passing edge and node specific filters from the UI to be built into SQL queries like this:

select * 
from [dbo].[Nodes] 
where name = 'John Doe' 
   or address = '123 Fake Street'  

select * 
from [dbo].[Edges] 
where edgedate >= '01/12/2020 00:00:00' 
  and edgedate <= '01/12/2021 23:59:59

However these queries have to account for the entire network, i.e. node filters have to be applied to edges and vice versa -

-- nodes example with edge filters applied
select * 
from [dbo].[Nodes] 
where name = 'John Doe' 
   or address = '123 Fake Street'  
   and node_id in (select source 
                   from EDGESTEMP 
                   where edgedate >= '01/12/2020 00:00:00' 
                     and edgedate <= '01/12/2021 23:59:59'
                   union 
                   select target 
                   from EDGESTEMP 
                   where edgedate >= '01/12/2020 00:00:00' 
                     and edgedate <= '01/12/2021 23:59:59')

This works fine on small scale networks, however if I am dealing with networks of say a million edges and 500k nodes the performance to run these queries takes a hit, because of the in statements when checking the other table in each instance.

I have added indexes on all the affiliated columns for the queries however need to know if there is a more efficient way of doing this?

Additional Info

Query plan - here

Clustered indexes are set on each primary key i.e. node_id and edge_id and non clustered set on the rest e.g. -

CREATE NONCLUSTERED INDEX [NonClusteredIndex-20211017-194859] ON [dbo].[NODES]
(
    [name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

CodePudding user response：

You can use exists for a more efficient query instead of that union.

Also, you shouldn't use ambiguous date literals. I was not sure if those dates were Jan 12 or Dec 1. Plus, for datetime range queries, you shouldn't use >= and <= but >= and <. You can see these adjusments in code:

select * 
from [dbo].[Nodes] n
where (name = 'John Doe' 
   or address = '123 Fake Street')  
   and exists (select * 
               from EDGESTEMP e
                   where (n.node_id = e.source or n.node_id = e.target)
                         and e.edgedate >= '20200112' 
                         and e.edgedate < '20210113');

BTW, I assume you already have indexes on source, target and edgedate. If not create them.

CodePudding user response：

Firstly, your query appears to have a logic error: there should be parenthesis around the or.

Secondly, UNION ALL is usually better than UNION, although it often doesn't matter in a semi-join such as IN or EXISTS

select n.* 
from [dbo].[Nodes] n
where (n.name = 'John Doe' 
       or n.address = '123 Fake Street')
 and node_id in (
    select source
    from EDGES 
    where edgedate >= '01/12/2020 00:00:00' 
      and edgedate <= '01/12/2021 23:59:59'
    union all
    select target 
    from EDGES
    where edgedate >= '01/12/2020 00:00:00' 
      and edgedate <= '01/12/2021 23:59:59'
);

Finally, for this query, you should probably have the following indexes

NODES (name) INCLUDE (node_id)

NODES (address) INCLUDE (node_id)

EDGES (edgedate) INCLUDE (source, target)

The or condition may still cause issues, as you may still get an index scan on NODES. If so, you may need to rewrite the query to force an index union instead.

select n.* 
from (
    select *
      from [dbo].[Nodes] n
      where n.name = 'John Doe'
    union
    select *
      from [dbo].[Nodes] n
      where n.address = '123 Fake Street'
) n
where node_id in (
    select source
    from EDGES 
    where edgedate >= '01/12/2020 00:00:00' 
      and edgedate <= '01/12/2021 23:59:59'
    union all
    select target 
    from EDGES
    where edgedate >= '01/12/2020 00:00:00' 
      and edgedate <= '01/12/2021 23:59:59'
);