I have 2 SQL Server tables that store network information, the EF schemas are:
public partial class edge
{
public long edge_id { get; set; }
public string source { get; set; }
public string target { get; set; }
public Nullable<System.DateTime> edgedate { get; set; }
}
public partial class node
{
public string node_id { get; set; }
public string name { get; set; }
public string address { get; set; }
}
I am passing edge and node specific filters from the UI to be built into SQL queries like this:
select *
from [dbo].[Nodes]
where name = 'John Doe'
or address = '123 Fake Street'
select *
from [dbo].[Edges]
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59
However these queries have to account for the entire network, i.e. node filters have to be applied to edges and vice versa -
-- nodes example with edge filters applied
select *
from [dbo].[Nodes]
where name = 'John Doe'
or address = '123 Fake Street'
and node_id in (select source
from EDGESTEMP
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
union
select target
from EDGESTEMP
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59')
This works fine on small scale networks, however if I am dealing with networks of say a million edges and 500k nodes the performance to run these queries takes a hit, because of the in statements when checking the other table in each instance.
I have added indexes on all the affiliated columns for the queries however need to know if there is a more efficient way of doing this?
Additional Info
Query plan - here
Clustered indexes are set on each primary key i.e. node_id and edge_id and non clustered set on the rest e.g. -
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20211017-194859] ON [dbo].[NODES]
(
[name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO
CodePudding user response:
You can use exists for a more efficient query instead of that union.
Also, you shouldn't use ambiguous date literals. I was not sure if those dates were Jan 12 or Dec 1. Plus, for datetime range queries, you shouldn't use >= and <= but >= and <. You can see these adjusments in code:
select *
from [dbo].[Nodes] n
where (name = 'John Doe'
or address = '123 Fake Street')
and exists (select *
from EDGESTEMP e
where (n.node_id = e.source or n.node_id = e.target)
and e.edgedate >= '20200112'
and e.edgedate < '20210113');
BTW, I assume you already have indexes on source, target and edgedate. If not create them.
CodePudding user response:
Firstly, your query appears to have a logic error: there should be parenthesis around the or
.
Secondly, UNION ALL
is usually better than UNION
, although it often doesn't matter in a semi-join such as IN
or EXISTS
select n.*
from [dbo].[Nodes] n
where (n.name = 'John Doe'
or n.address = '123 Fake Street')
and node_id in (
select source
from EDGES
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
union all
select target
from EDGES
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
);
Finally, for this query, you should probably have the following indexes
NODES (name) INCLUDE (node_id)
NODES (address) INCLUDE (node_id)
EDGES (edgedate) INCLUDE (source, target)
The or
condition may still cause issues, as you may still get an index scan on NODES
. If so, you may need to rewrite the query to force an index union instead.
select n.*
from (
select *
from [dbo].[Nodes] n
where n.name = 'John Doe'
union
select *
from [dbo].[Nodes] n
where n.address = '123 Fake Street'
) n
where node_id in (
select source
from EDGES
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
union all
select target
from EDGES
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
);