Home > Mobile >  Where is the tradeoff between a join with logic (ranges) versus a Join with an intermediate table?
Where is the tradeoff between a join with logic (ranges) versus a Join with an intermediate table?

Time:01-25

Say we have two tables. One table is a list of ranges, the other is a list of things that fall between those ranges. We can join based on logic (BETWEEN) or we can generate a table of possible joins, and join on that. The question is, at what point does it become more expensive to do one versus the other. Are there any blogs or documentation that can help me figure this out?

In the examples below the data is minimal, but in the actual problem I'm trying to work through each table has millions of records and the intermediate table would have an order of magnitude more. The goal is to understand the relative computational cost of these two options.

Example Logic Based Join:

--Table of Ranges
CREATE TABLE DateRanges (
  ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
  StartDate DATE,
  EndDate DATE
)

INSERT INTO DateRanges (StartDate, EndDate) VALUES ('1/1/2023', '1/5/2023')
INSERT INTO DateRanges (StartDate, EndDate) VALUES ('1/3/2023', '1/7/2023') --Overlaps The first
INSERT INTO DateRanges (StartDate, EndDate) VALUES ('2/1/2023', '2/3/2023')

--Table of Things to Join
CREATE TABLE ThingsToJoin (
  ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
  ThingDate DATE
)

INSERT INTO ThingsToJoin (ThingDate) VALUES ('1/4/2023')
INSERT INTO ThingsToJoin (ThingDate) VALUES ('2/2/2023')

GO
    
-- Logic Based Join
SELECT
  D.*,
  T.*
FROM
  DateRanges D
INNER JOIN
  ThingsToJoin T
ON
  T.ThingDate BETWEEN D.StartDate AND D.EndDate

Example Intermediate Table Based Join:

CREATE TABLE IntermediateDates (
  DateRangeID INT NOT NULL FOREIGN KEY REFERENCES DateRanges(ID),
  DateRangeDate DATE NOT NULL
)
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (1, '1/1/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (1, '1/2/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (1, '1/3/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (1, '1/4/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (1, '1/5/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (2, '1/3/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (2, '1/4/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (2, '1/5/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (2, '1/6/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (2, '1/7/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (3, '2/1/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (3, '2/2/2023')
INSERT INTO IntermediateDates (DateRangeID, DateRangeDate) VALUES (3, '2/3/2023')

SELECT
  D.*,
  T.*
FROM
  DateRanges D
INNER JOIN
  IntermediateDates I
ON
  D.ID = I.DateRangeID
INNER JOIN
  ThingsToJoin T
ON
  T.ThingDate = I.DateRangeDate

CodePudding user response:

In my opinion the answer depends very much on the use case, the data and the amount of data you will have in future. So you won't end up with a clear answer here. That's why I recommend that you decide from a strategic point of view:

It is good practice, to keep OLTP use cases as flexible as possible, so use the BETWEEN approach here. But you should try to keep your model as strict as possible to avoid multiple rows when joining with BETWEEN if this is appropriate to your use case (not in your use case obviously).

In an OLAP use case, it is most of the time better to create clean relations by inserting a table where a row for each date exists (CROSS JOIN). The downside of having many rows isn't that relevant in those cases.

CodePudding user response:

As mentioned above, it all depends on your usage.

By using between you can limit your query to the specified date range dynamically. In comparison, the join would pop out every row that contains columns from both tables.

At the end of the day number of rows returned is what's going to determine the cost of bandwidth between the client and the server. So while in development try and test with both and see which one is more fruitful for future cases.

  • Related