Home > Mobile >  SQL Server sampling large volume of data per hour
SQL Server sampling large volume of data per hour

Time:01-25

Im using SQL Server 2016, and have a very large table containing millions of rows of data from different sources at irregular intervals over several years. The table cannot be altered, typical data looks like this -

Reading_ID    Source    Date                  Reading
==========    ======    ====                  =======
1             1         2023/01/01 00:04:00   7
2             1         2023/01/01 00:10:00   3
3             2         2023/01/01 00:15:00   8
4             1         2023/01/01 01:00:00   2
5             2         2023/01/01 01:03:00   15

The table has CONSTRAINT [PK_DATA_READINGS] PRIMARY KEY CLUSTERED ([Source] ASC, [Date] ASC). The SOURCE can be any number, its not fixed or known in advance. New sources can start at any time.

What I want to do is specify a date range and an interval in hours, then just get 1 reading from each source every X hours. i.e. in the above row 2 wouldn't be returned as its too close to row 1

I've tried something like the following -

DECLARE @Start_Date DATETIME = '2023/01/01 00:00:00',
        @End_Date DATETIME = '2023/02/01 00:00:00',
        @Interval_Hours = 4

;WITH HOURLY_DATA AS (
     SELECT d.Source, 
            d.Date, 
            d.Reading,
            ROW_NUMBER() OVER (PARTITION BY d.Source, DATEDIFF(HOUR, @Start_Date, d.DATE) / @Interval_Hours ORDER BY d.SOURCE, d.DATE) AS SOURCE_HOUR_ROW
     FROM data_readings d
     WHERE d.DATE BETWEEN @Start_Date AND @End_Date
)
SELECT h.Source,
       h.Date,
       h.Reading
FROM HOURLY_DATA h
WHERE h.SOURCE_HOUR_ROW = 1

But its still very slow to execute, sometimes taking 5 minutes or more to complete. I would like a faster way to get this data. I've looked at the Explain Plan, but cant see an obvious solution.

Thank you for looking.

CodePudding user response:

You say the Source column has no table that it correlates to. This significantly worsens performance options, as it means you have no way of skipping through your (Source, Date) index by date.

Ideally you would have a table containing a list of possible Source values using a foreign-key relationship. There is no reason why you couldn't update this dynamically.

However, you can hack it with an indexed view.

CREATE VIEW dbo.vAllSources
WITH SCHEMABINDING
AS

SELECT
  dr.Source,
  COUNT_BIG(*) AS Count
FROM dbo.data_readings dr
GROUP BY
  dr.Source;
CREATE UNIQUE CLUSTERED INDEX UX_AllSources ON AllSources (Source);

The server will efficiently maintain this index based off the original table.

Then you can do a simple join. Use the NOEXPAND hint to force it to use the index.

DECLARE @Start_Date DATETIME = '20230101 00:00:00',
        @End_Date DATETIME = '20230201 00:00:00',
        @Interval_Hours = 4;

WITH HOURLY_DATA AS (
    SELECT
      d.Source, 
      d.Date, 
      d.Reading,
      ROW_NUMBER() OVER (PARTITION BY d.Source, DATEDIFF(HOUR, @Start_Date, d.DATE) / @Interval_Hours ORDER BY d.DATE) AS SOURCE_HOUR_ROW
    FROM AllSources s
    JOIN data_readings d
      ON s.Source = d.Source
     AND d.DATE BETWEEN @Start_Date AND @End_Date
)
SELECT h.Source,
       h.Date,
       h.Reading
FROM HOURLY_DATA h
WHERE h.SOURCE_HOUR_ROW = 1;

Note that BETWEEN on date values is generally not recommended, as it implies >= AND <=. You are far better of using a half-open interval:

     AND d.DATE >= @Start_Date AND d.DATE < @End_Date

You should also use non-ambiguous date formats.

CodePudding user response:

The slowness is caused by the volume of data in the CTE. I found this solution which works faster How to sample records by time

  • Related