I'm trying to build a pandas DataFrame from a query against a PostgreSQL/TimescaleDB database. The code below works, but the problem is that the query returns an enormous result set for the one point I am after, /5231/REGCHG total_power. The data has been recorded for a few years at 1-minute intervals, and I can get it all ordered nicely with the ORDER BY minute, but I don't need that much data. I only need to go back 3 days from the current day.
I'm averaging the data into 15-minute increments, so 3 days in 15-minute increments is the last 288 rows of the query below. Would a quick hack be just returning the tail(288) of this query, if that is possible in SQL? Not a lot of wisdom here, any tips appreciated.
I also tried ORDER BY minute LIMIT 3000, which I have commented out, because it would only limit the result to the first 3000 entries, i.e. from when the data first started recording. I need the last entries, not the first. Hopefully this makes sense.
import psycopg2
from sqlalchemy import create_engine
import pandas as pd
import datetime
SQL_PASS = 'asdf'
SQL_USER = 'asf'
SQL_HOST = '10.10.10.10'
dsn = f'postgresql://{SQL_USER}:{SQL_PASS}@{SQL_HOST}:5432/my_db'
dbengine = create_engine(dsn)
qry = '''
SELECT
time_bucket('15 minute', "time") AS minute,
metric,
avg(value)
FROM slipstream_volttron
WHERE metric LIKE '/5231/REGCHG total_power'
GROUP BY minute, metric
ORDER BY minute
'''
#ORDER BY minute limit 3000
dbconn = psycopg2.connect(host=SQL_HOST, dbname='my_db', user=SQL_USER, password=SQL_PASS, connect_timeout=5)
t1 = datetime.datetime.utcnow()
df = pd.read_sql(qry, dbconn, index_col='minute', parse_dates=True)
t2 = datetime.datetime.utcnow()
duration = t2 - t1
seconds = duration.total_seconds()
hours = seconds // 3600
minutes = (seconds % 3600) // 60
seconds = seconds % 60
elapsed_time = f' db retrieval time is {minutes} minutes, {seconds} seconds'
print(elapsed_time)
print(df.columns)
print(df.head())
print(df.describe())
df.to_csv('main_meter_data.csv')
dbconn.close()
CodePudding user response:
"I need the last entries not the first"
So use DESCENDING sort order:
SELECT time_bucket('15 minute', "time") AS minute
     , avg(value)
FROM  (
   SELECT time, value
   FROM   slipstream_volttron
   WHERE  metric = '/5231/REGCHG total_power'
   ORDER  BY time DESC   -- newest rows first!
   LIMIT  4320           -- 3 days of 1-minute rows: 24 * 60 * 3
   ) sub
GROUP  BY minute
ORDER  BY minute DESC;
And limit in a subquery before you aggregate. This way, an index on (metric, time) gives you a (bitmap) index scan. Append INCLUDE (value) to the index if that gets you index-only scans.
And maybe add something like WHERE time > '2023-01-01' to the index to make it a very small partial index. Then mirror the same filter in the query (logically redundantly) to match the index.
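Taken together, the suggested index might look something like this (a sketch only; the index name and the cut-off date are placeholders to adapt):

CREATE INDEX slipstream_volttron_metric_time_idx   -- placeholder name
    ON slipstream_volttron (metric, "time")
    INCLUDE (value)                                 -- for index-only scans
    WHERE "time" > '2023-01-01';                    -- makes it a small partial index

Remember that the query then has to repeat the same WHERE "time" > '2023-01-01' condition (or a stricter one) for the planner to consider the partial index.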
Since Postgres 14 you can also use date_bin() from core Postgres instead of time_bucket().
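A sketch of the same aggregation with date_bin() instead of time_bucket(); the origin timestamp is arbitrary and only anchors where the 15-minute buckets start (this assumes "time" is timestamptz, adjust the origin's type otherwise):

SELECT date_bin('15 minutes', "time", timestamptz '2000-01-01') AS minute
     , avg(value)
FROM  (
   SELECT time, value
   FROM   slipstream_volttron
   WHERE  metric = '/5231/REGCHG total_power'
   ORDER  BY time DESC
   LIMIT  4320
   ) sub
GROUP  BY minute
ORDER  BY minute DESC;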
CodePudding user response:
As suggested above, include a WHERE clause on the time column:
WHERE time > now() - interval '3 days'
TimescaleDB will be able to use this predicate to prune the chunks it plans and executes against, so only the chunks covering the last 3 days are scanned.
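Combined with the 15-minute bucketing from the question, the whole query could then look roughly like this (a sketch reusing the table and metric names from the question):

SELECT time_bucket('15 minute', "time") AS minute
     , metric
     , avg(value)
FROM   slipstream_volttron
WHERE  metric = '/5231/REGCHG total_power'
AND    "time" > now() - interval '3 days'
GROUP  BY minute, metric
ORDER  BY minute;

That returns roughly the 288 rows mentioned in the question (3 days × 96 buckets per day) without pulling the full history first.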