I'm trying to build a pandas DataFrame from a query against a PostgreSQL/TimescaleDB database. The code below works, but the problem is that the query returns an enormous result set for the one point I am after, /5231/REGCHG total_power. The data has been recorded for a few years at 1-minute intervals, and I can get it all ordered nicely with the ORDER BY minute, but I don't need that much data. I only need to go back 3 days from the current day.
I'm averaging the data into 15-minute increments, so 3 days in 15-minute increments is the last 288 rows of the query below. Would a quick hack be just returning the tail(288) of this query, if that is possible in SQL? Not a lot of wisdom here, any tips appreciated.
I also tried ORDER BY minute LIMIT 3000, which I have commented out, because it would only limit the result to the first 3000 entries, i.e. from when the data first started recording. I need the last entries, not the first. Hopefully this makes sense.
import psycopg2
from sqlalchemy import create_engine
import pandas as pd
import datetime
SQL_PASS = 'asdf'
SQL_USER = 'asf'
SQL_HOST = '10.10.10.10'
dsn = f'postgresql://{SQL_USER}:{SQL_PASS}@{SQL_HOST}:5432/my_db'
dbengine = create_engine(dsn)
qry = '''
SELECT
time_bucket('15 minute', "time") AS minute,
metric,
avg(value)
FROM slipstream_volttron
WHERE metric LIKE '/5231/REGCHG total_power'
GROUP BY minute, metric
ORDER BY minute
'''
#ORDER BY minute limit 3000
dbconn = psycopg2.connect(host=SQL_HOST, dbname='my_db', user=SQL_USER, password=SQL_PASS, connect_timeout=5)
t1 = datetime.datetime.utcnow()
df = pd.read_sql(qry, dbconn, index_col='minute', parse_dates=True)
t2 = datetime.datetime.utcnow()
duration = t2 - t1
seconds = duration.total_seconds()
hours = seconds // 3600
minutes = (seconds % 3600) // 60
seconds = seconds % 60
elapsed_time = f' db retrieval time is {minutes} minutes, {seconds} seconds'
print(elapsed_time)
print(df.columns)
print(df.head())
print(df.describe())
df.to_csv('main_meter_data.csv')
dbconn.close()
CodePudding user response:
"I need the last entries not the first"
So use DESCENDING sort order:
SELECT time_bucket('15 minute', "time") AS minute
     , avg(value)
FROM  (
   SELECT time, value
   FROM   slipstream_volttron
   WHERE  metric = '/5231/REGCHG total_power'
   ORDER  BY time DESC   -- newest rows first!
   LIMIT  4320           -- 3 days of 1-minute rows: 24 * 60 * 3
   ) sub
GROUP  BY minute
ORDER  BY minute DESC;
And limit in a subquery before you aggregate. This way, an index on (metric, time) gives you a (bitmap) index scan. Append INCLUDE (value) to the index if that gets you index-only scans.
And maybe add something like WHERE time > '2023-01-01' to the index to make it a very small partial index. Then mirror the same filter in the query (logically redundantly) to match the index.
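Taken together, the suggested index might look something like this (a sketch only; the index name and the cut-off date are placeholders to adapt):

CREATE INDEX slipstream_volttron_metric_time_idx   -- placeholder name
    ON slipstream_volttron (metric, "time")
    INCLUDE (value)                                 -- for index-only scans
    WHERE "time" > '2023-01-01';                    -- makes it a small partial index

Remember that the query then has to repeat the same WHERE "time" > '2023-01-01' condition (or a stricter one) for the planner to consider the partial index.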
Since Postgres 14 you can also use date_bin() from core Postgres instead of time_bucket().
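A sketch of the same aggregation with date_bin() instead of time_bucket(); the origin timestamp is arbitrary and only anchors where the 15-minute buckets start (this assumes "time" is timestamptz, adjust the origin's type otherwise):

SELECT date_bin('15 minutes', "time", timestamptz '2000-01-01') AS minute
     , avg(value)
FROM  (
   SELECT time, value
   FROM   slipstream_volttron
   WHERE  metric = '/5231/REGCHG total_power'
   ORDER  BY time DESC
   LIMIT  4320
   ) sub
GROUP  BY minute
ORDER  BY minute DESC;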
CodePudding user response:
As suggested above, include a WHERE clause on the time column:
WHERE time > now() - interval '3 days'
TimescaleDB will be able to use this predicate to prune the chunks it plans and executes against, so only the chunks covering the last 3 days are scanned.
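Combined with the 15-minute bucketing from the question, the whole query could then look roughly like this (a sketch reusing the table and metric names from the question):

SELECT time_bucket('15 minute', "time") AS minute
     , metric
     , avg(value)
FROM   slipstream_volttron
WHERE  metric = '/5231/REGCHG total_power'
AND    "time" > now() - interval '3 days'
GROUP  BY minute, metric
ORDER  BY minute;

That returns roughly the 288 rows mentioned in the question (3 days × 96 buckets per day) without pulling the full history first.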