My SQL knowledge is limited and I'm not sure how to even phrase this question to find some answers.
My problem is: I have a database full of candlestick objects:
class Candle(Base):
__tablename__ = "candle"
id = Column(Integer, primary_key=True, autoincrement=True)
pool_id = Column(Integer, ForeignKey('pool.id'), index=True, nullable=False)
type = Column(String(2), nullable=False, index=True)
timestamp = Column(Integer, nullable=False, index=True)
open = Column(Float, nullable=False)
low = Column(Float, nullable=False)
high = Column(Float, nullable=False)
close = Column(Float, nullable=False)
Currently there are 4.1M candles in the database but I've found a bug in the system: The "open" property has to be exactly the same as the "close" property of the last existing candle (based on the timestamp) but currently that is not the case. I have to also respect their type ('1M', '1H', '1D') and pool_id (which there are thousands of).
Normally I would do this using Python and load candles for pool and type, check them one by one and update the ones that don't have equal values, but since the DB is now large, it would take a very long time. Our DB is in AWS and even if we ran it from there it would take a couple of days.
How can I write an SQL query (or a set of SQL queries) that would go through the candles one by one and check if the previous "close" is the same as current "open", and if not, updated them to make that true?
Any help very appreciated!
Edit: example data
id | pool_id | type | timestamp | open | low | high | close |
---|---|---|---|---|---|---|---|
1 | 25095 | 1M | 1642592640 | 0.7723 | 0.7723 | 0.7806 | 0.7732 |
2 | 25095 | 1M | 1642592700 | 0.7813 | 0.7557 | 0.7813 | 0.7813 |
As you can see, first candle close is 0.7732 but next candle open is 0.7813
For this example, the second candle values should be:
open: 0.7732, low: 0.7557, high: 0.7813 close: 0.7813.
edit 2: my idea of the sql would be:
UPDATE candle
SET open = LAG(close, 1) OVER (PARTITION BY type, pool_id ORDER BY timestamp)
WHERE id IN (
SELECT id
FROM candle c
WHERE c.open != LAG(c.close, 1) OVER (PARTITION BY type, pool_id ORDER BY timestamp)
);
CodePudding user response:
Something like this should work
WITH cte AS (
SELECT id, LAG(close, 1) OVER (PARTITION BY type, pool_id ORDER BY timestamp) prev_close
FROM candle
)
UPDATE c
SET c.open = c2.prev_close,
c.low = LEAST(c2.prev_close, c.open, c.low, c.high, c.close),
c.high = GREATEST(c2.prev_close, c.open, c.low, c.high, c.close)
FROM candle c
JOIN cte c2
ON c.id = c2.id
WHERE c2.prev_close IS NOT NULL
AND c.open != c2.prev_close;
CodePudding user response:
Your edit 2 query is conceptually very close, However, there are 2 flaws:
- Window functions (here lag) cannot be used in the
where
clause. This stems from Window functions being the last phase processed except the final sort. So by time to execute the Window function thewhere
clause is long over with. Google SQL Order of Operations for more. - When using the
update ... from
structure you should not repeat the table name being updated.
Correcting for these results in: (see demo)
with cdata(candle_id, prior_close) as
(select candle_id, lag(close, 1) over (partition by type, pool_id order by ephoc_tm)
from candles
) --select * from cdata;
update candles c
set open = cd.prior_close
, updated_ts = now()
from cdata cd
where c.candle_id = cd.candle_id
and cd.prior_close is not null
and cd.prior_close is distinct from c.open;
Cautions: (At least IMHO)
- Using a column name
timestamp
is a poor practice because it it a Postgres data type. Using data types as column names leads to confusion. Especially, when the columntimestamp
is not a timestamp data type. - Be cautious for column names
open
andclose
. While they are not reserved words for Postgres, they are reserved words in the SQL Standard.
Note: For demo I user ephoc_tm
instead of timestamp for column name. Also included additional test rows. I never trust a test with essentially a single item nor in which all cases meet the criteria being tested.