How to update each row sequentially based on the last row in a single SQL statement?-CodePudding

My SQL knowledge is limited and I'm not sure how to even phrase this question to find some answers.

My problem is: I have a database full of candlestick objects:

class Candle(Base):
    __tablename__ = "candle"
    id = Column(Integer, primary_key=True, autoincrement=True)
    pool_id = Column(Integer, ForeignKey('pool.id'), index=True, nullable=False)
    type = Column(String(2), nullable=False, index=True)
    timestamp = Column(Integer, nullable=False, index=True)
    open = Column(Float, nullable=False)
    low = Column(Float, nullable=False)
    high = Column(Float, nullable=False)
    close = Column(Float, nullable=False)

Currently there are 4.1M candles in the database but I've found a bug in the system: The "open" property has to be exactly the same as the "close" property of the last existing candle (based on the timestamp) but currently that is not the case. I have to also respect their type ('1M', '1H', '1D') and pool_id (which there are thousands of).

Normally I would do this using Python and load candles for pool and type, check them one by one and update the ones that don't have equal values, but since the DB is now large, it would take a very long time. Our DB is in AWS and even if we ran it from there it would take a couple of days.

How can I write an SQL query (or a set of SQL queries) that would go through the candles one by one and check if the previous "close" is the same as current "open", and if not, updated them to make that true?

Any help very appreciated!

Edit: example data

id	pool_id	type	timestamp	open	low	high	close
1	25095	1M	1642592640	0.7723	0.7723	0.7806	0.7732
2	25095	1M	1642592700	0.7813	0.7557	0.7813	0.7813

As you can see, first candle close is 0.7732 but next candle open is 0.7813

For this example, the second candle values should be:

open: 0.7732, low: 0.7557, high: 0.7813 close: 0.7813.

edit 2: my idea of the sql would be:

UPDATE candle
SET open = LAG(close, 1) OVER (PARTITION BY type, pool_id ORDER BY timestamp)
WHERE id IN (
  SELECT id 
  FROM candle c 
  WHERE c.open != LAG(c.close, 1) OVER (PARTITION BY type, pool_id ORDER BY timestamp)
);

CodePudding user response：

Something like this should work

WITH cte AS (
    SELECT id, LAG(close, 1) OVER (PARTITION BY type, pool_id ORDER BY timestamp) prev_close
    FROM candle
)
UPDATE c
SET c.open = c2.prev_close, 
    c.low = LEAST(c2.prev_close, c.open, c.low, c.high, c.close), 
    c.high = GREATEST(c2.prev_close, c.open, c.low, c.high, c.close)
FROM candle c
JOIN cte c2
ON c.id = c2.id
WHERE c2.prev_close IS NOT NULL
AND c.open != c2.prev_close;

CodePudding user response：

Your edit 2 query is conceptually very close, However, there are 2 flaws:

Window functions (here lag) cannot be used in the where clause. This stems from Window functions being the last phase processed except the final sort. So by time to execute the Window function the where clause is long over with. Google SQL Order of Operations for more.
When using the update ... from structure you should not repeat the table name being updated.

Correcting for these results in: (see demo)

    with cdata(candle_id, prior_close) as 
         (select  candle_id, lag(close, 1) over (partition by type, pool_id order by ephoc_tm) 
            from  candles
         ) --select * from cdata; 
    update candles c
       set open = cd.prior_close
         , updated_ts = now()
      from cdata cd
     where c.candle_id = cd.candle_id 
       and cd.prior_close is not null
       and cd.prior_close is distinct from c.open;

Cautions: (At least IMHO)

Using a column name timestamp is a poor practice because it it a Postgres data type. Using data types as column names leads to confusion. Especially, when the column timestamp is not a timestamp data type.
Be cautious for column names open and close. While they are not reserved words for Postgres, they are reserved words in the SQL Standard.

Note: For demo I user ephoc_tm instead of timestamp for column name. Also included additional test rows. I never trust a test with essentially a single item nor in which all cases meet the criteria being tested.