Home > Blockchain >  Bigquery: WHERE clause using column from outside the subquery
Bigquery: WHERE clause using column from outside the subquery

Time:09-23

New to Bigquery, and googling could not really point me to the solution of the problem.

I am trying to use a where clause in a subquery to filter and pick the latest row for each other row in the main query. In postgres I'd normally do it like this:

SELECT
    *
FROM
    table_a AS a
    LEFT JOIN LATERAL
    (
        SELECT
            score,
            CONCAT( "AB", id ) AS id 
        FROM
            table_b AS b
        WHERE
            id = a.company_id
            and 
            b.date < a.date
        ORDER BY
            b.date DESC 
        LIMIT
            1
    ) ON true

WHERE
    id LIKE 'AB%'

ORDER BY
    createdAt DESC

so this would essentially run the subquery against each row and pick the latest row from table B based on a given row's date from table A.

So if table A would have a row

id date
12 2021-05-XX

and table B:

id date value
12 2022-01-XX 99
12 2021-02-XX 98
12 2020-03-XX 97
12 2019-04-XX 96

It would have joined only the row with 2021-02-XX to table a.


In another example, with

Table A:

id date
15 2021-01-XX

Table B:

id date value
15 2022-01-XX 99
15 2021-02-XX 98
15 2020-03-XX 97
15 2019-04-XX 96

it would join only the row with date: 2020-03-XX, value: 97.

Hope that is clear, not really sure how to write this query to work

Thanks for help!

CodePudding user response:

You can replace some of your correlated sub-select logic with a simple join and qualify statement.

Try the following:

SELECT *
FROM table_a a
LEFT JOIN table_b b
  ON a.id = b.id
WHERE b.date < a.date
QUALIFY ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.date desc) = 1

With your sample data it produces: enter image description here

This should work for both truncated dates (YYYY-MM) as well as full dates (YYYY-MM-DD)

CodePudding user response:

Something like below should work for your requirements

    WITH
  latest_record AS (
  SELECT
    a.id,
    value,b.date, a.createdAt
  FROM
    `gcp-project-name.data-set-name.A` AS a
  JOIN
    `gcp-project-name.data-set-name.B` b
  ON
    ( a.id = b.id
      AND b.date < a.updatedAt )
  ORDER BY
    b.date DESC
  LIMIT
    1 )
SELECT
  *
FROM
  latest_record

I ran this with table A as

enter image description here

and table B as

enter image description here

and get result

enter image description here

  • Related