Home > OS >  How to compare two temporally-distinct but otherwise identical tables in BigQuery/SQL?
How to compare two temporally-distinct but otherwise identical tables in BigQuery/SQL?

Time:10-30

I have two tables, curr and prev. Every day, curr is copied to prev and becomes the new prev. Then, curr is updated from an external source. The tables have the same schema. I"m using the BQ flavour of SQL.

The rough layout of the tables is a primary email as the unique key and then the record of a subscription. The same user (as identified by primary email) may have multiple subscriptions historically, although only one subscription will be currently active. Fields include things like subscription start date, subscription amount, cancellation status etc.

I am looking to SELECT only the differences between the two tables i.e. only the records which have recently changed their e.g. start date, amount, or cancellation status.

However, I'm struggling. I've tried SELECTing DISTINCT from the two tables joined where curr.start_date!=prev.start_date (OR any of the other fields of interest), but its just giving me the people who have previously held more than one subscription. There may be a way to return only the latest subscription using PARTITION BY and ROW ORDER but I'm not sure how to do this.

I've also tried SELECT* FROM curr EXCEPT DISTINCT SELECT* FROM prev, but I think this is only giving me new- rather than updated- rows.

Is anyone able to point me in the right direction? Many thanks in advance.

Some tabular data- both curr and prev look like this. In the case of an update, a new entry is created in curr, with the Primary_Email remaining the same.

primary_email start_date status cancellation_type amount frequency bundle
[email protected] 2020-10-14 Active Active 300 monthly a
[email protected] 2021-01-12 Cancelled Pre-start 12 monthly b
[email protected] 2019-08-14 Failure Post-start 25 annually b
[email protected] 2020-04-24 Active Active 555 monthly a
[email protected] 2021-02-12 Failure Post-start 15 annually c

My desired output is the same table (those fields from curr), but only the rows which are present in curr but not in prev.

CodePudding user response:

I think you were close. You could just add conditions to each of your select to make sure you only consider email addresses that are shared across both tables.

select * 
from curr 
where email in (select email from prev)

except 

select * 
from prev
where email in (select email from curr)
  • Related