Initial Load on Day 1
id | key | fkid |
---|---|---|
1 | 0 | 100 |
1 | 1 | 200 |
2 | 0 | 300 |
Load on Day 2
id | key | fkid |
---|---|---|
1 | 0 | 100 |
1 | 1 | 200 |
2 | 0 | 300 |
3 | 1 | 400 |
4 | 0 | 500 |
Need to find delta records Load on Day 2
id | key | address |
---|---|---|
3 | 1 | 400 |
4 | 0 | 500 |
Problem Statement Need to find delta records in minimum time with following facts 1: I have to process around 2 billion records initially from a table as mentioned below 2: Also need to find delta with minimal time so that I can process it quickly
Questions : 1: Will it be a time consuming process to identify delta especially during production downtime ? 2: How long should it take to identify delta with 3 numeric columns in a table out of which id & key forms a composite key.
Solution tried : 1: Use full join and extract delta with case nvl condition but looks to be costly.
nvl(node1.id, node2.id) id,
nvl(node1.key, node2.key) key,
nvl(node1.fkid, node2.fkid) fkid
FROM
TABLE_DAY_1 node1
FULL JOIN TABLE_DAY_2 node2 ON node2.id = node1.id
WHERE
node2.id IS NULL
OR node1.id IS NULL;```
CodePudding user response:
You need two separate statements to handle this, one to detect new & changed rows, a separate one to detect deleted rows.
While it is cumberson to write, the fastest comparison is field-by-field, so:
SELECT /* parallel(8) full(node1) full(node2) USE_HASH(node1 node) */ *
FROM table_day_1 node1,
table_day_2 node2
WHERE node1.id = node2.id( )
AND (node2.id IS NULL -- new rows
OR node1.col1 <> node2.col2 -- changed val on non-nullable col
OR NVL(node1.col3,' ') <> NVL(node2.col3,' ') -- changed val on nullable string
OR NVL(node1.col4,-1) <> NVL(node2.col4,-1) -- changed val on nullable numeric, etc..
)
Then for deleted rows:
SELECT /* parallel(8) full(node1) full(node2) USE_HASH(node1 node) */ node2.id
FROM table_day_1 node1,
table_day_2 node2
WHERE node1.id( ) = node2.id
AND node1.id IS NULL -- deleted rows
You will want to make sure Oracle does a full table scan. If you have lots of CPUs and parallel query is enabled on your database, make sure the query uses parallel query (hence the hint). And you want a hash join between them. Work with your DBA to ensure you have enough temporary space to pull this off, and enough PGA to at least handle this with a single pass workarea rather than multipass.