If I am loading data in batches and, as part of the loading pipeline, joining to a very large reference table to grab some values, is there an advantage to creating a temp table out of that large reference table and using that during the batched data load?
I am wondering if, during each batch, the reference table has to be pulled off of disk, and whether by loading the table into a temp table I can keep it in memory and bypass that expensive disk operation. Or, what is the best practice for using a very large reference table during batched loads? BTW, the table is about 3 GB, so big, but not too big to keep in memory.
Answer:
is there an advantage to creating a temp table out of that large reference table and using that during the batched data load
No.
The data in SQL Server is organised into 8 KB chunks, called "pages". SQL Server will always satisfy a query by performing operations against pages in RAM. If the pages it needs are not in RAM, it will pull them from disk into RAM, and then perform the operations there. Those pages will then stay in RAM.
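You can actually watch this happen. Here is a minimal sketch (assuming you have VIEW SERVER STATE permission) that counts how many pages of each table in the current database are sitting in the buffer pool right now; the join through sys.allocation_units is simplified to cover only in-row and row-overflow data:

```sql
-- Roughly how much of each user table in the current database is cached in RAM.
-- Simplified sketch: covers in-row and row-overflow allocation units only.
SELECT o.name              AS table_name,
       COUNT(*)            AS cached_pages,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
   AND au.type IN (1, 3)           -- IN_ROW_DATA, ROW_OVERFLOW_DATA
JOIN sys.partitions AS p
    ON p.hobt_id = au.container_id
JOIN sys.objects AS o
    ON o.object_id = p.object_id
WHERE bd.database_id = DB_ID()
  AND o.is_ms_shipped = 0          -- skip system objects
GROUP BY o.name
ORDER BY cached_pages DESC;
```

Run it before and after a batch and you can see the reference table's pages accumulating in RAM.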
"Forever".
Except... what if we don't have enough RAM to store all of the data?
If SQL Server needs a page that is not currently in RAM, but there is no more RAM available for it to use, it has to clear out some other page to make room for the new one. It has an algorithm for deciding which page should be cleared out, roughly based on what has been "least recently used" (technically LRU-2, I believe).
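One quick way to tell whether your server is churning pages like this is the "Page life expectancy" counter, which tracks how long a page survives in the buffer pool before being evicted. A sketch:

```sql
-- How long (in seconds) pages are surviving in the buffer pool.
-- Low or steadily sinking values suggest eviction under memory pressure.
SELECT object_name, cntr_value AS page_life_expectancy_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';
```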
The end result is this: if you have just finished reading a bunch of rows from VeryLargeReferenceTable, and all of the pages needed to satisfy that query were able to fit in RAM, then as long as SQL Server isn't forced to flush your pages out of RAM because of other queries being run, all of that data for VeryLargeReferenceTable is still in RAM when you run your next batch.
Now, what if you create the temp table as a copy of VeryLargeReferenceTable (henceforth VLRT)?
SQL Server clearly has to read the data out of VLRT to do that - which means getting the VLRT pages into RAM! That's what it would have had to do if we just joined to VLRT directly!
But it also has to write all the data from VLRT to the temp table. So it also has to allocate new pages for the temp table in RAM! And these are going to be "dirty" pages (they've been modified), so if we start getting memory pressure we might have to write them to disk! Yikes.
In addition to this, you probably (hopefully!) have a useful index on VLRT for this query, yes? We want to create that index on the temp table as well, right? More page allocations, more page IO, more CPU time, to build that index.
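For contrast, here is what the temp-table detour looks like. Note that the first statement already reads every VLRT page into RAM, which is all the direct join needed, and then we pay twice more on top (column names hypothetical):

```sql
-- Step 1: read every page of VLRT into RAM (unavoidable either way),
-- then ALSO allocate and dirty a full copy of those pages for #VLRT.
SELECT RefKey, SomeValue
INTO #VLRT
FROM dbo.VeryLargeReferenceTable;

-- Step 2: rebuild the index we presumably already had on VLRT:
-- more page allocations, more page IO, more CPU time.
CREATE CLUSTERED INDEX IX_VLRT_RefKey ON #VLRT (RefKey);
```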
So by using the temp table we did everything we had to do without the temp table, plus a whole lot more.