I am trying to do a full load of a very large table (600 million records) that resides in an on-premises Oracle database. My destination is an Azure Synapse Dedicated SQL Pool.
I have already tried the following:
- Using the ADF Copy activity with source partitioning, since the source table has 22 partitions
- Increasing the copy parallelism and DIUs to very high levels
Still, I am able to fetch only 150 million records in 3 hours, whereas the requirement is to complete the full load in around 2 hours, since the source will be frozen to users during that window so that Synapse can copy the data.
How can a full copy of the data be done from Oracle to Synapse within that time frame?
As an alternative, I also tried loading the data from Oracle to ADLS Gen2, but that is slow as well.
CodePudding user response:
There are a number of factors to consider here. Some ideas:
- how fast can the table be read? What indexing / materialized views are in place? Is there any contention at the database level to rule out? (A quick way to measure the raw scan speed is sketched after this list.)
- as you are on-premises, what is the local network card setup and throughput?
- as you are on-premises, you must be using a Self-hosted Integration Runtime (SHIR). What is the spec of this machine, eg 8GB RAM, SSD for spooling etc as per the minimum specification? Where is it located, eg 'near' the data source (in the same on-premises network) or in the cloud? It is possible to scale out SHIRs by having up to four nodes, but you should confirm via the metrics available to you that the SHIR is actually the bottleneck before scaling out.
- is the SHIR software version up to date? It is updated regularly, so it's good practice to keep it current.
- do you have ExpressRoute, or are you going across the public internet? ExpressRoute would probably be faster.
- you should almost certainly land directly in ADLS Gen2 or Blob Storage. Going straight into the database could cause contention there, and you would also be dealing with Synapse concepts such as transaction logging, DWUs, resource classes and queuing, among others. View the metrics for the storage account in the Azure portal to determine whether it is under stress. If it is (which I think unlikely), consider multiple storage accounts.
- what format are you landing in the lake? Converting to Parquet is quite compute-intensive, for example. Landing in the lake does leave an audit trail and gives you a position to recover from if things go wrong
- ultimately the best approach would be one big initial bulk load (say, over a weekend) and then incremental upserts using a change data capture (CDC) mechanism; the small incremental loads would comfortably fit your 2 hour window (a minimal watermark sketch is at the end of this answer)
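On the first bullet, a minimal sketch for measuring raw scan speed on the Oracle side; the table name, partition name and degree of parallelism are all assumptions you would substitute for your own:

```sql
-- Hypothetical table/partition names; substitute your own.
-- Timing a parallel full scan of one partition gives an upper bound on
-- how fast Oracle can hand rows to ADF, independent of the network.
SET TIMING ON   -- SQL*Plus command
SELECT /*+ FULL(s) PARALLEL(s, 8) */ COUNT(*)
FROM   sales PARTITION (sales_p01) s;
```

If even this is slow, no amount of ADF tuning on the other side will hit your window.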
In summary, it's probably your network, but you have a lot of investigation to do first, and then a number of the options listed above to work through.
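For the incremental pattern in the last bullet, a minimal watermark sketch, assuming the source table has a reliable LAST_MODIFIED column (a dedicated CDC tool such as Oracle GoldenGate would avoid even this scan):

```sql
-- LAST_MODIFIED and the bind variable names are assumptions.
-- After the one-off bulk load, each run pulls only the rows changed
-- since the last successful run, which is what keeps the window small.
SELECT *
FROM   sales
WHERE  last_modified >  :last_watermark
AND    last_modified <= :current_watermark;
```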
CodePudding user response:
wBob provided a good summary of things you could look at to increase your transfer speed. In addition to that, you could bulk export your data into chunks of data files and transfer the files to Azure Data Lake or Azure Blob Storage in parallel; this way you can maximize your network throughput.
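A sketch of the export side, assuming SQL*Plus spooling with one session per partition; the table, partition, column and path names are all illustrative:

```sql
-- Hypothetical names throughout. Run one session like this per partition
-- (or per chunk) so the extract files are produced concurrently.
SET HEADING OFF FEEDBACK OFF PAGESIZE 0 LINESIZE 32767 TRIMSPOOL ON
SPOOL /staging/sales_p01.csv
SELECT order_id || ',' || customer_id || ',' || amount
FROM   sales PARTITION (sales_p01);
SPOOL OFF
```

Each finished file can then be pushed to the lake with a parallel copy tool such as azcopy while the remaining partitions are still exporting.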
Once the data is in the data lake, you can scale up your Synapse instance and take advantage of fast loads using the COPY command.
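A minimal sketch of the COPY statement; the storage account, container, path and table names are placeholders:

```sql
-- Placeholder storage account, container, path and table names.
-- COPY reads the staged files straight from the lake into the
-- dedicated pool, bypassing the slower row-by-row insert path.
COPY INTO dbo.sales
FROM 'https://mystorageacct.blob.core.windows.net/staging/sales/*.csv'
WITH (
    FILE_TYPE       = 'CSV',
    FIELDTERMINATOR = ',',
    CREDENTIAL      = (IDENTITY = 'Managed Identity')
);
```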
I faced the same problem in our organization, and the fastest way to get the data out of SQL Server was to use bcp into a fast storage layer.
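For reference, the shape of that bcp call, with server, database and path names as placeholders:

```
rem Placeholder names; -c = character mode, -t "|" = pipe field
rem terminator, -T = trusted (Windows) connection.
bcp dbo.sales out D:\staging\sales.dat -S myserver -d mydb -T -c -t "|"
```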