How to do an inner join rather than for each loop in SSIS?

Time:01-02

On the ETL server I have a DW user table.

On the prod OLTP server I have the sales database. I want to pull the sales only for users that are present in the user table on the ETL server.

Presently I use an Execute SQL Task to fetch the DW users into an SSIS System.Object variable. A Foreach Loop then iterates over each item (userid) in this variable, and a Data Flow Task inside it fetches that user's rows from the OLTP sales table and dumps them into the DW staging table. The Foreach Loop takes a long time to run.

I want to do an inner join so that the response is quicker, but I can't, since the tables are on separate servers. For the same reason, I can't use a global temp table to make the inner join either.

I tried collecting the DW users into a comma-separated string variable and using it (via string_split) in the query against OLTP, but this also spends a long time in the pre-execute phase (not sure why exactly), even for a small number of users.

I am also aware of the Lookup transform, but that too would bring all OLTP rows over to the DW ETL server to test the lookup condition.

Is there an alternative approach that lets me do an inner join by passing the list of users to the source?

Note: I do not have write permissions on the OLTP db.

CodePudding user response:

Based on the comments, I think we can use a temporary table to solve this.

Can you help me understand this restriction? "Neither can I use a global temp table to make the inner join, for the same reason."

The restriction is that the OLTP server and the DW server are separate, so a global temp table can't be common to both servers. Hope that makes sense.

The general pattern we're going to follow is:

  1. Execute SQL Task to create a temporary table on the OLTP server
  2. A Data Flow task to populate the new temporary table. Source = DW. Destination = OLTP. Ensure Delay Validation = True
  3. Modify existing Data Flow. Modify source to be a query that uses the temporary table i.e. SELECT S.* FROM oltp.sales AS S WHERE EXISTS (SELECT * FROM #SalesPerson AS SP WHERE SP.UserId = S.UserId); Ensure Delay Validation = True
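The SQL side of the steps above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the answerer's actual package: the INT type of UserId is a guess, and because a local temporary table lives only for the session that created it, the sketch assumes the OLTP connection manager has RetainSameConnection set to True so that all three steps share one session.

```sql
-- Step 1 (Execute SQL Task on the OLTP connection): create the temp table.
-- Assumption: UserId is an INT; match the real column type in oltp.sales.
IF OBJECT_ID('tempdb..#SalesPerson') IS NOT NULL
    DROP TABLE #SalesPerson;

CREATE TABLE #SalesPerson
(
    UserId INT NOT NULL
);

-- Step 2 is the Data Flow that copies the DW user ids into #SalesPerson.

-- Step 3 (source query of the existing Data Flow, same OLTP connection):
-- only sales rows whose user exists in the temp table leave the OLTP server.
SELECT S.*
FROM oltp.sales AS S
WHERE EXISTS
(
    SELECT 1
    FROM #SalesPerson AS SP
    WHERE SP.UserId = S.UserId
);
```

Temporary tables are created in tempdb, so this may work without write permissions on the OLTP database itself, but that depends on how the server is configured.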

See the long form answer on using temporary tables (global to set the metadata, regular thereafter).

I get 67k rows, as there are only 3 employees in the temporary table.


Reference package


But wait, there's more!

Close Visual Studio, open it back up and try to touch something in the data flows. Suddenly, there are red Xs everywhere! Any time you close a data flow component, it fires a revalidate-metadata operation and, guess what, it can't do that because the connection to the temporary table is gone. The package will run fine and will not throw VS_NEEDSNEWMETADATA, but editing/maintenance becomes a pain.

If you switched from global temporary table to local, switch the table name variable's value back to a global and then run the define statement in SSMS. Once that's done, then you can continue editing the package.
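As an illustration, the "define statement" could look something like the following. The table name ##SalesPerson and the INT column are assumptions carried over from the earlier query, not taken from the reference package.

```sql
-- Run in SSMS against the OLTP server before editing the package.
-- ##SalesPerson and its single INT column are illustrative assumptions.
IF OBJECT_ID('tempdb..##SalesPerson') IS NOT NULL
    DROP TABLE ##SalesPerson;

CREATE TABLE ##SalesPerson
(
    UserId INT NOT NULL
);
```

Keep that SSMS session open while you edit: a global temporary table is dropped once the session that created it ends and nothing else is using it.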

I assure you, the local temporary table does work once you have the metadata set and you use queries via variables for source/destination.


CodePudding user response:

No need for the global temporary table hack, or the SET FMTONLY OFF hack (which no longer works).

Just specify the result set metadata in the SQL query with WITH RESULT SETS, e.g.:

-- Run the temp table batch as dynamic SQL and declare its output shape explicitly.
EXEC ('
CREATE TABLE #t
(
  ID INT,
  Name VARCHAR(150),
  Number VARCHAR(15)
);

INSERT INTO #t (ID, Name, Number)
SELECT object_id, name, 12
FROM sys.objects;

SELECT * FROM #t;
')
WITH RESULT SETS
(
 (
  ID INT,
  Name VARCHAR(150),
  Number VARCHAR(15)
 )
);

If you need to parameterize the query, there's a bit of a catch because of limitations in how SSIS discovers parameters. SSIS runs sp_describe_undeclared_parameters, which doesn't really work with batches that call sp_executesql, because sp_executesql handles parameters in a unique way, one you couldn't replicate with a user stored procedure.

So to parameterize the query you'll either need to pass the parameter values into the query using the "query from variable" option and SSIS expressions, or push all this T-SQL into a stored procedure.
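A rough sketch of the stored procedure option might look like this. The procedure name dbo.usp_TempTableDemo and the @MinObjectId parameter are hypothetical, chosen only to reuse the sys.objects demo query, and the procedure has to live in a database where you are allowed to create objects.

```sql
-- Hypothetical procedure: the name and @MinObjectId are illustrative only.
CREATE PROCEDURE dbo.usp_TempTableDemo
    @MinObjectId INT
AS
BEGIN
    SET NOCOUNT ON;

    CREATE TABLE #t
    (
        ID INT,
        Name VARCHAR(150),
        Number VARCHAR(15)
    );

    INSERT INTO #t (ID, Name, Number)
    SELECT object_id, name, 12
    FROM sys.objects
    WHERE object_id >= @MinObjectId;

    SELECT ID, Name, Number FROM #t;
END;
GO

-- Source query in the SSIS component. The ? marker is mapped to a package
-- variable there; replace it with a literal to try the call in SSMS.
EXEC dbo.usp_TempTableDemo ?
WITH RESULT SETS
(
    (
        ID INT,
        Name VARCHAR(150),
        Number VARCHAR(15)
    )
);
```

The WITH RESULT SETS clause stays on the call because the procedure's output still comes from a temporary table, so the result shape is declared there rather than discovered.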
