About two year ago I did a post about alternatives for the poorly performing SCD Transformation in SSIS. And even in SSIS 2012 it is still unusably slow for larger data sets. In that post I showed a couple of alternatives, but I didn't workout the T-SQL Merge solution.
Solution
A small recap of the alternatives in that post:
- The SSIS Dimension Merge SCD from Pragmatic Works which is according to the developers about a 100 times faster.
- And there are lot of third party components that can do the trick like CozyRoc Table Difference ($), the Pragmatic Works Upsert Destination ($), Konesans Checksum Transformation and our own Checksum Transformation.
- And of course the standard Lookup Transformation solution which works quite well for 2005.
- But if you have SSIS 2008 or later, then you could also use the TSQL Merge statement. Let's elaborate that solution.
T-SQL Merge solution
For this example I use a text file with client information that I want to use to populate my Client dimension. To keep it simple I just use 4 columns: BusinessKey, Title, Firstname and Lastname.
1) Variables
For logging purposes we need to add three integer variables: staged, inserts, updates:
Variables for logging |
2) Staging
Next step is to load the flat file with clients into a staging table with a Data Flow Task. The staging table is truncated before loading and the number of records in the Data Flow Task is stored in the variable staged with a rowcount transformation. It should look something like this below:
3) Execute SQL Task
Now we have two tables (staging and dimension) and we need to merge them:
Staging and dimension |
For this we need to add an Execute SQL Task.
Execute SQL Task |
4) Execute SQL Task - General
Edit the task and set the ResultSet to Single row, select the OLE DB connection and enter the MERGE query. The MERGE query below does INSERTS and UPDATES only. We don't delete records from the dimension table. For logging purposes I added a count of the number of inserts and updates.
General pane |
The query to use:
-- Create a table variable to store the number of updates and inserts DECLARE @tableVar TABLE (MergeAction VARCHAR(20)) -- The actual MERGE statement MERGE dwh.Dim_Client as MyTarget USING ( SELECT BusinessKey , Title , FirstName , LastName FROM staging.Clients ) as MySource ON MyTarget.BusinessKey = MySource.BusinessKey WHEN MATCHED AND NOT ( MySource.Title = ISNULL(MyTarget.Title, '') AND MySource.FirstName = ISNULL(MyTarget.FirstName, '') AND MySource.LastName = ISNULL(MyTarget.LastName, '') ) THEN -- BusinessKey of client exists, but data is different UPDATE Set MyTarget.Title = MySource.Title , MyTarget.FirstName = MySource.FirstName , MyTarget.LastName = MySource.LastName WHEN NOT MATCHED BY TARGET THEN -- BusinessKey of client does not exist INSERT (BusinessKey, Title, FirstName, LastName) VALUES (MySource.BusinessKey, MySource.Title, MySource.FirstName, MySource.LastName) OUTPUT $action INTO @tableVar; -- Get the number of inserts and updates on one line SELECT SUM(Inserted) as Inserted , SUM(Updated) as Updated FROM ( -- Count the number of inserts SELECT COUNT(*) as Inserted, 0 as Updated FROM @tableVar WHERE MergeAction = 'INSERT' UNION ALL -- Count the number of updates SELECT 0 as Inserted, COUNT(*) as Updated FROM @tableVar WHERE MergeAction = 'UPDATE' ) as CountTable;
5) Execute SQL Task - Result Set
Go to the Result Set pane and select the insert and update variables for storing the counts.
Result Set pane |
6) The Result
For testing purposes I added a ridiculous number of clients in my text file and instead of logging to a table I added a MessageBox in a Script Task. My 4 year old, 4GB machine did quite well with the MERGE statement (around 30 second), but the SCD hadn't processed any records after two hours.
The result |
Hi Geplaatst,
ReplyDeleteIs there any way to know which column has been updated by the Script task? Could you please post the code which you have used in the task.
Thanks
Arun Sasi
Not sure to which Script Task you're referring, but you could create one WHEN MATCHED AND NOT (x=x) for each column, but could be a bit painful if there are a lot of columns.
DeleteUnfortunately, T-SQL does not allow for more than one WHEN MATCHED-clauses in a MERGE-statement, unless you issue one UPDATE and one DELETE action, in which case you are allowed up to two WHEN MATCHED-clauses. See http://technet.microsoft.com/en-us/library/bb510625.aspx
Delete@Daniel: thanks for the info
Delete