Friday, 7 January 2011

Slowly Changing Dimension Alternatives

Case
The Slowly Changing Dimension component works fine for small dimensions, but performance is very poor for slightly bigger dimensions. Is there an alternative?

Solution
There is a request on the Microsoft site to improve SCD in SQL 2012 which you can support, but intil then... some alternatives:

A) There is an open source project named SSIS Dimension Merge CSD Component which performs better (100x faster according the developers).

B) There is a third party table difference component that does the same trick, but faster. Downside is the yearly $400,-. An other (free) third party component is the Konesans Checksum Transformation which you can use to accomplish a SCD. And we made our own Checksum Transformation.

C) There is the TSQL Merge statement which has been added to Sql Server 2008. It Performs insert, update, or delete operations on a target table based on the results of a join with a source table.

D) And you can use the standard ssis components (lookups) to accomplish a faster SCD.

Lets elaborate option D to see the difference between the standard Slowly Changing Dimension Component and the lookups. Case: 15481 clients in a dimension table. Process same clients again with 100 new clients and 128 changed.

Slowly Changing Dimension Component solution
The standard solution with the Slowly Changing Dimension component looks something like this and takes about 25 seconds on average to complete on my laptop.
Slowly Changing Dimension component













Lookups solution
The solution with two lookups looks something like this and takes less than 1,5 seconds on average to complete. The first lookup only compares the business key. If the business key from the source doesn't exist in the dimension table, then it's a new record/client. All other records go to the second lookup which compares all attributes (incl the business key) to the dimension table. When no match is found, the client has been updated.
Two lookups



















This performs very well if there are not too many updates. If I change the number of updates from 128 to 15422 records, the whole proces takes around two minutes with this method, which is only slightly faster than the standard SCD component.

Alternative for lot's of updates
An other alternative is to change the Update Clients to an insert statement which inserts all updated records into a second table. A lot of single update statements slow down the process. A fastload insert takes only a fragment of the time.
Change update to insert (into an other table)






















After that you can use a batch-update to update all changed clients.
Execute SQL Statement




















This alternative takes only six seconds in total. The batch-update query looks something like this.
--Batch update
UPDATE  [Dim_Client]
SET     [Dim_Client].Title          = [Dim_Client2].Title,
        [Dim_Client].FirstName      = [Dim_Client2].FirstName,
        [Dim_Client].LastName       = [Dim_Client2].LastName
FROM    [Dim_Client], [Dim_Client2]
WHERE   [Dim_Client].BusinessKey    = [Dim_Client2].BusinessKey

So there are alternatives for the standard SCD component. Which is best? That depends on the situation (are third party/open source dll's alowed on the production servers, how many updates do you expect, etc.). Let me know if you have an other alternative...

* Update: TSQL Merge added (suggested by Koen)*

2 comments:

  1. Is there a reason why you do a Lookup "Lookup other attributes" before doing the update (or insert to temp client table).

    If the attributes are different but the Businesskey exists wouldn't the Lookup fail for some of the updated records and get redirected to the Error Output?

    Just wondering...

    ReplyDelete
  2. @Joshuarn: the second lookup is to prevent unnecessary updates. If the lookup (2) doesn't fail, then the attributes didn't change. That's faster...

    Of course it can become slow for very large datasets (millions of rows and dozens columns). In that case you should test which is faster: leaving out the second lookup, or lot's of unnecessary updates.

    ReplyDelete

Please use the SSIS MSDN forum for general SSIS questions that are not about this post. I'm a regular reader of that forum and will gladly answer those questions over there.

All comments are moderated manually to prevent spam.