
Friday, 27 May 2016

TSQL Merge for Slowly Changing Dimension or Persistent Staging Area

Case
What is a fast way to load a Slowly Changing Dimension or a Persistent Staging Area in SSIS? When using a Data Flow Task for this, the process can become very slow when there are a lot of updates.

Solution
Instead of updating records one by one with an OLE DB Command, you can load all changed records into a temporary table and then use a single batch update statement to update all records in the target table with the values from that temporary table.
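A minimal sketch of that batch update pattern (the temporary table and column names are just examples; the temporary table would be filled in the Data Flow Task):
--Batch update from a temporary table (example names)
UPDATE  T
SET     T.FirstName = U.FirstName
,       T.LastName  = U.LastName
,       T.Salary    = U.Salary
FROM    dbo.DimEmployee as T
INNER JOIN dbo.tmpEmployeeUpdates as U
        ON U.EmployeeNumber = T.EmployeeNumber;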

An even fancier way is to use the T-SQL MERGE statement. This statement has one downside: when there is a match it can only update the existing record; it cannot close the old record and insert a new version at the same time. You can overcome this by using the output of the MERGE statement.

The MERGE statement updates the existing record in the destination table, but it can output the old version of that record. You can then use this output to insert the old version into the destination table as a closed record.


First create a source and destination table for testing purposes:
-- Drop if exist
IF OBJECT_ID('dbo.Employees', 'U') IS NOT NULL 
    DROP TABLE dbo.Employees; 

-- Create source table
CREATE TABLE [dbo].[Employees](
 [EmployeeNumber] [varchar](5) NULL,
 [FirstName] [varchar](50) NULL,
 [LastName] [varchar](50) NULL,
 [DateOfBirth] [date] NULL,
 [Salary] [money] NULL
);

-- Insert test records in source
INSERT [dbo].[Employees] ([EmployeeNumber], [FirstName], [LastName], [DateOfBirth], [Salary]) VALUES (N'00001', N'John',  N'Williams', CAST(N'1972-02-15' AS Date), 5100.00);
INSERT [dbo].[Employees] ([EmployeeNumber], [FirstName], [LastName], [DateOfBirth], [Salary]) VALUES (N'00002', N'Jane',  N'Smith',    CAST(N'1965-09-02' AS Date), 4900.00);
INSERT [dbo].[Employees] ([EmployeeNumber], [FirstName], [LastName], [DateOfBirth], [Salary]) VALUES (N'00003', N'Marc',  N'Brown',    CAST(N'1981-12-01' AS Date), 3300.00);
INSERT [dbo].[Employees] ([EmployeeNumber], [FirstName], [LastName], [DateOfBirth], [Salary]) VALUES (N'00004', N'David', N'Garcia',   CAST(N'1975-01-01' AS Date), 3700.00);

-- Drop if exist
IF OBJECT_ID('dbo.DimEmployee', 'U') IS NOT NULL 
    DROP TABLE dbo.DimEmployee;

-- Create destination table
CREATE TABLE [dbo].[DimEmployee](
 [EmployeeID] [int] IDENTITY(1,1) NOT NULL,
 [EmployeeNumber] [varchar](5) NULL,
 [FirstName] [varchar](50) NULL,
 [LastName] [varchar](50) NULL,
 [DateOfBirth] [date] NULL,
 [Salary] [money] NULL,
 [Active] [bit] NULL,
 [DateFrom] [datetime] NULL,
 [DateEnd] [datetime] NULL
);


The destination table is a Slowly Changing Dimension, but it could also be a Persistent Staging Area. It has the same columns as the source with a few extra columns:
  1. EmployeeID: the dimension ID, populated by the identity setting. You can skip this column for a Persistent Staging Area.
  2. Active: a Boolean to quickly filter the active records. It's a bit redundant, but also easy.
  3. DateFrom: the datetime that indicates when this record was inserted. It never changes.
  4. DateEnd: the datetime that indicates when this record was inactivated. I use NULL for active records, but you could also use a future date like '9999-12-31'.

And now the MERGE script. Below I describe each section that starts with a numbered comment: /***** 1 *****/.
--Merge script
/***** 6b *****/
INSERT INTO DimEmployee 
   (EmployeeNumber
   , FirstName
   , LastName
   , DateOfBirth
   , Salary
   , Active
   , DateFrom
   , DateEnd)
SELECT  MergeOutput.EmployeeNumber
,   MergeOutput.FirstName
,   MergeOutput.LastName
,   MergeOutput.DateOfBirth
,   MergeOutput.Salary
,   0         -- InActivate the record
,   MergeOutput.DateFrom    -- Keep the old from date
,   GETDATE()       -- Close the record
FROM (
  /***** 1 *****/
  MERGE DimEmployee as T     -- Target
  USING Employees as S     -- Source
  ON T.EmployeeNumber = S.EmployeeNumber -- Compare key
  AND T.Active = 1      -- Only compare open records
  /***** 2 *****/
  WHEN NOT MATCHED BY TARGET THEN   -- Not found in destination
   INSERT 
   ( EmployeeNumber
   , FirstName
   , LastName
   , DateOfBirth
   , Salary
   , Active
   , DateFrom)
   VALUES
   ( S.EmployeeNumber
   , S.FirstName
   , S.LastName
   , S.DateOfBirth
   , S.Salary
   , 1         -- Activate the record
   , GETDATE())      -- Open the record
  /***** 3 *****/
  WHEN NOT MATCHED BY SOURCE    -- Not found in source
   AND T.Active = 1 THEN    -- Only compare open records
   UPDATE 
   SET T.Active  = 0    -- Inactivate record
   , T.DateEnd  = GETDATE()  -- Close date
  /***** 4 *****/
  WHEN MATCHED       -- Found in source and destination
   AND T.Active = 1     -- Only compare open records
   AND EXISTS (SELECT S.FirstName
      ,  S.LastName
      ,  S.DateOfBirth
      ,  S.Salary
      EXCEPT
      SELECT T.FirstName
      ,  T.LastName
      ,  T.DateOfBirth
      ,  T.Salary) THEN
   UPDATE
   SET T.FirstName  = S.FirstName
   , T.LastName  = S.LastName
   , T.DateOfBirth = S.DateOfBirth
   , T.Salary  = S.Salary
   , T.Active  = 1    -- Make record active
   , T.DateFrom  = GETDATE()  -- Open record with current datetime
   , T.DateEnd  = null   -- Keep record open
  /***** 5 *****/
  OUTPUT $action as MergeAction, Deleted.*, Inserted.Active as NewActiveCheck
  /***** 6a *****/
  ) as MergeOutput WHERE MergeAction = 'UPDATE' and NewActiveCheck = 1;


  1. In this section you provide the names of the target and source tables and the key used to compare their records. I also added a filter on Active to only compare open records. You could replace it with T.DateEnd IS NULL.
  2. This section handles new records. Source records that are not found in the destination are inserted with Active set to 1 (true), DateFrom set to the current datetime and DateEnd left NULL.
  3. This section handles deleted records. Active destination records that are not found in the source are deactivated and closed by setting DateEnd. The other columns remain unchanged.
  4. This section updates active records with new values. To prevent unnecessary updates I added an EXISTS-EXCEPT construct. This is a very handy way to compare all (non-key) columns for changes and, above all, it even compares NULL values correctly.
  5. This is the last part of the MERGE statement and it can output the old and new values of deleted, updated and inserted records. In this case I'm interested in the old values of the changed records (Deleted.* or Deleted.column1, Deleted.column2, etc.). I also output the Active column of the new record to filter out inactivated records (records deleted from the source shouldn't be inserted again). The $action column indicates whether the row came from an 'INSERT', 'UPDATE' or 'DELETE'.
  6. In 6a I filter on the action to keep only the old values of the updated records. In 6b I insert a new record with the old values of each changed record. I inactivate that record and set its DateEnd to close it. The other columns remain unchanged.


Testing the script:
One update

Second test:
One update, one delete and one insert

I use this script primarily for the Persistent Staging Area. When you want to use it for an SCD, you have to reload the fact table because the dimension ID changes: a fact pointing to ID 4 with David's old salary now points to the record with David's new salary.
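A possible repair step is to re-resolve the dimension ID of every fact row by business key and fact date. This is only a sketch; the fact table and its columns are hypothetical:
--Re-resolve dimension IDs (hypothetical fact table)
UPDATE  F
SET     F.EmployeeID = D.EmployeeID
FROM    dbo.FactSalary as F
INNER JOIN dbo.DimEmployee as D
        ON  D.EmployeeNumber = F.EmployeeNumber
        AND F.FactDate >= D.DateFrom
        AND (F.FactDate < D.DateEnd OR D.DateEnd IS NULL);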

The alternative script below could be a solution for that. Instead of comparing the key columns, I compare the CHECKSUM (or HASHBYTES) of all columns and remove the WHEN MATCHED part (if the checksum matches, we don't have to do anything). The benefit is that the dimension ID never changes. A second benefit is that you don't need to know the key columns. The downsides: CHECKSUM values are not guaranteed to be unique, the HASHBYTES input can be at most 8000 bytes, and the concatenation can't handle NULL values. So the script below is NOT yet foolproof!!! I will work on that, but let me know if you have a solution.
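For the NULL problem, one possible direction is CONCAT, which treats NULL as an empty string. This is only a sketch (it assumes SQL Server 2012 or later for CONCAT and doesn't solve the 8000 bytes limit or possible collisions):
--NULL-safe row hash (sketch, assumes SQL Server 2012+)
SELECT  HASHBYTES('MD5',
        CONCAT(EmployeeNumber, '|', FirstName, '|', LastName, '|',
               CONVERT(varchar(10), DateOfBirth, 120), '|',
               CONVERT(varchar(20), Salary))) as RowHash
FROM    dbo.Employees;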


--Alternative Merge script with checksum or hashbytes
/***** 6b *****/
INSERT INTO DimEmployee 
   ( EmployeeNumber
   , FirstName
   , LastName
   , DateOfBirth
   , Salary
   , Active
   , DateFrom
   , DateEnd)
SELECT  MergeOutput.EmployeeNumber
,   MergeOutput.FirstName
,   MergeOutput.LastName
,   MergeOutput.DateOfBirth
,   MergeOutput.Salary
,   0         -- InActivate the record
,   MergeOutput.DateFrom    -- Keep the old from date
,   GETDATE()       -- Close the record
FROM (
  /***** 1 *****/
  MERGE DimEmployee as T     -- Target
  USING Employees as S     -- Source
  ON    CHECKSUM(S.EmployeeNumber + '|' + S.FirstName + '|' + S.LastName + '|' + CAST(S.DateOfBirth as varchar(10)) + '|' + CAST(S.Salary as varchar(20))) = 
        CHECKSUM(T.EmployeeNumber + '|' + T.FirstName + '|' + T.LastName + '|' + CAST(T.DateOfBirth as varchar(10)) + '|' + CAST(T.Salary as varchar(20)))
  --ON    HASHBYTES('MD5', S.EmployeeNumber + '|' + S.FirstName + '|' + S.LastName + '|' + CAST(S.DateOfBirth as varchar(10)) + '|' + CAST(S.Salary as varchar(20))) = 
  --      HASHBYTES('MD5', T.EmployeeNumber + '|' + T.FirstName + '|' + T.LastName + '|' + CAST(T.DateOfBirth as varchar(10)) + '|' + CAST(T.Salary as varchar(20)))

  AND T.Active = 1      -- Only compare open records
  /***** 2 *****/
  WHEN NOT MATCHED BY TARGET THEN   -- Not found in destination
   INSERT 
   ( EmployeeNumber
   , FirstName
   , LastName
   , DateOfBirth
   , Salary
   , Active
   , DateFrom)
   VALUES
   ( S.EmployeeNumber
   , S.FirstName
   , S.LastName
   , S.DateOfBirth
   , S.Salary
   , 1         -- Activate the record
   , GETDATE())      -- Open the record
  /***** 3 *****/
  WHEN NOT MATCHED BY SOURCE    -- Not found in source
   AND T.Active = 1 THEN    -- Only compare open records
   UPDATE 
   SET T.Active  = 0    -- Inactivate record
   , T.DateEnd  = GETDATE()  -- Close date
  /***** 4 *****/
  /***** REMOVED *****/
  /***** 5 *****/
  OUTPUT $action as MergeAction, Deleted.*, Inserted.Active as NewActiveCheck
  /***** 6a *****/
  ) as MergeOutput WHERE MergeAction = 'UPDATE' and NewActiveCheck = 1;


Testing the script:
One update

Second test:
One update, one delete and one insert

Friday, 1 February 2013

T-SQL Merge in SSIS as SCD alternative

Case
About two years ago I did a post about alternatives for the poorly performing SCD Transformation in SSIS. And even in SSIS 2012 it is still unusably slow for larger data sets. In that post I showed a couple of alternatives, but I didn't work out the T-SQL Merge solution.

Solution
A small recap of the alternatives in that post:
  1. The SSIS Dimension Merge SCD component from Pragmatic Works, which is, according to the developers, about 100 times faster.
  2. Various third party components that can do the trick, like the CozyRoc Table Difference component ($), the Pragmatic Works Upsert Destination ($), the Konesans Checksum Transformation and our own Checksum Transformation.
  3. The standard Lookup Transformation solution, which works quite well for 2005.
  4. And if you have SSIS 2008 or later, you can also use the T-SQL MERGE statement. Let's elaborate on that solution.
Before we start: there is also a MERGE Destination that uses a T-SQL MERGE statement, but it is still beta (since 2008) and the reviews aren't that good. I did a small test myself, but stopped when I got some meaningless error messages.

T-SQL Merge solution
For this example I use a text file with client information that I want to use to populate my Client dimension. To keep it simple I just use four columns: BusinessKey, Title, FirstName and LastName.
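The post doesn't show the DDL of the two tables, so below is a minimal sketch of what the MERGE query in step 4 expects. The column types and the ClientID identity column are assumptions:
--Example staging and dimension tables (types are assumptions)
CREATE TABLE staging.Clients(
 [BusinessKey] [varchar](20) NULL,
 [Title] [varchar](10) NULL,
 [FirstName] [varchar](50) NULL,
 [LastName] [varchar](50) NULL
);

CREATE TABLE dwh.Dim_Client(
 [ClientID] [int] IDENTITY(1,1) NOT NULL,
 [BusinessKey] [varchar](20) NULL,
 [Title] [varchar](10) NULL,
 [FirstName] [varchar](50) NULL,
 [LastName] [varchar](50) NULL
);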


1) Variables
For logging purposes we need to add three integer variables: staged, inserts, updates:
Variables for logging

2) Staging
The next step is to load the flat file with clients into a staging table with a Data Flow Task. The staging table is truncated before loading and the number of records passing through the Data Flow Task is stored in the variable staged with a Row Count Transformation. It should look something like this:

3) Execute SQL Task
Now we have two tables (staging and dimension) and we need to merge them:
Staging and dimension

For this we need to add an Execute SQL Task.
Execute SQL Task

4) Execute SQL Task - General
Edit the task and set the ResultSet to Single row, select the OLE DB connection and enter the MERGE query. The MERGE query below does INSERTS and UPDATES only. We don't delete records from the dimension table. For logging purposes I added a count of the number of inserts and updates.
General pane

The query to use:
-- Create a table variable to store the number of updates and inserts
DECLARE @tableVar TABLE (MergeAction VARCHAR(20))

-- The actual MERGE statement
MERGE  dwh.Dim_Client as MyTarget
USING  (
   SELECT BusinessKey
   ,  Title
   ,  FirstName
   ,  LastName
   FROM staging.Clients
   ) as MySource
   ON MyTarget.BusinessKey = MySource.BusinessKey
   WHEN MATCHED AND NOT (
    MySource.Title   = ISNULL(MyTarget.Title, '')
    AND MySource.FirstName = ISNULL(MyTarget.FirstName, '')
    AND MySource.LastName = ISNULL(MyTarget.LastName, '')
   )
   THEN -- BusinessKey of client exists, but data is different
    UPDATE  
    Set MyTarget.Title  = MySource.Title
    , MyTarget.FirstName = MySource.FirstName
    , MyTarget.LastName = MySource.LastName
   WHEN NOT MATCHED BY TARGET
   THEN -- BusinessKey of client does not exist
    INSERT (BusinessKey, Title, FirstName, LastName)
    VALUES (MySource.BusinessKey, MySource.Title, MySource.FirstName, MySource.LastName)
OUTPUT $action INTO @tableVar;

-- Get the number of inserts and updates on one line
SELECT  SUM(Inserted) as Inserted
,   SUM(Updated) as Updated
FROM  (
   -- Count the number of inserts
   SELECT COUNT(*) as Inserted, 0 as Updated
   FROM @tableVar  
   WHERE MergeAction = 'INSERT'

   UNION ALL
   -- Count the number of updates
   SELECT 0 as Inserted, COUNT(*) as Updated
   FROM @tableVar  
   WHERE MergeAction = 'UPDATE'
   ) as CountTable;
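As a side note, the same counts can also be computed in a single pass with conditional aggregation over the table variable:
--Alternative count query (same @tableVar)
SELECT  COUNT(CASE WHEN MergeAction = 'INSERT' THEN 1 END) as Inserted
,       COUNT(CASE WHEN MergeAction = 'UPDATE' THEN 1 END) as Updated
FROM    @tableVar;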


5) Execute SQL Task - Result Set
Go to the Result Set pane and select the insert and update variables for storing the counts.
Result Set pane

6) The Result
For testing purposes I added a ridiculous number of clients to my text file and, instead of logging to a table, I added a MessageBox in a Script Task. My four year old 4GB machine handled the MERGE statement quite well (around 30 seconds), but the SCD component hadn't processed any records after two hours.
The result

Thursday, 13 January 2011

Lookup Slowly Changing Dimension type II

Case
I have a slowly changing dimension type II (with a start and end date). How do I create a working lookup in my dataflow?

Solution
There are roughly three options:
A) Override lookup query with custom query
B) Split dimension in separate days with datetime dimension
C) Split dimension in separate days in second table

Let’s say we have a fact for employee number 123456789 with fact date 12 December 2010 and we want to know the right EmployeeId.

Sample of Dim_Employee table:
EmployeeId  EmployeeNumber  Firstname  LastName  Department   Startdate   Enddate
1           123456789       John       Gilbert   DepartmentX  2004-01-01  2005-12-31
12          123456789       John       Gilbert   DepartmentZ  2006-01-01  2010-12-31
19          123456789       John       Gilbert   DepartmentA  2011-01-01  9999-12-31

Sample of the dataflow:
Partial dataflow

The lookup should return 12 as EmployeeId for this example.

Solution A
Override lookup query with custom query

A1) Query
Let’s have a closer look at the lookup query. Select the EmployeeId, EmployeeNumber, Startdate and Enddate from the employee dimension table (don't just select the whole table: see Performance Best Practices).
Lookup Query

A2) Select columns
Select the input columns to map to the lookup columns. Now you see the problem: the factdate can map to either the Startdate or the Enddate; there is no 'between' here. For now, map the EmployeeNumber and the Factdate according to the sample and select EmployeeId as a new column.
Column Mappings

A3) Custom Query
This step is a little different for SSIS 2005 and SSIS 2008 (and newer). Go to the Advanced tab.
SSIS 2008:
In SSIS 2008 this tab is completely disabled with the message: "This page is not used when Full cache mode is selected." To overcome this, go to the General tab and select Partial cache instead of Full cache. Now you can modify the SQL statement with the following query:
--Query with parameters
SELECT * 
FROM   (SELECT  EmployeeId
       ,        EmployeeNumber
       ,        StartDate
       ,        EndDate
       FROM     Dim_Employee) [refTable]
WHERE  [refTable].[EmployeeNumber] = ?
AND    [refTable].[StartDate] <= ?
AND    [refTable].[EndDate] > ?
Modify SQL Statement in SSIS 2008

SSIS 2005
For SSIS 2005: just Enable memory restriction and enter the query.
Modify SQL Statement in SSIS 2005

A4) Parameters
Now push the Parameters button on the Advanced tab to enter the mappings. Select FactDate (the date column in the source table) for both Parameter1 and Parameter2.
Parameters

A5) Result
Now your lookup is ready for testing.
The result: EmployeeId 12

The big downside of this method is the lack of caching: you cannot use full cache. It will work for a small number of records, but when the numbers grow it will completely slow down your dataflow.

Solution B
Split dimension in separate days with datetime dimension
This solution is only possible if you have a time dimension.

B1) Query
Join the employee dimension and the time dimension, using between logic in the ON clause. This will result in a row for every dimension member for each day. 
--Query with join
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
The new query, join with the time dimension

B2) Select columns
Select the input columns to map the lookup columns. Unlike option A, the mapping is easy.
Column mapping

B3) Result
Now you can test the dataflow and see that the result is similar to Solution A. But we have to narrow down the number of records to improve performance, because there are over 2500 records for this one employee.

There are a couple of options. If your source data contains only records for the current year, you could change the query to:
--Only current year
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
WHERE       YEAR(Dim_Time.Date) = YEAR(GETDATE())       

Or you can use a MIN(date) query on your source data and use that in the WHERE clause.
--Use minimum
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
WHERE       Dim_Time.Date >= (SELECT  MIN(FactDate)
                              FROM    YourStagingTable)

And there are probably some more advanced queries to narrow down the number of records.

Solution C
Split dimension in separate days in second table.

If you don't have a time dimension table, you can use a script (or query) to split all dimension records into separate days and copy those to a second table, as sketched below. Then use that second table in your lookup. And of course, try to narrow down the number of records just like in Solution B.
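A minimal sketch of such a query, assuming a helper numbers table dbo.Numbers with a column n (0, 1, 2, ...) that covers the longest validity period:
--Split dimension records into one record per day (sketch)
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           DATEADD(DAY, Numbers.n, Dim_Employee.StartDate) as [Date]
INTO        Dim_Employee_Days
FROM        Dim_Employee
INNER JOIN  dbo.Numbers as Numbers
            ON Numbers.n <= DATEDIFF(DAY, Dim_Employee.StartDate, Dim_Employee.EndDate)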

Conclusion
Every solution has its pros and cons. The best solution for you depends on a number of things, such as the number of records in your dimension and the date spread in the fact records. Test it! Let me know if you found another solution for your SCD Type II dimension lookup.

Alternatives without the Lookup Transformation
For large volumes of data there are a couple of alternatives:
1) Use a Merge Join Transformation instead of a lookup (join without the dates) and add a Conditional Split Transformation behind it that checks whether the date of the fact table is between the two dates of the dimension table.
2) The fastest option is to use a source query in an OLE DB Source Component and do the between query of Solution A in SQL Server, as sketched below.
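A sketch of that source query (the fact staging table and its columns are examples):
--Between join in the source query (example names)
SELECT      FactStaging.EmployeeNumber
,           FactStaging.FactDate
,           Dim_Employee.EmployeeId
FROM        FactStaging
INNER JOIN  Dim_Employee
            ON  Dim_Employee.EmployeeNumber = FactStaging.EmployeeNumber
            AND FactStaging.FactDate
                BETWEEN Dim_Employee.StartDate
                AND Dim_Employee.EndDate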

* Update *
Also see this Script Component solution from Matt Masson.

Friday, 7 January 2011

Slowly Changing Dimension Alternatives

Case
The Slowly Changing Dimension component works fine for small dimensions, but performance is very poor for slightly bigger dimensions. Is there an alternative?

Solution
There is a request on the Microsoft site to improve the SCD in SQL 2012 which you can support, but until then... some alternatives:

A) There is an open source project named SSIS Dimension Merge SCD Component which performs better (100x faster according to the developers).

B) There is a third party Table Difference component that does the same trick, but faster. Downside is the yearly $400. Another (free) third party component is the Konesans Checksum Transformation, which you can use to accomplish an SCD. And we made our own Checksum Transformation.

C) There is the T-SQL MERGE statement, which was added in SQL Server 2008. It performs insert, update or delete operations on a target table based on the results of a join with a source table.

D) And you can use the standard SSIS components (lookups) to accomplish a faster SCD.

Let's elaborate on option D to see the difference between the standard Slowly Changing Dimension component and the lookups. Case: 15481 clients in a dimension table. Process the same clients again, with 100 new clients and 128 changed ones.

Slowly Changing Dimension Component solution
The standard solution with the Slowly Changing Dimension component looks something like this and takes about 25 seconds on average to complete on my laptop.
Slowly Changing Dimension component

Lookups solution
The solution with two lookups looks something like this and takes less than 1.5 seconds on average to complete. The first lookup only compares the business key. If the business key from the source doesn't exist in the dimension table, then it's a new record/client. All other records go to the second lookup, which compares all attributes (including the business key) to the dimension table. When no match is found, the client has been updated.
Two lookups

This performs very well if there are not too many updates. If I change the number of updates from 128 to 15422 records, the whole process takes around two minutes with this method, which is only slightly faster than the standard SCD component.

Alternative for lots of updates
Another alternative is to change the Update Clients command into an insert statement that inserts all updated records into a second table. A lot of single update statements slow down the process; a fast load insert takes only a fraction of the time.
Change update to insert (into another table)

After that you can use a batch-update to update all changed clients.
Execute SQL Statement

This alternative takes only six seconds in total. The batch-update query looks something like this.
--Batch update
UPDATE  [Dim_Client]
SET     [Dim_Client].Title          = [Dim_Client2].Title,
        [Dim_Client].FirstName      = [Dim_Client2].FirstName,
        [Dim_Client].LastName       = [Dim_Client2].LastName
FROM    [Dim_Client], [Dim_Client2]
WHERE   [Dim_Client].BusinessKey    = [Dim_Client2].BusinessKey

So there are alternatives for the standard SCD component. Which is best? That depends on the situation (are third party/open source DLLs allowed on the production servers, how many updates do you expect, etc.). Let me know if you have another alternative...

* Update: TSQL Merge added (suggested by Koen)*

Saturday, 1 January 2011

Microsoft SQL Server 2012 improvement suggestions

Recently CTP 1 of SQL 2012 (code-named 'Denali') was released. Although there are a lot of very nice changes, there is still room for improvement. Two good initiatives from BI Monkey to improve SQL 2012: