
Thursday, 13 January 2011

Lookup Slowly Changing Dimension type II

Case
I have a slowly changing dimension type II (with a start and end date). How do I create a working lookup in my dataflow?

Solution
There are roughly three options:
A) Override lookup query with custom query
B) Split dimension in separate days with datetime dimension
C) Split dimension in separate days in second table

Let’s say we have a fact for employee number 123456789 with fact date 12 December 2010 and we want to know the right EmployeeId.

Sample of Dim_Employee table:
EmployeeId  EmployeeNumber  FirstName  LastName  Department   StartDate   EndDate
1           123456789       John       Gilbert   DepartmentX  2004-01-01  2005-12-31
12          123456789       John       Gilbert   DepartmentZ  2006-01-01  2010-12-31
19          123456789       John       Gilbert   DepartmentA  2011-01-01  9999-12-31

Sample of the dataflow:
(screenshot: Partial dataflow)

The lookup should return 12 as EmployeeId for this example.

Solution A
Override lookup query with custom query

A1) Query
Let’s have a closer look at the lookup query. Select the EmployeeId, EmployeeNumber, StartDate and EndDate from the employee dimension table (don't just select the whole table: see Performance Best Practices).
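In SQL that query looks something like this:
--Lookup query, select only the columns you need
SELECT  EmployeeId
,       EmployeeNumber
,       StartDate
,       EndDate
FROM    Dim_Employee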
(screenshot: Lookup Query)

A2) Select columns
Select the input columns to map to the lookup columns. Now you see the problem: the FactDate can be mapped to either the StartDate or the EndDate; there is no 'between' option here. For now, map the EmployeeNumber and the FactDate according to the sample and select EmployeeId as a new column.
(screenshot: Column Mappings)

A3) Custom Query
This step is a little different for SSIS 2005 and SSIS 2008 (and newer). Go to the Advanced tab.
SSIS 2008:
In SSIS 2008 this tab is completely disabled, with the message: "This page is not used when Full cache mode is selected." To overcome this, go to the General tab and select Partial cache instead of Full cache. Now you can modify the SQL statement with the following query:
--Query with parameters
SELECT * 
FROM   (SELECT  EmployeeId
       ,        EmployeeNumber
       ,        StartDate
       ,        EndDate
       FROM     Dim_Employee) [refTable]
WHERE  [refTable].[EmployeeNumber] = ?
AND    [refTable].[StartDate] <= ?
AND    [refTable].[EndDate] > ?
(screenshot: Modify SQL Statement in SSIS 2008)

SSIS 2005
In SSIS 2005, simply enable memory restriction and enter the query.
(screenshot: Modify SQL Statement in SSIS 2005)

A4) Parameters
Now push the Parameters button on the Advanced tab to enter the mappings. Select FactDate (the date column in the source table) for both Parameter1 and Parameter2.
(screenshot: Parameters)

A5) Result
Now your lookup is ready for testing.
The result: EmployeeId 12

The big downside of this method is the lack of caching: you cannot use full cache. It will work for a small number of records, but when the numbers grow, it will completely slow down your dataflow.

Solution B
Split dimension in separate days with datetime dimension
This solution is only possible if you have a time dimension.

B1) Query
Join the employee dimension and the time dimension, using between logic in the ON clause. This will result in a row for every dimension member for each day. 
--Query with join
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
(screenshot: The new query, join with the time dimension)

B2) Select columns
Select the input columns to map the lookup columns. Unlike option A, the mapping is easy.
(screenshot: Column mapping)

B3) Result
Now you can test the dataflow and see that the result is similar to Solution A. But we have to narrow down the number of records to improve performance, because there are over 2500 records for this one employee.

There are a couple of options. If your source data contains only records for the current year, you could change the query to:
--Only current year
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
WHERE       YEAR(Dim_Time.Date) = YEAR(GETDATE())       

Or you can use a MIN(date) query on your source data and use that in the where clause.
--Use minimum
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
WHERE       Dim_Time.Date >= (SELECT  MIN(FactDate)
                              FROM    YourStagingTable)

And there are probably some more advanced queries to narrow down the number of records.

Solution C
Split dimension in separate days in second table.

If you don't have a time dimension table, you can use a script (or query) to split all dimension members into separate days and copy those to a second table. Then use that second table in your lookup. And of course, try to narrow down the number of records just like in Solution B.
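A sketch of such a query with a recursive CTE, reusing the example table and column names (the second table is here called Dim_Employee_Day; the CASE caps the open-ended 9999-12-31 rows at today, otherwise the recursion would generate millions of rows):
--Sketch: explode each dimension row into one row per day
;WITH Days AS
(
    SELECT  EmployeeId
    ,       EmployeeNumber
    ,       StartDate AS [Date]
    ,       CASE WHEN EndDate > GETDATE() THEN GETDATE() ELSE EndDate END AS EndDate
    FROM    Dim_Employee
    UNION ALL
    SELECT  EmployeeId
    ,       EmployeeNumber
    ,       DATEADD(DAY, 1, [Date])
    ,       EndDate
    FROM    Days
    WHERE   [Date] < EndDate
)
SELECT  EmployeeId
,       EmployeeNumber
,       [Date]
INTO    Dim_Employee_Day
FROM    Days
OPTION  (MAXRECURSION 0)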

Conclusion
Every solution has its pros and cons. The best solution for you depends on a number of things, such as the number of records in your dimension and the date spread in the fact records. Test it! Let me know if you have found another solution for your SCD Type II dimension lookup.

Alternatives without the Lookup Transformation
For large volumes of data there are a couple of alternatives:
1) Use a Merge Join Transformation instead of a lookup (join without the dates) and add a Conditional Split Transformation behind it that checks whether the date of the fact table is between the two dates of the dimension table.
2) The fastest option is to use a source query in an OLE DB Source Component and do the between query of Solution A in SQL Server (see the sketch below).
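For option 2 the source query could look something like this (YourStagingTable stands in for wherever your fact records come from; the LEFT JOIN keeps facts without a matching dimension row, similar to ignoring lookup failures):
--Sketch: do the between lookup as a join in the source query
SELECT      f.FactDate
,           f.EmployeeNumber
,           e.EmployeeId
FROM        YourStagingTable f
LEFT JOIN   Dim_Employee e
            ON  e.EmployeeNumber = f.EmployeeNumber
            AND f.FactDate >= e.StartDate
            AND f.FactDate <  e.EndDate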

* Update *
Also see this Script Component solution from Matt Masson.

Wednesday, 22 December 2010

Performance Best Practices

Case
A client of mine had some performance issues with a couple of SSIS packages, and because they lacked basic SSIS knowledge, they just upgraded their server with more memory. Finally, after 32GB of memory, they stopped upgrading and started reviewing their packages.

Solution
There are a lot of blogs about SSIS Best Practices (for instance: SSIS Junkie). Here is a top 10 of easy-to-implement but very effective ones that I showed them, to 'upgrade' their packages instead of the memory.

1) Unnecessary columns
Select only the columns that you need in the pipeline to reduce buffer size and reduce OnWarning events at execution time. SSIS even helps you by showing the unnecessary ones in the Progress/Execution Results Tab: [DTS.Pipeline] Warning: The output column "Address1" (16161) on output "Output0" (16155) and component "CRM clients" (16139) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.
(screenshot: Unnecessary columns from a flat file)

2) Use queries instead of tables
Following on from unnecessary columns: always use a SQL statement in an OLE DB Source component or (Fuzzy) Lookup component rather than just selecting a table. Selecting a table is akin to "SELECT *...", which is universally recognised as bad practice.
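In its simplest form the difference looks like this:
--Don't: this is the equivalent of selecting the whole table
SELECT * FROM Dim_Employee

--Do: select only the columns the dataflow really needs
SELECT  EmployeeId
,       EmployeeNumber
FROM    Dim_Employee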
(screenshot: OLE DB Source, use SQL Command instead of Table)
(screenshot: Lookup, use SQL Command instead of Table)

3) Use caching in your LOOKUP
Make sure that the result of your lookup is unique, otherwise SSIS cannot cache the query and executes it for each record passing the lookup component. SSIS will warn you for this in the Progress/Execution Results Tab: [Lookup Time Dimension [605]] Warning: The component "Lookup Time Dimension" (605) encountered duplicate reference key values when caching reference data. This error occurs in Full Cache mode only. Either remove the duplicate key values, or change the cache mode to PARTIAL or NO_CACHE.

Watch out that you are not grabbing too many resources in the lookup; a couple of million records is probably not a good idea. And new in SSIS 2008 is that you can reuse your lookup cache in another lookup.
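If your reference data is not unique on the business key, a GROUP BY in the lookup query is one way to make it unique (the MAX here is just an arbitrary example rule; pick the row your dimension logic really needs):
--Sketch: make the reference set unique so Full cache works
SELECT   EmployeeNumber
,        MAX(EmployeeId) AS EmployeeId
FROM     Dim_Employee
GROUP BY EmployeeNumber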
(screenshot: SSIS 2008: Cache)

4) Filter in source
Where possible, filter your data in the Source Adapter rather than filtering it with a Conditional Split transformation. This will make your data flow perform quicker because the unnecessary records don't go through the pipeline.
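For example (the filter condition itself is just an illustration):
--Filter in the source query instead of a Conditional Split
SELECT  EmployeeId
,       EmployeeNumber
FROM    Dim_Employee
WHERE   EndDate = '9999-12-31'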
(screenshot: Filter in OLE DB Source, filter data in source)

5) Sort in source
A sort in SQL Server is faster than the sort in SSIS, partly because SSIS does the sort in memory. So it pays to move the sort to the source component (where possible). Note that you have to set IsSorted=TRUE on the source adapter output, but setting this value does not perform a sort operation; it only indicates that the data is sorted. After that, set the SortKeyPosition of all output columns that are sorted.
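A minimal example; after changing the query, set IsSorted and SortKeyPosition as described above:
--Sort in the source query instead of a Sort component
SELECT    EmployeeId
,         EmployeeNumber
FROM      Dim_Employee
ORDER BY  EmployeeNumber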
(screenshot: Advanced Editor for Source, sort data in source)

6) Join in source
Where possible, join data in the Source Adapter rather than using the Merge Join component. SQL Server does it faster than SSIS. But watch out that you are not making the queries too complex, because that will worsen readability.
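For example (Dim_Department is a hypothetical second table, just for illustration):
--Join in the source query instead of a Merge Join
SELECT      e.EmployeeId
,           e.EmployeeNumber
,           d.DepartmentName
FROM        Dim_Employee e
INNER JOIN  Dim_Department d
            ON d.DepartmentId = e.DepartmentId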

(screenshot: Unnecessary Join and Sorts)

7) Group in source
Where possible, aggregate your data in the Source Adapter rather than using the Aggregate component. SQL Server does it faster than SSIS.
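For example (FactSales and Amount are hypothetical names, just for illustration):
--Aggregate in the source query instead of an Aggregate component
SELECT    EmployeeId
,         SUM(Amount) AS TotalAmount
FROM      FactSales
GROUP BY  EmployeeId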

(screenshot: Unnecessary Sorts, Join and Aggregate)

8) Beware of Non-blocking, Semi-blocking and Fully-blocking components in general
The dataflow consists of three types of transformations: non-blocking, semi-blocking and fully-blocking. As the names suggest, use semi-blocking and fully-blocking components sparingly to optimize your packages. Jorg Klein has written an interesting article about it, with a list of which component is non-, semi- or fully-blocking.

A summary of how to recognize these three types:

                                         Non-blocking  Semi-blocking  Fully-blocking
Synchronous/asynchronous                 Synchronous   Asynchronous   Asynchronous
Number of rows in equals rows out        True          Usually False  Usually False
Collects all input before it can output  False         False          True
New buffer created?                      False         True           True
New thread created?                      False         Usually True   True
Find more information about (a)synchronous at Microsoft.


9) High Volumes of Data and indexes
Loading high volumes of data into a table with clustered and non-clustered indexes can take a lot of time. The most important thing to verify is whether all indexes are really used. SQL Server 2005 and 2008 provide information about index usage with two views: sys.dm_db_index_operational_stats and sys.dm_db_index_usage_stats. Drop all rarely used and unused indexes first; experience teaches that there are often a lot of unnecessary indexes. If you are absolutely sure that all remaining indexes are necessary, you can drop all indexes before loading the data and recreate them afterwards. The performance gain depends on the number of records: the more records, the bigger the gain.
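A starting point for finding those rarely used indexes could be a query like this against sys.dm_db_index_usage_stats (a sketch; note that these statistics are reset when SQL Server restarts, so judge the numbers with that in mind):
--Sketch: compare reads and writes per index in the current database
SELECT    OBJECT_NAME(s.[object_id])                    AS TableName
,         i.[name]                                      AS IndexName
,         s.user_seeks + s.user_scans + s.user_lookups  AS Reads
,         s.user_updates                                AS Writes
FROM      sys.dm_db_index_usage_stats s
JOIN      sys.indexes i
          ON  i.[object_id] = s.[object_id]
          AND i.index_id    = s.index_id
WHERE     s.database_id = DB_ID()
ORDER BY  Reads ASC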
(screenshot: Drop and recreate indexes)

10) SQL Server Destination Adapter vs OLE DB Destination Adapter
If your target database is a local SQL Server database, the SQL Server Destination Adapter will perform much better than the OLE DB Destination Adapter. However, the SQL Server Destination Adapter only works on a local machine and via Windows security. You have to be absolutely sure that your database stays local in the future, otherwise your mapping will not work when the database is moved.


Note: this is not a complete list, just a top 10 of easy-to-implement but very effective practices. Tell me if you have items that should be in the top 10 of Performance Best Practices!

Note: Besides the Performance Best Practices there is also a Development Best Practice.