Tuesday 25 January 2011

Search for ColumnName

Case
I'm searching for the right column in a large Microsoft SQL database. Is there a way to find all columns with a certain name without manually browsing through all tables?

Solution
A colleague once gave me this handy query. It searches for column names in an entire Microsoft SQL database. An indispensable query when you have to do a little research while building SSIS packages.

--Search for column name in database
SELECT  tab.name as [Table/View]
,       CASE
            WHEN tab.xtype = 'V' THEN 'view'
            ELSE 'table'
        END as [Type]
,       col.name as [ColumnName]
,       typ.name as [Datatype]
FROM    dbo.syscolumns as col
,       dbo.sysobjects as tab
,       dbo.systypes as typ
WHERE   col.id = tab.id
AND     col.xtype = typ.xtype
AND     tab.xtype in ('V','U')
-- Pick one of these possibilities
--AND     col.name = 'version'
--AND     col.name like '%version%'

Select the right database and uncomment one of the filter lines.
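
The same information is also available from the ANSI INFORMATION_SCHEMA views, which work on SQL Server 2005 and later; a sketch, with the same kind of filter assumed:

```sql
--Search for column name via INFORMATION_SCHEMA
SELECT  TABLE_SCHEMA
,       TABLE_NAME
,       COLUMN_NAME
,       DATA_TYPE
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   COLUMN_NAME like '%version%'
```
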
[Screenshot: The result]

Sunday 23 January 2011

Cast and Convert to avoid the Data Conversion Transformation

Case
The Data Conversion transformation works fine for converting, for example, a non-Unicode string to a Unicode string, but it has one annoying feature: it creates an extra column (prefixed "Copy of "), which can be confusing. Is there an alternative?

Solution
There is, but only for database sources (not for flat files and the like). You can use the CAST or CONVERT statement in your source query for data conversions. This moves some of the workload from SSIS to your database engine.
[Screenshot: Data Conversion in source]

And of course it also works in a Lookup:

[Screenshot: Data Conversion in a Lookup]

--CAST (ANSI standard)
SELECT  CAST(someColumn AS nvarchar(50)) AS someColumn
FROM    someTable

--CONVERT (Microsoft specific)
SELECT  CONVERT(nvarchar(50), someColumn) AS someColumn
FROM    someTable

The difference between the two is that CAST is an ANSI standard while CONVERT is Microsoft specific, and CONVERT has an extra optional style argument. More information about that is available at MSDN.
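
To illustrate that style argument (the style numbers are documented on MSDN; example output shown for a date in January 2011):

```sql
--CONVERT with the optional style argument
SELECT CONVERT(varchar(10), GETDATE(), 105)  -- dd-mm-yyyy, e.g. 25-01-2011
SELECT CONVERT(varchar(10), GETDATE(), 120)  -- yyyy-mm-dd, e.g. 2011-01-25
```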

Performance
My simple test case processed a million records, converting a column once via a CAST in the source query and once via the standard SSIS component, and ran those two packages a thousand times each.
[Screenshot: Testing with a million records]

After a thousand runs the CAST averaged 4 seconds and the Data Conversion transformation averaged 5.3 seconds. A small difference on a simple test like this, which was not very thorough. But no annoying, unnecessary and confusing columns! :-)

What to do for the non-database sources?
Well, Todd McDermid developed a pattern to bypass the confusing extra columns. His workaround consists of a Data Conversion transformation immediately followed by a Union All that filters out the unnecessary columns and renames the converted ones. And he even developed an open source replacement for the Data Conversion transformation if the CAST or CONVERT option is not possible for your source. But be warned: these solutions will decrease performance because they are asynchronous.

You could also support this feedback at Microsoft.com to just hide unnecessary columns.

Thursday 20 January 2011

Foreach folder enumerator

Case
I want a Foreach Folder Enumerator, but the Foreach Loop component only loops through files.

Solution
Unfortunately this isn't supported by the standard Foreach Loop component, but there are a couple of workarounds:
- Use an open source component Directory Foreach Enumerator or my own Custom Foreach Folder Enumerator
- Use a Script task to accomplish this.

This example uses the Script task and a Foreach Loop to enumerate through the folders. The Script tasks generates a list of Directories and the Foreach Loop loops through that list.

1) Control Flow
Drag a Script task and a Foreach Loop container to the Control Flow like the image below.
[Screenshot: Control Flow]

2) Variables
Create three string variables:
  1. startlocation, which indicates the parent folder in which we will search for subfolders. Fill it with a path like "d:\foreachfoldertest\".
  2. xmldoc, which we will use for communication between the Script task and the Foreach Loop.
  3. folder, which is used for the variable mapping in the Foreach Loop container.
[Screenshot: Variables (right click in Control Flow)]

3) Script Task
Select startlocation as a ReadOnlyVariable (we only read the path inside it) and select xmldoc as a ReadWriteVariable (an XML string will be stored in it).
[Screenshot: Variables]

4) The Script
Copy the following script to your Script task
// C# code 
// This script loops through a folder
// and adds all subfolders to an xml string
using System;
using System.Data;
using System.IO;        // Added
using System.Xml;       // Added
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_5da96344f1c4411ab56207579f2e5e91.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {

        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        public void Main()
        {
            // Get all subfolders within the parent folder
            string[] subFolders = Directory.GetDirectories(Dts.Variables["User::startlocation"].Value.ToString());

            // Create variables for the XML string
            XmlDocument xmldoc = new XmlDocument();
            XmlElement xmlRootElem;
            XmlElement xmlChildElem;
            XmlText xmltext;

            // Add the root element: <ROOT>
            xmlRootElem = xmldoc.CreateElement("", "ROOT", "");

            // Variable for directory information
            DirectoryInfo directoryInfo;

            // Loop through the parent folder
            foreach (string currentFolder in subFolders)
            {
                // Fill directoryInfo variable with folder information
                directoryInfo = new DirectoryInfo(currentFolder);

                // Create the child element that contains the path:
                // <Folder>d:\foreachfoldertest\subfolder1\</Folder>
                xmlChildElem = xmldoc.CreateElement("", "Folder", "");
                xmltext = xmldoc.CreateTextNode(directoryInfo.FullName);
                xmlChildElem.AppendChild(xmltext);

                // Add the child element to the root element
                xmlRootElem.AppendChild(xmlChildElem);
            }
            // Add the root element to the xml document
            xmldoc.AppendChild(xmlRootElem);
            
            // Store the xml in the SSIS variable
            Dts.Variables["User::xmldoc"].Value = xmldoc.InnerXml;

            Dts.TaskResult = (int)ScriptResults.Success;
        }
    }
}

This script will result in:
<ROOT>
  <Folder>d:\foreachfoldertest\subfolder1</Folder>
  <Folder>d:\foreachfoldertest\subfolder2</Folder>
  <Folder>d:\foreachfoldertest\subfolder3</Folder>
</ROOT>

5) Foreach Loop Container
Edit the Foreach Loop Container and, on the Collection tab, change the enumerator to Foreach NodeList Enumerator, which loops through an XML string. Change the DocumentSourceType to Variable and select the xmldoc variable as the DocumentSource; this is the variable that contains the XML string from the Script task. Finally, enter "/ROOT/*" (without the quotes) as the OuterXPathString, which makes it enumerate the nodes within the root element.
[Screenshot: Foreach NodeList Enumerator]

6) Variable Mappings
Map the variable folder to Index 0. This will store the path in this variable.
[Screenshot: Variable Mappings]

7) Test the result
Now you can add your own tasks to the Foreach Loop container and use the variable folder to get the path. I added a simple Script Task for testing that shows the path in a MessageBox.
[Screenshot: Example with MessageBox]

8) More advanced script
You can even add a Traverse subfolders option with this recursive method.
// C# code
// This script loops through a folder
// and adds all subfolders to an xml string
using System;
using System.Data;
using System.IO;        // Added
using System.Xml;       // Added
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_5da96344f1c4411ab56207579f2e5e91.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {

        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        // Variables for the xml string
        private XmlDocument xmldoc;
        private XmlElement xmlRootElem;

        public void Main()
        {
            // Initialize XMLdoc
            xmldoc = new XmlDocument();

            // Add the root element: <ROOT>
            xmlRootElem = xmldoc.CreateElement("", "ROOT", "");

            // Add Subfolders as Child elements to the root element
            GetSubFolders(Dts.Variables["User::startlocation"].Value.ToString());

            // Add root element to XMLdoc
            xmldoc.AppendChild(xmlRootElem);

            // Fill SSIS variable with XMLdoc
            Dts.Variables["User::xmldoc"].Value = xmldoc.InnerXml;

            Dts.TaskResult = (int)ScriptResults.Success;
        }

        // Recursive method that loops through subfolders
        private void GetSubFolders(String parentFolder)
        {
            // Get subfolders of the parent folder
            string[] subFolders = Directory.GetDirectories(parentFolder);

            // XML child element
            XmlElement xmlChildElem;
            XmlText xmltext;

            // Variable for file information
            DirectoryInfo directoryInfo;

            // Loop through subfolders
            foreach (string currentFolder in subFolders)
            {
                // Fill fileInfo variable with file information
                directoryInfo = new DirectoryInfo(currentFolder);

                // Create child element "Folder":
                // <Folder>d:\foreachfoldertest\subfolder1\</Folder>
                xmlChildElem = xmldoc.CreateElement("", "Folder", "");
                xmltext = xmldoc.CreateTextNode(directoryInfo.FullName);
                xmlChildElem.AppendChild(xmltext);
                
                // Add child element to root element
                xmlRootElem.AppendChild(xmlChildElem);

                // Recursive call
                GetSubFolders(directoryInfo.FullName);
            }
        }
    }
}

Note: there are more options to accomplish the communication between the Script Task and the Foreach Loop container, but I wanted to show you an infrequently used option of the Foreach Loop container. Let me know what your solution was.

Saturday 15 January 2011

Use filedates in SSIS

Case
A while ago someone asked me if it was possible to use the actual filedate of a file in SSIS.

Solution
Well, a Script task can retrieve the creation and modified date of a file. In this example I will import the latest file in a folder and delete all files older than 5 days. And as a bonus I will add an audit column with the file date. A lot of steps (maybe a bit much for experienced users, but not for novices).

1) Create variables
Create three variables named FilePath (string), FileDate (datetime) and SourceFolder (string). Fill the SourceFolder variable with an existing path where the source files are stored.
[Screenshot: Variables]

2) Script task
Drag a Script task and a Data Flow task to your Control Flow and give them suitable names.
[Screenshot: Script and Data Flow task]

3) Map variables in Script task
Edit the Script task and select the FilePath and FileDate as ReadWrite variables and the SourceFolder as ReadOnly variable.
[Screenshot: Map variables]

4) The Script
Copy the code from the main method to your code and add the extra using.
// C# code 
// Script to get the last modified file
// Note: if there are two files with both
// the highest modified date, the script
// will pick a random one.
using System;
using System.Data;
using System.IO;    // Added
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_9f1bb3d696fe44b9b5e50ee8f5f648b5.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {

        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        public void Main()
        {
            // Get all files in the SourceFolder
            string[] sourceFiles = Directory.GetFiles(Dts.Variables["User::SourceFolder"].Value.ToString());

            // Store the highest date and the corresponding file
            DateTime highestDate = new DateTime();
            string lastFile = string.Empty;

            // Variable for file information
            FileInfo fileInfo;

            // Loop through sourceFiles
            foreach (string currentFile in sourceFiles)
            {
                // Fill fileInfo variable with file information
                fileInfo = new FileInfo(currentFile);

                // Choose between the creation date and the last modified date;
                // this example uses the last modified date.
                // Compare the file date to the variable
                if (fileInfo.LastWriteTime > highestDate)
                {
                    // Fill variables if the current file is the newest
                    lastFile = currentFile;
                    highestDate = fileInfo.LastWriteTime;
                }
            }

            // Return values to the SSIS variables
            Dts.Variables["User::FilePath"].Value = lastFile;
            Dts.Variables["User::FileDate"].Value = highestDate;

            // Finish script
            Dts.TaskResult = (int)ScriptResults.Success;
        }
    }
}

5) Edit Data Flow
Create a Flat file source, Derived Column and a target of your choice. It should look something like this:
[Screenshot: Data Flow]

6) Source File
Add a connection manager to your Flat File Source. Select one of the files in your source folder for this connection manager; it doesn't matter which one, because the connection string will be overwritten by an expression. Add the columns to finish it. Now go to the properties of your newly created connection manager and add an expression.
[Screenshot: Right-click if you don't see the properties]

[Screenshot: Add an expression]

7) Expression
Add an expression for the ConnectionString Property so the ConnectionString will be overwritten by the value of the FilePath variable.
[Screenshot: Create an expression]

8) Derived Column
Edit the Derived Column and add a new column that is filled with the variable FileDate.
[Screenshot: New column with FileDate]

9) Target
Finish the dataflow by adding a target and test the first part of this mapping.
[Screenshot: The test result]

Now the second part, which deletes files older than 5 days. We could do this in one simple Script task, but I will try to limit the .Net code because this is an SSIS blog and not a .Net blog.


10) Delete old files
Add a Foreach Loop to the Control Flow with a Script Task and a File System Task in it. Your Control Flow should look something like this.
[Screenshot: Continuing Control Flow]

11) Configure Foreach Loop
Edit the Foreach Loop and select the Foreach File Enumerator. After that, enter the path of your source folder.
[Screenshot: Foreach File]

12) Expression
This step is not strictly necessary for this example, but it's just a bit nicer. Go to the Expressions on the Collection tab of the Foreach Loop and add an expression for Directory, which overwrites the Folder property (don't let the different names confuse you) with the variable SourceFolder.
[Screenshot: Add an expression for the Directory/Folder]

13) Variable Mapping
Map the FilePath variable to Index 0 in the Foreach Loop.
[Screenshot: Variable Mappings]

14) Script task
Copy the code from the main method to your code and add the extra using. Don't select the variables in the Script task this time; the script locks them itself via the VariableDispenser.
// C# code 
// Fill SSIS variable with the last modified date
// of the current source file
using System;
using System.Data;
using System.IO;    // Added
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_cb8dd466d98149fcb2e3852ead6b6a09.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {

        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        public void Main()
        {
            // Lock SSIS variables
            Dts.VariableDispenser.LockForRead("User::FilePath");
            Dts.VariableDispenser.LockForWrite("User::FileDate");

            // Create a variables 'container' to store variables
            Variables vars = null;

            // Add variables from the VariableDispenser to the variables 'container'
            Dts.VariableDispenser.GetVariables(ref vars);

            // Variable for file information
            FileInfo fileInfo;

            // Fill fileInfo variable with file information
            fileInfo = new FileInfo(vars["User::FilePath"].Value.ToString());

            // Choose between creation date and lastmodified date
            // Fill SSIS variable with last modified date
            vars["User::FileDate"].Value = fileInfo.LastWriteTime;

            // Release the locks
            vars.Unlock();
            
            Dts.TaskResult = (int)ScriptResults.Success;
        }
    }
}

15) Precedence Constraint
Add the following expression to the precedence constraint between the Script Task and the File System Task:
DATEDIFF("dd", @[User::FileDate], @[System::StartTime] ) > 5
[Screenshot: Date check in the Expression]

16) File System Task
Delete the file whose path is stored in the variable FilePath. You need to enter a dummy default value in the variable; otherwise you will get an error message: Variable "FilePath" is used as a source or destination and is empty.
[Screenshot: Delete the old file]

17) The result
Now test the package with a couple of files.
[Screenshot: Last file wasn't old]

That's that! Let me know if you have some other interesting SSIS ideas with file properties.

Update: a whole bunch of file properties.

Thursday 13 January 2011

Lookup Slowly Changing Dimension type II

Case
I have a slowly changing dimension type II (with a start and end date). How do I create a working lookup in my dataflow?

Solution
There are roughly three options:
A) Override lookup query with custom query
B) Split dimension in separate days with datetime dimension
C) Split dimension in separate days in second table

Let’s say we have a fact for employee number 123456789 with fact date 12 December 2010 and we want to know the right EmployeeId.

Sample of the Dim_Employee table:

EmployeeId | EmployeeNumber | Firstname | LastName | Department  | Startdate  | Enddate
-----------|----------------|-----------|----------|-------------|------------|-----------
1          | 123456789      | John      | Gilbert  | DepartmentX | 2004-01-01 | 2005-12-31
12         | 123456789      | John      | Gilbert  | DepartmentZ | 2006-01-01 | 2010-12-31
19         | 123456789      | John      | Gilbert  | DepartmentA | 2011-01-01 | 9999-12-31

Sample of the dataflow:
[Screenshot: Partial dataflow]

The lookup should return 12 as EmployeeId for this example.

Solution A
Override lookup query with custom query

A1) Query
Let’s have a closer look at the lookup query. Select the EmployeeId, EmployeeNumber, Startdate and Enddate from the employee dimension table (don't just select the whole table: see Performance Best Practices).
[Screenshot: Lookup Query]

A2) Select columns
Select the input columns to map to the lookup columns. Now you see the problem: the fact date can map to either the Startdate or the Enddate; there is no "between" here. For now, map the EmployeeNumber and the FactDate according to the sample and select EmployeeId as a new column.
[Screenshot: Column Mappings]

A3) Custom Query
This step is a little different for SSIS 2005 and SSIS 2008 (and newer). Go to the Advanced tab.
SSIS 2008:
In SSIS 2008 this tab is completely disabled with the message: This page is not used when Full cache mode is selected. To overcome this, go to the General tab and select Partial cache instead of Full cache. Now you can modify the SQL statement with the following query:
--Query with parameters
SELECT * 
FROM   (SELECT  EmployeeId
       ,        EmployeeNumber
       ,        StartDate
       ,        EndDate
       FROM     Dim_Employee) [refTable]
WHERE  [refTable].[EmployeeNumber] = ?
AND    [refTable].[StartDate] <= ?
AND    [refTable].[EndDate] > ?
[Screenshot: Modify SQL Statement in SSIS 2008]

SSIS 2005
For SSIS 2005: just Enable memory restriction and enter the query.
[Screenshot: Modify SQL Statement in SSIS 2005]

A4) Parameters
Now push the Parameters button on the Advanced tab to enter the mappings. Select FactDate (the date column in the source table) for both Parameter1 and Parameter2.
[Screenshot: Parameters]

A5) Result
Now your lookup is ready for testing.
[Screenshot: The result: EmployeeId 12]

The big downside of this method is the lack of caching; you cannot use full cache. It will work for a small number of records, but when the numbers grow, it will completely slow down your dataflow.

Solution B
Split dimension in separate days with datetime dimension
This solution is only possible if you have a time dimension.

B1) Query
Join the employee dimension and the time dimension, using between logic in the ON clause. This will result in a row for every dimension member for each day. 
--Query with join
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
[Screenshot: The new query, join with the time dimension]

B2) Select columns
Select the input columns to map to the lookup columns. Unlike option A, the mapping is easy.
[Screenshot: Column mapping]

B3) Result
Now you can test the dataflow and see that the result is similar to Solution A. But we have to narrow down the number of records to improve performance, because there are over 2500 records for this one employee.

There are a couple of options. If your source data contains only records for the current year, you could change the query to:
--Only current year
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
WHERE       YEAR(Dim_Time.Date) = YEAR(GETDATE())       

Or you can use a MIN(date) query on your source data and use that in the where clause.
--Use minimum
SELECT      Dim_Employee.EmployeeId
,           Dim_Employee.EmployeeNumber
,           Dim_Time.Date
FROM        Dim_Employee
INNER JOIN  Dim_Time
            ON Dim_Time.Date
            BETWEEN Dim_Employee.StartDate
            AND Dim_Employee.EndDate
WHERE       Dim_Time.Date >= (SELECT  MIN(FactDate)
                              FROM    YourStagingTable)

And there are probably some more advanced queries to narrow down the number of records.

Solution C
Split dimension in separate days in second table.

If you don't have a time dimension table, you can use a script (or query) to split all dimension members into separate days and copy those to a second table. Then use that second table in your lookup. And of course, try to narrow down the number of records just like in Solution B.

Conclusion
Every solution has its pros and cons. The best solution for you depends on a number of things, such as the number of records in your dimension and the date spread in the fact records. Test it! Let me know if you found another solution for your SCD Type II dimension lookup.

Alternatives without the Lookup Transformation
For large volumes of data there are a couple of alternatives:
1) Use a Merge Join Transformation instead of a lookup (join without the dates) and add a Conditional Split Transformation behind it that checks whether the date of the fact table is between the two dates of the dimension table.
2) The fastest option is to use a source query in an OLE DB Source Component and do the between query of Solution A in SQL Server.
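
A sketch of that second alternative, assuming a staging table named YourStagingTable with EmployeeNumber and FactDate columns (both names hypothetical, matching the earlier examples):

```sql
--Lookup via a join in the source query
SELECT      fact.*
,           dim.EmployeeId
FROM        YourStagingTable as fact
INNER JOIN  Dim_Employee as dim
            ON  fact.EmployeeNumber = dim.EmployeeNumber
            AND fact.FactDate BETWEEN dim.StartDate
                              AND     dim.EndDate
```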

* Update *
Also see this Script Component solution from Matt Masson.

Create and fill Time dimension

Case
Is there an easy way to create and populate a time(/date) dimension?

Solution
Creating a time dimension is usually done once, and probably not in SSIS but with a T-SQL script.
For each new assignment I use this script and adjust it to the requirements of that particular assignment.
-- Delete time dimension if it already exists.
IF Exists(Select Name from sysobjects where name = 'Dim_Time')
BEGIN
    Drop Table Dim_Time
END
GO

-- Standard options for creating tables
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

-- Create your dimension table
-- Adjust to your own needs
Create Table dbo.Dim_Time
(
    Dateid int IDENTITY (1,1) PRIMARY KEY CLUSTERED,
    Date date,
    DateString varchar(10),
    Day int,
    DayofYear int,
    DayofWeek int,
    DayofWeekName varchar(10),
    Week int,
    Month int,
    MonthName varchar(10),
    Quarter int,
    Year int,
    IsWeekend bit,
    IsLeapYear bit
)

-- Declare and set variables for loop
Declare
@StartDate datetime,
@EndDate datetime,
@Date datetime

Set @StartDate = '2000/01/01'
Set @EndDate = '2020/12/31'
Set @Date = @StartDate

-- Loop through dates
WHILE @Date <=@EndDate
BEGIN
    -- Check for leap year
    DECLARE @IsLeapYear BIT
    IF ((Year(@Date) % 4 = 0) AND (Year(@Date) % 100 != 0 OR Year(@Date) % 400 = 0))
    BEGIN
        SELECT @IsLeapYear = 1
    END
    ELSE
    BEGIN
        SELECT @IsLeapYear = 0
    END

    -- Check for weekend
    DECLARE @IsWeekend BIT
    IF (DATEPART(dw, @Date) = 1 OR DATEPART(dw, @Date) = 7)
    BEGIN
        SELECT @IsWeekend = 1
    END
    ELSE
    BEGIN
        SELECT @IsWeekend = 0
    END

    -- Insert record in dimension table
    INSERT Into Dim_Time
    (
    [Date],
    [DateString],
    [Day],
    [DayofYear],
    [DayofWeek],
    [Dayofweekname],
    [Week],
    [Month],
    [MonthName],
    [Quarter],
    [Year],
    [IsWeekend],
    [IsLeapYear]
    )
    Values
    (
    @Date,
    CONVERT(varchar(10), @Date, 105), -- See links for 105 explanation
    Day(@Date),
    DATEPART(dy, @Date),
    DATEPART(dw, @Date),
    DATENAME(dw, @Date),
    DATEPART(wk, @Date),
    DATEPART(mm, @Date),
    DATENAME(mm, @Date),
    DATEPART(qq, @Date), -- DATEPART instead of DATENAME: Quarter is an int column
    Year(@Date),
    @IsWeekend,
    @IsLeapYear
    )

    -- Goto next day
    Set @Date = @Date + 1
END
GO
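
A quick way to check the flags after running the script; the dates below are just sample values:

```sql
--Check the leap year and weekend flags
SELECT  [Date], [DayofWeekName], [IsWeekend], [IsLeapYear]
FROM    Dim_Time
WHERE   [Date] IN ('2012-02-29', '2011-01-23')
```

The first date should come back with IsLeapYear = 1 and the second (a Sunday) with IsWeekend = 1.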

Interesting links:
CAST and CONVERT: http://msdn.microsoft.com/en-us/library/ms187928.aspx
DATEPART: http://msdn.microsoft.com/en-us/library/ms174420.aspx
DATENAME: http://msdn.microsoft.com/en-us/library/ms174395.aspx

Let me know if you have an interesting addition for this script that could help others.

Wednesday 12 January 2011

Create a Row Id

Case
If you add records to a database table with an ID column, you would preferably use an identity column (Identity Specification, Is Identity = Yes). But what if your destination does not support an auto-identity or you're not allowed to modify it? You could do this with our Rownumber Component or a Third Party component, or...

Solution
You can use a Script component to accomplish an auto-identity column. In this example I will get the highest ID from a table and use that number as a starting number for new records.

1) Add variable
Add an integer variable named Counter to store an ID.
[Screenshot: Right-click in Control or Data Flow to show variables]

2) Add Execute SQL Task and Data Flow Task
Add an Execute SQL Task to your Control Flow and add a Data Flow Task right behind it.
[Screenshot: Execute SQL Task]

3) Get max ID
Edit the SQL task and change the ResultSet from None to Single Row. Select the right Connection and enter the query to get the highest ID.
[Screenshot: General tab of the Execute SQL Task]

-- Get highest ID
SELECT  MAX([Id]) as MaxId
FROM    [YourTable]
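
One caveat: MAX returns NULL when the table is still empty, which breaks the mapping to the integer variable. Wrapping it in ISNULL avoids that:

```sql
-- Get highest ID, falling back to 0 for an empty table
SELECT  ISNULL(MAX([Id]), 0) as MaxId
FROM    [YourTable]
```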

4) Result Set
Continue editing the SQL task and go to the Result Set tab. Connect the field MaxId from the query to your variable Counter. After this the Execute SQL Task is ready.
[Screenshot: Result Set]

5) DataFlow
Now go to your Data Flow. Add a random source, a Script Component (transformation) and a destination (the same table as in your Execute SQL Task). Give them suitable names. The result should look something like this.
[Screenshot: Data Flow]

6) The Script Component
We need a new column to store the RowId in. Add a new column on the Inputs and Outputs tab. The type should be an integer; the size depends on the column size in your database table.
[Screenshot: New column RowId]

7) The script itself
SSIS creates 3 methods for you: PreExecute to get the MaxId from the variable, Input0_ProcessInputRow to fill the new column RowId, and an optional PostExecute to fill the variable with the new MaxId after all the records have passed. This third method is only required if you need that number somewhere else.
// C# code: surrogate key script
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // New internal variable to store the rownumber
    private int rowCounter = 0;

    // Method that will be started before the rows start to pass
    public override void PreExecute()
    {
        base.PreExecute();

        // Lock variable for read 
        VariableDispenser variableDispenser = (VariableDispenser)this.VariableDispenser;
        variableDispenser.LockForRead("User::Counter");
    
        IDTSVariables100 vars;
        variableDispenser.GetVariables(out vars);

        // Fill the internal variable with the value of the SSIS variable
        rowCounter = (int)vars["User::Counter"].Value;

        // Unlock variable
        vars.Unlock();
    }

    // Method that will be started after all rows have passed
    // This method is optional. Only add it if you are gonna
    // use the SSIS variable after the dataflow is finished.
    public override void PostExecute()
    {
        base.PostExecute();

        // Lock variable for write
        VariableDispenser variableDispenser = (VariableDispenser)this.VariableDispenser;
        variableDispenser.LockForWrite("User::Counter");

        IDTSVariables100 vars;
        variableDispenser.GetVariables(out vars);

        // Fill the SSIS variable with the value of the internal variable
        vars["User::Counter"].Value = rowCounter;

        // Unlock variable
        vars.Unlock();
    }

    // Method that will be started for each record in your dataflow
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Increment the counter
        rowCounter++;
        // Fill the new column
        Row.RowId = rowCounter;
    }
}

Or the VB.Net equivalent:

' VB.Net code: surrogate key script 
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    ' New internal variable to store the rownumber
    Private rowCounter As Integer = 0

    ' Method that will be started before the rows start to pass 
    Public Overrides Sub PreExecute()
        MyBase.PreExecute()

        ' Lock variable for read  
        Dim variableDispenser As VariableDispenser = CType(Me.VariableDispenser, VariableDispenser)
        variableDispenser.LockForRead("User::Counter")

        'Use IDTSVariables90 if you're using SSIS 2005
        Dim vars As IDTSVariables100
        variableDispenser.GetVariables(vars)

        ' Fill the internal variable with the value of the SSIS variable
        rowCounter = CInt(vars("User::Counter").Value)

        ' Unlock variable
        vars.Unlock()
    End Sub

    ' Method that will be started after all rows have passed
    ' This method is optional. Only add it if you are going to
    ' use the SSIS variable after the dataflow has finished.
    Public Overrides Sub PostExecute()
        MyBase.PostExecute()

        ' Lock variable for write 
        Dim variableDispenser As VariableDispenser = CType(Me.VariableDispenser, VariableDispenser)
        variableDispenser.LockForWrite("User::Counter")

        'Use IDTSVariables90 if you're using SSIS 2005
        Dim vars As IDTSVariables100
        variableDispenser.GetVariables(vars)

        ' Fill the SSIS variable with the value of the internal variable 
        vars("User::Counter").Value = rowCounter
        ' Unlock variable 
        vars.Unlock()
    End Sub

    ' Method that will be started for each record in your dataflow
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Increment the counter
        rowCounter = rowCounter + 1
        ' Fill the new column
        Row.RowId = rowCounter
    End Sub
End Class
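Stripped of the SSIS plumbing, the logic of these scripts boils down to: seed a counter from the stored maximum, hand out consecutive numbers per row, and report the new maximum back. Here is a minimal standalone C# sketch of just that logic (the class and method names are made up for this illustration; they are not part of the SSIS script template):

```csharp
using System;

// Standalone illustration of the surrogate key logic:
// seed from the last used id and assign consecutive ids per row.
public class SurrogateKeyDemo
{
    private int rowCounter;

    public SurrogateKeyDemo(int lastUsedId)
    {
        // PreExecute: seed the internal counter from the SSIS variable
        rowCounter = lastUsedId;
    }

    public int NextId()
    {
        // Input0_ProcessInputRow: one new id per record
        rowCounter++;
        return rowCounter;
    }

    public int LastUsedId
    {
        // PostExecute: the value to write back to the SSIS variable
        get { return rowCounter; }
    }

    public static void Main()
    {
        SurrogateKeyDemo keys = new SurrogateKeyDemo(100); // 100 rows already loaded
        Console.WriteLine(keys.NextId());   // 101
        Console.WriteLine(keys.NextId());   // 102
        Console.WriteLine(keys.LastUsedId); // 102
    }
}
```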

8) Map in Destination
Make sure you don't forget to map the new column RowId in your destination. Now run your package to see the result.

Another option to create a unique RowId is to use a GUID instead of an integer. The Script Component needs only one method for this solution.
// C# code
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Create a Globally Unique Identifier with SSIS
    Row.Guid = System.Guid.NewGuid();
}

' VB.Net code
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    ' Create a Globally Unique Identifier with SSIS
    Row.Guid = System.Guid.NewGuid()
End Sub

Detailed information about that can be found here.

Saturday 8 January 2011

Zip sourcefile to archive

Case
I want to zip and delete my sourcefiles within the SSIS package when I have processed them.

Solution
This is a follow-up on the unzip article. There are a lot of options to zip files within SSIS. For example:
* Update: my own zip task *

This solution uses the Microsoft Visual J# Redistributable Packages (don't worry, you don't have to write J#), because it's free and you don't have to use a third-party or open-source dll on a production server, which is often a big deal.

1) Visual J# Redistributable Packages
Install the Microsoft Visual J# Redistributable Packages on your machine(s). You can download the correct version (x86, x64, IA64) at Microsoft.

2) Variables
Create two String variables named SourceFile and ArchiveFolder to store the filename and the zip/archive folder. Fill the ArchiveFolder with the folder path where you want to store the zip files (for example H:\SourceFiles\Archive\).
Two String variables

3) Loop through folder with sourcefiles
This example loops through a folder, processes all files and then archives them. Create a foreach loop that loops through a folder with some text files (content doesn't matter for this example). Add an empty dataflow just for decoration. You can use that one later to process the textfiles.
Foreach file enumerator

4) Map variable
Map the SourceFile variable to index 0 in the Foreach Loop.
Variable Mappings

5) Script task with variables
Create a Script task within the Foreach loop and select the two variables as ReadOnly.
ReadOnly Variables

6) Zip with Script task
Add a reference to the Microsoft Visual J# Redistributable Packages dll in your script. The name of the .NET dll is vjslib. After you have added it, it shows on the right side (3).
In Project menu, Add Reference...

After that, add the following code (method and usings):
// C# code
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.IO;        // Added
using java.util.zip;    // Added
using java.io;          // Added 

namespace ST_967ebb1a53884e89ac28f0a69ad082f1.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {

        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        public void Main()
        {
            // Variables for zipfile and unzip folder
            String SourceFile;
            SourceFile = Dts.Variables["SourceFile"].Value.ToString();
            String ArchiveFolder;
            ArchiveFolder = Dts.Variables["ArchiveFolder"].Value.ToString();
            
            // Determine zipfile path: archive folder + filename (without path and extension) + "zip"
            // Note: the substring keeps the trailing dot, so appending "zip" yields ".zip"
            // Example:
            // SourceFile = c:\temp\test.txt
            // ArchiveFolder = c:\temp\archive\
            // Then ZipFile = c:\temp\archive\test.zip
            String ZipFile;
            ZipFile = ArchiveFolder + SourceFile.Substring((SourceFile.LastIndexOf(@"\") + 1),(SourceFile.LastIndexOf(".") - SourceFile.LastIndexOf(@"\"))) + "zip";

            // Create Zipfile
            FileOutputStream fos = new FileOutputStream(ZipFile);
            ZipOutputStream zos = new ZipOutputStream(fos);

            // Add zipfile
            FileInputStream fis = new FileInputStream(SourceFile);

            // Remove drive and replace '\' with '/' for internal name
            // Example:
            // SourceFile = c:\temp\test.txt
            // Then internal name = temp/test.txt
            ZipEntry ze = new ZipEntry(SourceFile.Substring(3).Replace('\\', '/'));
            zos.putNextEntry(ze);

            // Loop through sourcefile to zip it
            sbyte[] buffer = new sbyte[1024];
            int len;
            while((len = fis.read(buffer)) >= 0)
            {
              zos.write(buffer, 0, len);
            }

            // Close all objects
            zos.closeEntry();
            fis.close();
            zos.close();
            fos.close();

            // Finally delete the source file
            System.IO.File.Delete(SourceFile);

            Dts.TaskResult = (int)ScriptResults.Success;
        }
    }
}
Note: you can delete the sourcefile with a File System Task as well, but this is only one extra line of code.
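As an aside, the one-line Substring/LastIndexOf calculation that builds the zip path is compact but easy to get wrong by one character. The same calculation isolated in a standalone sketch (the method name GetZipPath is made up for this illustration), this time keeping the dot out of the substring and appending ".zip" explicitly:

```csharp
using System;

public class ZipPathDemo
{
    // Build the zip path: archive folder + filename without path and extension + ".zip"
    public static string GetZipPath(string sourceFile, string archiveFolder)
    {
        int slash = sourceFile.LastIndexOf('\\'); // position of the last backslash
        int dot = sourceFile.LastIndexOf('.');    // position of the extension dot
        string name = sourceFile.Substring(slash + 1, dot - slash - 1);
        return archiveFolder + name + ".zip";
    }

    public static void Main()
    {
        // SourceFile = c:\temp\test.txt, ArchiveFolder = c:\temp\archive\
        Console.WriteLine(GetZipPath(@"c:\temp\test.txt", @"c:\temp\archive\"));
        // c:\temp\archive\test.zip
    }
}
```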

7) Test
The end result looks something like this. Take it for a test drive...
The end result

Note: This is a very basic script. You can add tons of checks and other features to make sure the script always works. Let me know if you have any questions. An unzip example is available here.

Friday 7 January 2011

Slowly Changing Dimension Alternatives

Case
The Slowly Changing Dimension component works fine for small dimensions, but performance is very poor for slightly bigger dimensions. Is there an alternative?

Solution
There is a request on the Microsoft site to improve the SCD in SQL 2012, which you can support, but until then... some alternatives:

A) There is an open source project named SSIS Dimension Merge SCD Component which performs better (100x faster according to the developers).

B) There is a third party Table Difference component that does the same trick, but faster. Downside is the yearly $400 fee. Another (free) third party component is the Konesans Checksum Transformation, which you can use to accomplish an SCD. And we made our own Checksum Transformation.

C) There is the T-SQL MERGE statement, which was added in SQL Server 2008. It performs insert, update, or delete operations on a target table based on the results of a join with a source table.
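As a rough sketch of option C (table and column names are only assumptions for illustration, borrowing the client dimension used further on; the staging table Stage_Client is hypothetical):

```sql
--MERGE sketch: insert new clients and update changed ones in one statement
MERGE [Dim_Client] AS target
USING [Stage_Client] AS source
ON      target.BusinessKey = source.BusinessKey
WHEN MATCHED AND (target.Title     <> source.Title
              OR  target.FirstName <> source.FirstName
              OR  target.LastName  <> source.LastName)
    THEN UPDATE SET target.Title     = source.Title,
                    target.FirstName = source.FirstName,
                    target.LastName  = source.LastName
WHEN NOT MATCHED BY TARGET
    THEN INSERT (BusinessKey, Title, FirstName, LastName)
         VALUES (source.BusinessKey, source.Title, source.FirstName, source.LastName);
```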

D) And you can use the standard SSIS components (lookups) to accomplish a faster SCD.

Let's elaborate on option D to see the difference between the standard Slowly Changing Dimension component and the lookups. Case: 15481 clients in a dimension table. Process the same clients again, with 100 new clients and 128 changed ones.

Slowly Changing Dimension Component solution
The standard solution with the Slowly Changing Dimension component looks something like this and takes about 25 seconds on average to complete on my laptop.
Slowly Changing Dimension component

Lookups solution
The solution with two lookups looks something like this and takes less than 1.5 seconds on average to complete. The first lookup only compares the business key. If the business key from the source doesn't exist in the dimension table, then it's a new record/client. All other records go to the second lookup, which compares all attributes (including the business key) to the dimension table. When no match is found, the client has been updated.
Two lookups

This performs very well if there are not too many updates. If I change the number of updates from 128 to 15422 records, the whole process takes around two minutes with this method, which is only slightly faster than the standard SCD component.

Alternative for lots of updates
Another alternative is to change the Update Clients command to an insert statement that inserts all updated records into a second table. A lot of single update statements slow down the process; a fast-load insert takes only a fraction of the time.
Change update to insert (into an other table)

After that you can use a batch-update to update all changed clients.
Execute SQL Statement

This alternative takes only six seconds in total. The batch-update query looks something like this.
--Batch update
UPDATE  [Dim_Client]
SET     [Dim_Client].Title          = [Dim_Client2].Title,
        [Dim_Client].FirstName      = [Dim_Client2].FirstName,
        [Dim_Client].LastName       = [Dim_Client2].LastName
FROM    [Dim_Client], [Dim_Client2]
WHERE   [Dim_Client].BusinessKey    = [Dim_Client2].BusinessKey

So there are alternatives for the standard SCD component. Which is best? That depends on the situation (are third party/open source dll's allowed on the production servers, how many updates do you expect, etc.). Let me know if you have another alternative...

* Update: TSQL Merge added (suggested by Koen)*