Microsoft SQL Server Integration Services: SSIS 2005

Showing posts with label SSIS 2005. Show all posts

Friday, 12 September 2014

Foreach loop with .xls wildcard also returns .xlsx files

Case
I have a Foreach Loop Container with a file enumerator. The wildcard is *.xls, but it also returns *.xlsx files. How do I prevent that?

Loop through *.xls also includes xlsx files

My xls loop includes xlsx and xlsm files

Solution
This is actually similar to the DIR command in a DOS/Command Prompt.

All xls files? (/b is to remove my Dutch header/footer)

The workaround is simple. And if you don't like the solution then you could use my Sorted File Enumerator that also supports regular expression wildcards.

1) Dummy
Add an empty/dummy task or Sequence Container in your Foreach Loop Container. And connect it to your first task.

Empty/collapsed Sequence Container added

2) Precedence Constraint Expression
Add an expression on the Precedence Constraint between the dummy and your first task. It should look something like LOWER(RIGHT(@[User::FilePath], 4)) == ".xls" (replace the variablename and/or file extension).

Expression with LOWER and RIGHT to check the file extension

3) The result
Now test the package (Replace my example Script Task with your own tasks).

The result: only two xls files and no xlsx or xlsm files

Sunday, 10 November 2013

Execute multiple child packages in parallel with loop

Case
I used a foreach loop to execute all my staging packages, but my server isn't using all resources. Is there a way to execute multiple child packages at once, but with a loop?

Executing packages sequentially

Solution
The trick is to use a queue of packages and to have multiple 'processes' taking packages from the queue to execute them. The number of packages that can be executed at a time depends on the complexity of your packages. Staging packages for a single source are not complex, so a good guideline/starting point is to execute one package per processor core.

4 cores => execute 4 child packages at the same time

1) The Queue
For this example I will use a FIFO (first in, first out) queue that is stored in a database table. Alternatives could be the Windows Message Queue or the SQL Service Broker.

-- Create QUEUE table
CREATE TABLE [dbo].[FifoPackageQueue](
 [Id] [bigint] IDENTITY(1,1) NOT NULL,
 [Package] [varchar](50) NULL
) ON [PRIMARY]

GO

-- Add Index on Id
CREATE CLUSTERED INDEX cdxFifoPackageQueue on FifoPackageQueue (Id)

GO

-- Log tabel
CREATE TABLE [dbo].[FifoPackageQueueLog](
 [Id] [bigint] IDENTITY(1,1) NOT NULL,
 [Package] [varchar](50) NULL,
 [StartTime] [datetime] NULL,
 [EndTime] [datetime] NULL
) ON [PRIMARY]

GO

2) Variables and Parameter
For each 'execution line' in my control flow I need one SSIS string variable to store the package name. And I use one Package Parameter to indicate how many execution lines will be active. You can also use an integer variable for that if your SSIS version doesn't have parameters.

String variables, one per execution line

Integer Package Parameter.

3) Fill Queue
In this example I will use an INSERT INTO SELECT query to get all staging packages from SSISDB. You could also use regular INSERT queries or even a Foreach Loop or Script Task that loops through a folder to add packages from the file system to the queue.

INSERT INTO CONV_LOG..FifoPackageQueue
(
     [Package]
)
SELECT      Packages.[name]
FROM        [SSISDB].[internal].[packages] as Packages
INNER JOIN  [SSISDB].[internal].[projects] as Projects
            on Packages.project_version_lsn = Projects.object_version_lsn
WHERE       Projects.name = 'MyProject'
AND         Packages.name like 'STG%';

4) Execution Lines
Each Execution line starts with a Sequence Container connected to the Fill queue task. The Precedence Constraint Expression is @[$Package::ParallelProcesses] >= 1 for the first and @[$Package::ParallelProcesses] >= 2 for the second and so on.

Expression to limit the number of parallel executions

5) Get first package from queue
The Execute SQL Task gets the first package name from the queue and stores it in the SSIS string variable. If the variable is empty then it doesn't continue to the next For Loop.

Output clause is available in SQL 2005

set nocount on;

-- Declare temporary table to store the result of the delete
DECLARE @TmpTable TABLE (Package varchar(50));
-- Declare integer variable for counting the delete result
DECLARE @PackageCount int;

-- Select first record, lock it, delete it and store name in temporary table
WITH cte as (
   SELECT  top(1) Package
   FROM  FifoPackageQueue WITH (rowlock, readpast)
   ORDER BY Id
   )
DELETE FROM cte
output deleted.Package INTO @TmpTable

-- Check if there is 1 record in temporary table
SELECT @PackageCount = count(*) FROM @TmpTable
if @PackageCount = 1
BEGIN
 -- Return package name
 SELECT Package FROM @TmpTable
END
ELSE
BEGIN
 -- Temporary table was empty so queue was empty
 -- Return empty string to stop next precedence constraint
 SELECT '' as Package
END

Store package name in SSIS variable

6) For Loop
The For Loop loops until the package name is empty.

Loop until empty

7) Log starttime
The first Execute SQL Task in the loop inserts the package name and a GETDATE() for the starttime in the log table with an INSERT query. The variable containing the package name is a parameter for this task. A very simple/basic log mechanisme. Adjust it to your needs or remove it if you have an other log mechanism.

INSERT INTO  FifoPackageQueueLog (Package, StartTime)
VALUES   (?, GETDATE())

8) Execute Package Task
Add an expression on the packagename so that it gets replaced with the value of the variable. In the properties of the task set DelayValidation = True. This will prevent errors if your variable is empty.

9) Log endtime
The second Execute SQL Task in the loop logs the enddate with an UPDATE query. The variable containing the package name is a parameter for this task. The start- and enddate will help you choose the optimal number of parallel tasks.

UPDATE  FifoPackageQueueLog
SET   EndTime = GETDATE()
WHERE  Package = ?
AND   EndTime is null

10) Get next package from queue
This is the exact same task/query as for getting the first package from the queue. If it can't find a package in the queue then it will fill the variable with an empty string and the For Loop will stop.

11) Multiple Execution lines
Repeat steps 4 to 10 an x times. Probably a couple more then you have processor cores in your server. Then start testing to find out the optimal number of parallel tasks.

12) Download example package
For SSIS 2012 I have added an example package for download. It contains 5 execution lines. Add more if you have more cores available. The example is provided for educational purposes only. This example package is not intended to be used in a production environment and has not been tested in a production environment. Test it thoroughly before using it.

Note: you will find a similar solution in the 10 Tips and Tricks for Better SSIS Performance presentation of David Peter Hansen and also in the SQL Rally Amsterdam presentation of Davide Mauri about Automating DWH Patterns Through Metadata. This example package is inspired by their solutions.

Sunday, 14 April 2013

Performance Best Practice: Flat File (Fast) Parse

Here are some tips to speed up the reading of flat files within the SSIS data flow. They are especially handy for importing large flat files (or when you merge join your small flat file to a large dataset). Start, where possible, by reading the files from a fast drive (a Solid State Disk / preferably not used by Windows or SQL Server) instead of some share.

Minimum datatype
By default all columns are string 50 in a Flat File Connection Manager. To get as many rows in a data flow buffer it's important to use a minimum data type for each column. And if it's for example an integer column then parse it to an integer in the Flat File Connection Manager instead of parsing it in the data flow with a Data Conversion or Derived Column Transformation. Otherwise you end up with two columns in your buffer instead of one.

A good start is to use the Suggest Types button in the Flat File Connection Manager editor. It will scan a couple of rows from your flat file and come up with a minimum data type.

Suggest Types

Unused columns
In the Flat File Connection Manager it's impossible to skip columns that you don't need, but to minimize pressure on performance you should not parse unneeded columns. Parsing/converting is expensive. Just leave it string. In the Flat File Source editor you can uncheck the unneeded columns.

Leave it string, don't parse.

Uncheck unneeded flat file columns

Fast Parse
If you have a so called 'trusted' source (for example a database export) then you don't have to worry for mistakes in the data types. To speed up the performance you can enable fast parse for all non-string columns. But for date columns you have to be sure that the format is correct. Try using the ISO format YYYY-MM-DD.
You can enable fast parse in the Advanced Editor of the Flat File Source. Right click it and choose "Show Advanced Editor...".

Show Advanced Editor...

Then go to the Input and Output Properties tab and then to the Output Columns. Select a non-string column and set the FastParse property to true. Repeat this for all non-string columns.

Enable FastParse

Bulk Insert Task
If your destination is a SQL Server table and you don't need data transformations then you might want to consider/test the Bulk Insert Task as a alternative for the Data Flow Task.

Summary
Minimum data types
Parse in Connection Manager
No unnecessary parsing
Fast Parse

More info: Blog of Jamie Thomson and Henk van der Valk or MSDN

Saturday, 13 April 2013

Performance Best Practice: more rows per buffer

SSIS uses buffers to transport a set of rows through the data flow task. In general the less buffers you need to transport all rows from the source to the destination the faster it is.

You can compare a buffer going through a data flow, with a fire bucket going through a medieval line of people trying to extinguish a fire.

Buffers going through a data flow

Source: http://blog.advocate-art.com/index.php/archives/2192/greatfire_bucketlinelo-3

You can extinguish the fire faster by adding an extra line of people (parallel process) or my making the buckets larger (but not too large, otherwise they can't lift the buckets).

For this example I use a table with Google Analytics data from my blog staged in a database table.

Table with quite large columns

Changing Buffer Size
By default the buffer size is 10 Megabytes (1024 * 1024 * 10 = 10,485,760 bytes). You can adjust this size for each Data Flow Task (see Data Flow Task Property DefaultBufferSize).

DefaultBufferSize 64KB <> 100MB

I have a Package that pauses between buffers with a Script Component. This allows me to see how many records there are in one buffer. In my case 1 row of Google Analytics data is about 29000 bytes and if I run my Data Flow Task I can see that 360 rows of data fit in to my 10MB buffer.

360 rows = ±10MB

By making my buffer twice the size you will see that the number of rows in my buffer doubles. You can change the buffer size up to 100MB, but you have to remember that there are multiple buffers going through your data flow at the same time. If you don't have enough memory to fit all those buffers, then SSIS will swap memory to disk making it very slow! So only use bigger buckets if you can lift them...

721 rows = ± 20MB

Adjust row size
Never use a table as a datasource! Selecting a table is akin to "SELECT *..." which is universally recognized as bad practice(*). Always use a query. This allows you to skip unwanted columns and rows.

Unchecking "Available External Columns" at the Columns pane of the OLE DB Source Editor will prevent unwanted column in the buffer, but the data will have to go from SQL Server to SSIS. Removing unwanted column with a query will reduce the number of network packets going from a SQL Server machine to your SSIS machine.

Use query instead of whole table

Let's remove an unwanted column to see the effect. Removing the nvarchar(4000) column [ga:referralPath] will reduce the total recordsize from 29030 to 21028 bytes. Now 497 rows instead of 320 will fit in my 10MB buffer.

Removing unwanted columns => more rows in buffer

Adjust Column Size
Often column are just unnecessary too big in the source. If you can't change the source then you could change it in the source query with a CAST or CONVERT. For example the nvarchar(4000) column for page titles is way to big. My titles are not that long and the characters should fit in varchar. Let's CAST it to varchar(255) and see how many extra rows will fit in the buffer.

Resizing datatypes => more rows in buffer

Now we have 787 rows instead of 320 rows in the 10mb buffer. That should speed up this package!

Note: if you change datatypes in an existing source query, then you have to uncheck and check the changed columns in the Columns pane of the OLE DB Source editor. This will refresh the metadata in the data flow. Otherwise you will get an error like: Validation error. DFT - Visits: DFT - Visits: Column "ga:pageTitle" cannot convert between unicode and non-unicode string data types.

Uncheck and check changed columns

Add WHERE clause
Don't use a Conditional Split Transformation to remove records after an database table source. Instead use a WHERE clause in your source query. The Conditional Split is a so called Magic Transformation. It will only visually hide records instead of removing them from the buffer. And the records are unnecessary transported from SQL Server to SSIS.

Add WHERE clause instead of Conditional Split

RunInOptimizedMode
This is a Data Flow Task property that, if set to true, will 'remove' unused columns and unused transformations to improve performance. Default value is true, but this property isn't the holy grail and you shouldn't rely on it while developing your packages.

RunInOptimizedMode

Enable logging on Data Flow Task
There are some events that you could log to get information about the buffer:
User:BufferSizeTuning: This will tell you how many rows there were in the buffer.

Rows in buffer type 0 would cause a buffer size greater than the configured maximum. There will be only 787 rows in buffers of this type.
Rows in buffer type 0 would cause a buffer size greater than the configured maximum. There will be only 1538 rows in buffers of this type.

User:PipelineInitialization: This will tell you something about which settings are used.

The default buffer size is 10485760 bytes.
Buffers will have 10000 rows by default.
The data flow will not remove unused components because its RunInOptimizedMode property is set to false.

BufferSizeTuning and PipelineInitialization

Performance Counters
SSIS comes with a set of performance counters that you can use to monitor the performance of the Data Flow Task. More about that in a later blog. For now see msdn.

Summary
Optimize Buffer Size
Use Source Query
Adjust Row Size
Adjust Column Size
Use Where clause

More info: Blog of Matt Masson, *Jamie Thomson, SQLCAT and MSDN

Tuesday, 9 April 2013

Performance Best Practice: Network Packet Size

Network
If your SSIS package gets its data from a SQL Server database located on an other machine, then the data of your query will go from the SQL Server machine over the network to your SSIS machine. This is done in small network packets with a default size of 4096 bytes (4 kilobytes).

Buffers
The Data Flow Task uses buffers to transport multiple rows of data through the data flow. Its default size is 10485760 bytes (10 megabytes). This means you need 2560 network packets (10485760 / 4096) to fill one buffer. By increasing the size of the packets you need less packets to fill that buffer. With the maximum size of 32768 bytes (32 kilobytes) you only need 320 network packets (10485760 / 32768) to fill one buffer.
This could, depending on the quality of your network, improve the performance significant (For poor performing networks you need to resend a lot of packets, making large packets inefficient).

Default Max Buffer Size: 10 MB

You can adjust the default size of 4096 bytes within SQL Server. However I strongly recommend not to change that in SQL Server. You should add the network packet size to the connectionstring to override this default value.

Do not change within SQL Server

OLE DB:
Data Source=mySqlServer\SQL2012;Initial Catalog=myDatabase;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;Packet Size=8192;

ADO.NET
Data Source=mySqlServer\sql2012;Initial Catalog=myDatabase;Integrated Security=True;Packet Size=10000;Application Name=SSIS-package;

SSIS OLE DB Connection Manager
Within SSIS you can change this value in the OLE DB Connection Manager or directly in the Connectionstring property. The maximum value for the network packet size is 32768 bytes, but the property is zero-based, so the max number is 32767. The value 0 means use the SQL Server default.

Network Packet Size property OLE DB

SSIS ADO.Net Connection Manager
The ADO.Net Connection Manager has a default of 8000 bytes. this one isn't zero-based, so its max value is 32768.

Network Packet Size property ADO.Net

Note 1: if you are using package configurations on the connectionstring property then don't forget to add this value to the connectionstring in your config table/file.
Note 2: you could enable jumbo frames on your network to increase performance even more, but you should consult a network specialist for that.

More info: Blog of Henk van der Valk, SQLCAT and MSDN.

Wednesday, 6 February 2013

Value of variable during runtime

Case
I want to know the value of a variable during runtime of an SSIS package.

Value of variable between two tasks

Solution
You can either use a breakpoint or a Script Task to find out the value of a variable.

A) Script Task

A1) Add Script Task
Add a Script Task between the two tasks and select the FilePath variable as readdonly variable.

Script Task - Readonly variable

A2) The script
There are three options that you could choose. Pick one.

//C# code
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_d730d75a40304a6bb675bc184c2aa717
{
 [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
 public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
 {

  public void Main()
  {
        // Choose one of these methods:

        // 1) Fire event and watch the execution result tab
        bool fireAgain = true;
        Dts.Events.FireInformation(-1, "Value of FilePath:", Dts.Variables["User::FilePath"].Value.ToString(), string.Empty, -1, ref fireAgain);


        // 2) Use the .Net framework trace log and see Debug View: http://technet.microsoft.com/en-us/sysinternals/bb896647.aspx
        System.Diagnostics.Trace.WriteLine("Value of FilePath: " + Dts.Variables["User::FilePath"].Value.ToString());


        // 3) Good old messagebox and click to continue
        System.Windows.Forms.MessageBox.Show("Value of FilePath: " + Dts.Variables["User::FilePath"].Value.ToString());


        Dts.TaskResult = (int)ScriptResults.Success;
  }

        #region ScriptResults declaration
        /// <summary>
        /// This enum provides a convenient shorthand within the scope of this class for setting the
        /// result of the script.
        /// 
        /// This code was generated automatically.
        /// </summary>
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion
 }
}

or VB.Net

'VB.Net code
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime

<Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute()> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
    Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

    Public Sub Main()
        ' Choose one of these methods:

        ' 1) Fire event and watch the execution result tab
        Dim fireAgain As Boolean = True
        Dts.Events.FireInformation(-1, "Value of FilePath:", Dts.Variables("User::FilePath").Value.ToString(), String.Empty, -1, fireAgain)


        ' 2) Use the .Net framework trace log and see Debug View: http://technet.microsoft.com/en-us/sysinternals/bb896647.aspx
        System.Diagnostics.Trace.WriteLine("Value of FilePath: " + Dts.Variables("User::FilePath").Value.ToString())


        ' 3) Good old messagebox and click to continue
        System.Windows.Forms.MessageBox.Show("Value of FilePath: " + Dts.Variables("User::FilePath").Value.ToString())


        Dts.TaskResult = ScriptResults.Success
    End Sub

#Region "ScriptResults declaration"
    'This enum provides a convenient shorthand within the scope of this class for setting the
    'result of the script.

    'This code was generated automatically.
    Enum ScriptResults
        Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
        Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    End Enum

#End Region

End Class

A3) The Result
Now run the package and watch the result.

Oops, no flat file but an Excel file

B) Breakpoints

B1) Breakpoint
Right click the Data Flow Task, Choose "Edit Breakpoints..." and then select the OnPreExecute event. This is the event right before starting the Data Flow Task.

Add breakpoint

B2) Excute package
Now execute the package and wait for it to hit the breakpoint.

Run package and wait for breakpoint

B3) Locals
Wait for the package to hit the breakpoint. Then go to the Debug menu and click Windows and then Locals (Ctrl+Alt+V,L). This wil open a new window.

Locals

B4) The result
Now downdrill the variables in the Locals window and search for your variable and its value.

Oops, no flat file but an Excel file

This last method is probably a lot easier.

Tuesday, 1 January 2013

Master Child packages - Part 2: SQL Server based

Case
An often seen solution is a master package calling a couple of child packages with the Execute Package Task. This works fine for a couple of packages, but is a little boring for a whole bunch of packages. Is there an easier more clear way to maintain a master package?

Server based and file based child packages

Solution
A simple solution is to use a Foreach Loop Container with an Execute Package Task in it that loops through a folder with packages. It works both for file-based and server-based packages.

But there are a couple of drawbacks:
Drawback 1: The child packages are not executed simultaneously, but one after another. Will handle this problem in a future post.
Drawback 2: The options to determine the order of execution are limited. You can only order by name. So if a certain order is required then you need to add some prefix to the packagename to determine the order.

I have prepared three solutions:
A) File based: SSIS 2005, 2008 or 2012 if you use package deployment.
B) SQL Server based: SSIS 2005, 2008 or 2012 if you use package deployment.
C) Project Referenced: SSIS 2012 if you use project deployment. This solution is nearly equal to solution B.

B) SQL Server based
For this solution, you need to create a query on the msdb database to get the list of packages from Integration Service. For this query we need the system tables sysssispackages and sysssispackagefolders.

Get list of packages from MSDB in SSIS 2008

1) Variables
Add a string variable to the package and name it PackagePath. This will contain the filepath of the package. Also create an object variable named Packages. This will contain a list of packages from the MSDB.

Right click in Control Flow

2) OLE DB Connection Manager
Create an OLE DB Connection Manger that connects to the msdb database. We will use this connection manager for geting a list of packages and to execute the SQL Server based packages.

OLE DB Connection to MSDB

3) Execute SQL Task
Add an Execute SQL Task and give it a suitable name. Edit it; Set ResultSet to Full result set. Select the newly created Connection Manger and enter the query below.

Execute SQL Task

-- Get list of packages. Change the where clause.
SELECT   '\' + folders.foldername + '\' + packages.name as PackagePath --'Concatenate
FROM   msdb.dbo.sysssispackages as packages
INNER JOIN  msdb.dbo.sysssispackagefolders as folders
    on folders.folderid = packages.folderid
WHERE   folders.foldername = 'Staging'
AND    packages.name like 'STG%'
ORDER BY  packages.name

4) Execute SQL Task - Result Set
Go to the Result Set pane and click Add and select the object variable from step 1. The Result Name should be 0.

Result Set

5) Foreach Loop
Add a Foreach Loop Container to the control flow and give it a suitable name. Then connect the Execute SQL Task to the Foreach Loop.

Foreach Loop Container

6) Foreach ADO Enumerator
Edit the Foreach loop and select the Foreach ADO Enumerator as the enumerator type. After that select the object variable Packages as the ADO object source variable. The Enumeration mode should be "Rows in the first table".

Foreach ADO Enumerator

7) Variable Mapping
Go to the Variable Mapping pane and select the PackagePath variable for index 0.

Variable Mappings

8) Execute Package Task
Add an Execute Package Task in the Foreach Loop. Give it a suitable name and configure it to call one of your child packages (Location = SQL Server). Just pick one. We will overrull the path in the next step.

Execute Package Task

9) Expression
Go to the properties of your newly created Execute Package Task and add an expression on the PackageName property that overrules its value with the variable PackagePath from step 1.

Expression overrulling PackageName(path)

10) Delay Validation
If the value of the PackagePath variable doesn't contain a real path of a variable, then you will get a validation error on runtime. You could either fill the variable with a default value or just set the Delay Validation property of the Execute Package Task to false.

The package is not specified

11) The result
A clear package with only one Execute Package Task.

The result

Go to A) File based or C) Project Referenced