Sunday, 14 April 2013

Performance Best Practice: Flat File (Fast) Parse

Here are some tips to speed up the reading of flat files within the SSIS data flow. They are especially handy for importing large flat files (or when you merge join your small flat file to a large dataset). Start, where possible, by reading the files from a fast drive (a Solid State Disk / preferably not used by Windows or SQL Server) instead of some share.

Minimum datatype
By default all columns are string 50 in a Flat File Connection Manager. To get as many rows in a data flow buffer it's important to use a minimum data type for each column. And if it's for example an integer column then parse it to an integer in the Flat File Connection Manager instead of parsing it in the data flow with a Data Conversion or Derived Column Transformation. Otherwise you end up with two columns in your buffer instead of one.

A good start is to use the Suggest Types button in the Flat File Connection Manager editor. It will scan a couple of rows from your flat file and come up with a minimum data type.

Suggest Types

Unused columns
In the Flat File Connection Manager it's impossible to skip columns that you don't need, but to minimize pressure on performance you should not parse unneeded columns. Parsing/converting is expensive. Just leave it string. In the Flat File Source editor you can uncheck the unneeded columns.
Leave it string, don't parse.

Uncheck unneeded flat file columns

Fast Parse
If you have a so called 'trusted' source (for example a database export) then you don't have to worry for mistakes in the data types. To speed up the performance you can enable fast parse for all non-string columns. But for date columns you have to be sure that the format is correct. Try using the ISO format YYYY-MM-DD.
You can enable fast parse in the Advanced Editor of the Flat File Source. Right click it and choose "Show Advanced Editor...".
Show Advanced Editor...

Then go to the Input and Output Properties tab and then to the Output Columns. Select a non-string column and set the FastParse property to true. Repeat this for all non-string columns.
Enable FastParse

Bulk Insert Task
If your destination is a SQL Server table and you don't need data transformations then you might want to consider/test the Bulk Insert Task as a alternative for the Data Flow Task.

Minimum data types
Parse in Connection Manager
No unnecessary parsing
Fast Parse

More info: Blog of Jamie Thomson and Henk van der Valk or MSDN

Saturday, 13 April 2013

Performance Best Practice: more rows per buffer

SSIS uses buffers to transport a set of rows through the data flow task. In general the less buffers you need to transport all rows from the source to the destination the faster it is.

You can compare a buffer going through a data flow, with a fire bucket going through a medieval line of people trying to extinguish a fire.

Buffers going through a data flow

You can extinguish the fire faster by adding an extra line of people (parallel process) or my making the buckets larger (but not too large, otherwise they can't lift the buckets).

For this example I use a table with Google Analytics data from my blog staged in a database table.
Table with quite large columns

Changing Buffer Size
By default the buffer size is 10 Megabytes (1024 * 1024 * 10 = 10,485,760 bytes). You can adjust this size for each Data Flow Task (see Data Flow Task Property DefaultBufferSize).
DefaultBufferSize 64KB <> 100MB

I have a Package that pauses between buffers with a Script Component. This allows me to see how many records there are in one buffer. In my case 1 row of Google Analytics data is about 29000 bytes and if I run my Data Flow Task I can see that 360 rows of data fit in to my 10MB buffer.
360 rows = ±10MB

By making my buffer twice the size you will see that the number of rows in my buffer doubles. You can change the buffer size up to 100MB, but you have to remember that there are multiple buffers going through your data flow at the same time. If you don't have enough memory to fit all those buffers, then SSIS will swap memory to disk making it very slow! So only use bigger buckets if you can lift them...
721 rows = ± 20MB

Adjust row size
Never use a table as a datasource! Selecting a table is akin to "SELECT *..." which is universally recognized as bad practice(*). Always use a query. This allows you to skip unwanted columns and rows.

Unchecking "Available External Columns" at the Columns pane of the OLE DB Source Editor will prevent unwanted column in the buffer, but the data will have to go from SQL Server to SSIS. Removing unwanted column with a query will reduce the number of network packets going from a SQL Server machine to your SSIS machine.

Use query instead of whole table

Let's remove an unwanted column to see the effect. Removing the nvarchar(4000) column [ga:referralPath] will reduce the total recordsize from 29030 to 21028 bytes. Now 497 rows instead of 320 will fit in my 10MB buffer.
Removing unwanted columns => more rows in buffer

Adjust Column Size
Often column are just unnecessary too big in the source. If you can't change the source then you could change it in the source query with a CAST or CONVERT. For example the nvarchar(4000) column for page titles is way to big. My titles are not that long and the characters should fit in varchar. Let's CAST it to varchar(255) and see how many extra rows will fit in the buffer.
Resizing datatypes => more rows in buffer

Now we have 787 rows instead of 320 rows in the 10mb buffer. That should speed up this package!

Note: if you change datatypes in an existing source query, then you have to uncheck and check the changed columns in the Columns pane of the OLE DB Source editor. This will refresh the metadata in the data flow. Otherwise you will get an error like: Validation error. DFT - Visits: DFT - Visits: Column "ga:pageTitle" cannot convert between unicode and non-unicode string data types.
Uncheck and check changed columns

Add WHERE clause
Don't use a Conditional Split Transformation to remove records after an database table source. Instead use a WHERE clause in your source query. The Conditional Split is a so called Magic Transformation. It will only visually hide records instead of removing them from the buffer. And the records are unnecessary transported from SQL Server to SSIS.
Add WHERE clause instead of Conditional Split

This is a Data Flow Task property that, if set to true, will 'remove' unused columns and unused transformations to improve performance. Default value is true, but this property isn't the holy grail and you shouldn't rely on it while developing your packages.

Enable logging on Data Flow Task
There are some events that you could log to get information about the buffer:
User:BufferSizeTuning: This will tell you how many rows there were in the buffer.
  • Rows in buffer type 0 would cause a buffer size greater than the configured maximum. There will be only 787 rows in buffers of this type.
  • Rows in buffer type 0 would cause a buffer size greater than the configured maximum. There will be only 1538 rows in buffers of this type.
User:PipelineInitialization: This will tell you something about which settings are used.
  • The default buffer size is 10485760 bytes.
  • Buffers will have 10000 rows by default.
  • The data flow will not remove unused components because its RunInOptimizedMode property is set to false.
BufferSizeTuning and PipelineInitialization

Performance Counters
SSIS comes with a set of performance counters that you can use to monitor the performance of the Data Flow Task. More about that in a later blog. For now see msdn.

Optimize Buffer Size
Use Source Query
Adjust Row Size
Adjust Column Size
Use Where clause

More info: Blog of Matt Masson, *Jamie ThomsonSQLCAT and MSDN

Tuesday, 9 April 2013

Performance Best Practice: Network Packet Size

If your SSIS package gets its data from a SQL Server database located on an other machine, then the data of your query will go from the SQL Server machine over the network to your SSIS machine. This is done in small network packets with a default size of 4096 bytes (4 kilobytes).

The Data Flow Task uses buffers to transport multiple rows of data through the data flow. Its default size is 10485760 bytes (10 megabytes). This means you need 2560 network packets (10485760 / 4096) to fill one buffer. By increasing the size of the packets you need less packets to fill that buffer. With the maximum size of 32768 bytes (32 kilobytes) you only need 320 network packets (10485760 / 32768) to fill one buffer.
This could, depending on the quality of your network, improve the performance significant (For poor performing networks you need to resend a lot of packets, making large packets inefficient).
Default Max Buffer Size: 10 MB

You can adjust the default size of 4096 bytes within SQL Server. However I strongly recommend not to change that in SQL Server. You should add the network packet size to the connectionstring to override this default value.
Do not change within SQL Server

Data Source=mySqlServer\SQL2012;Initial Catalog=myDatabase;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;Packet Size=8192;

Data Source=mySqlServer\sql2012;Initial Catalog=myDatabase;Integrated Security=True;Packet Size=10000;Application Name=SSIS-package;

SSIS OLE DB Connection Manager
Within SSIS you can change this value in the OLE DB Connection Manager or directly in the Connectionstring property. The maximum value for the network packet size is 32768 bytes, but the property is zero-based, so the max number is 32767. The value 0 means use the SQL Server default.

Network Packet Size property OLE DB

SSIS ADO.Net Connection Manager
The ADO.Net Connection Manager has a default of 8000 bytes. this one isn't zero-based, so its max value is 32768.
Network Packet Size property ADO.Net

Note 1: if you are using package configurations on the connectionstring property then don't forget to add this value to the connectionstring in your config table/file.
Note 2: you could enable jumbo frames on your network to increase performance even more, but you should consult a network specialist for that.

More info: Blog of Henk van der Valk, SQLCAT and MSDN.

Monday, 8 April 2013

SQL Saturday #221

The powerpoint (in Dutch!) of my SQL Saturday presentation about SSIS Performance Tuning is available for download. Added some screenshots for most demo's and a couple of URL's for additional information.


SQL Rally 2013
And some really good news... SQL Saturday in autumn 2013 will be replaced by SQL Rally. Three days instead of one day! SQL Rally will be hosted in Amsterdam on November 6-8 2013.
Related Posts Plugin for WordPress, Blogger...