Microsoft SQL Server Integration Services: BEST PRACTICES


Thursday, 1 December 2016

SSIS Naming conventions

In 2006 Jamie Thomson came up with naming conventions for SSIS tasks and data flow components. These naming conventions make your packages and logs more readable. Five SQL Server versions and a decade later, a couple of tasks and components have been deprecated, but Microsoft has also introduced a lot of new tasks and components.

Together with Koen Verbeeck (B|T) and André Kamman (B|T) we extended the existing list with almost 40 tasks/components and created a PowerShell script that should make it easier to check/enforce the naming conventions. This PowerShell script will soon be published on GitHub as a PowerShell module. But for now you can download and test the fully working proof of concept script. Download both ps1 files and the CSV file. Then open "naming conventions v4.ps1" and change the parameters before executing it. The script works with local packages because you can't read individual packages from the catalog, but you can use a PowerShell script to download your packages from the catalog.

PowerShell Naming Conventions Checker

Task name	Prefix	Type	New
For Loop Container	FLC	Container
Foreach Loop Container	FELC	Container
Sequence Container	SEQC	Container
ActiveX Script	AXS	Task
Analysis Services Execute DDL Task	ASE	Task
Analysis Services Processing Task	ASP	Task
Azure Blob Download Task	ADT	Task	*
Azure Blob Upload Task	AUT	Task	*
Azure HDInsight Create Cluster Task	ACCT	Task	*
Azure HDInsight Delete Cluster Task	ACDT	Task	*
Azure HDInsight Hive Task	AHT	Task	*
Azure HDInsight Pig Task	APT	Task	*
Back Up Database Task	BACKUP	Task	*
Bulk Insert Task	BLK	Task
CDC Control Task	CDC	Task	*
Check Database Integrity Task	CHECKDB	Task	*
Data Flow Task	DFT	Task
Data Mining Query Task	DMQ	Task
Data Profiling Task	DPT	Task	*
Execute Package Task	EPT	Task
Execute Process Task	EPR	Task
Execute SQL Server Agent Job Task	AGENT	Task	*
Execute SQL Task	SQL	Task
Execute T-SQL Statement Task	TSQL	Task	*
Expression Task	EXPR	Task
File System Task	FSYS	Task
FTP Task	FTP	Task
Hadoop File System Task	HFSYS	Task	*
Hadoop Hive Task	HIVE	Task	*
Hadoop Pig Task	PIG	Task	*
History Cleanup Task	HISTCT	Task	*
Maintenance Cleanup Task	MAINCT	Task	*
Message Queue Task	MSMQ	Task
Notify Operator Task	NOT	Task	*
Rebuild Index Task	REBIT	Task	*
Reorganize Index Task	REOIT	Task	*
Script Task	SCR	Task
Send Mail Task	SMT	Task
Shrink Database Task	SHRINKDB	Task	*
Transfer Database Task	TDB	Task
Transfer Error Messages Task	TEM	Task
Transfer Jobs Task	TJT	Task
Transfer Logins Task	TLT	Task
Transfer Master Stored Procedures Task	TSP	Task
Transfer SQL Server Objects Task	TSO	Task
Update Statistics Task	STAT	Task	*
Web Service Task	WST	Task
WMI Data Reader Task	WMID	Task
WMI Event Watcher Task	WMIE	Task
XML Task	XML	Task

Transformation name	Prefix	Type	New
ADO NET Source	ADO_SRC	Source	*
Azure Blob Source	AB_SRC	Source	*
CDC Source	CDC_SRC	Source	*
DataReader Source	DR_SRC	Source
Excel Source	EX_SRC	Source
Flat File Source	FF_SRC	Source
HDFS File Source	HDFS_SRC	Source	*
OData Source	ODATA_SRC	Source	*
ODBC Source	ODBC_SRC	Source	*
OLE DB Source	OLE_SRC	Source
Raw File Source	RF_SRC	Source
SharePoint List Source	SPL_SRC	Source
XML Source	XML_SRC	Source
Aggregate	AGG	Transformation
Audit	AUD	Transformation
Balanced Data Distributor	BDD	Transformation	*
Cache Transform	CCH	Transformation	*
CDC Splitter	CDCS	Transformation	*
Character Map	CHM	Transformation
Conditional Split	CSPL	Transformation
Copy Column	CPYC	Transformation
Data Conversion	DCNV	Transformation
Data Mining Query	DMQ	Transformation
Derived Column	DER	Transformation
DQS Cleansing	DQSC	Transformation	*
Export Column	EXPC	Transformation
Fuzzy Grouping	FZG	Transformation
Fuzzy Lookup	FZL	Transformation
Import Column	IMPC	Transformation
Lookup	LKP	Transformation
Merge	MRG	Transformation
Merge Join	MRGJ	Transformation
Multicast	MLT	Transformation
OLE DB Command	CMD	Transformation
Percentage Sampling	PSMP	Transformation
Pivot	PVT	Transformation
Row Count	CNT	Transformation
Row Sampling	RSMP	Transformation
Script Component	SCR	Transformation
Slowly Changing Dimension	SCD	Transformation
Sort	SRT	Transformation
Term Extraction	TEX	Transformation
Term Lookup	TEL	Transformation
Union All	ALL	Transformation
Unpivot	UPVT	Transformation
ADO NET Destination	ADO_DST	Destination	*
Azure Blob Destination	AB_DST	Destination	*
Data Mining Model Training	DMMT_DST	Destination
Data Streaming Destination	DS_DST	Destination	*
DataReaderDest	DR_DST	Destination
Dimension Processing	DP_DST	Destination
Excel Destination	EX_DST	Destination
Flat File Destination	FF_DST	Destination
HDFS File Destination	HDFS_DST	Destination	*
ODBC Destination	ODBC_DST	Destination	*
OLE DB Destination	OLE_DST	Destination
Partition Processing	PP_DST	Destination
Raw File Destination	RF_DST	Destination
Recordset Destination	RS_DST	Destination
SharePoint List Destination	SPL_DST	Destination
SQL Server Compact Destination	SSC_DST	Destination	*
SQL Server Destination	SS_DST	Destination

Example of the prefixes
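One place where the prefixes pay off is the log. The query below is a minimal sketch, assuming the packages run from the SSIS catalog (SSISDB, SQL Server 2012 or later) and that task names start with the prefix followed by a space, for example "DFT Load customers". The operation id is a made-up value.

```sql
-- Sketch: show all error messages raised by Data Flow Tasks in one execution.
-- Assumptions: tasks are named '<prefix> <description>' and
-- 12345 is a hypothetical operation (execution) id.
SELECT   em.message_time
,        em.package_name
,        em.message_source_name -- the task name, including its prefix
,        em.message
FROM     SSISDB.catalog.event_messages AS em
WHERE    em.operation_id = 12345
AND      em.message_type = 120 -- 120 = Error
AND      em.message_source_name LIKE 'DFT %'
ORDER BY em.message_time
```

Without a prefix you would have to know every task name to tell a Data Flow Task from an Execute SQL Task in this output.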

Sunday, 5 October 2014

SQL Saturday #336 Holland - PowerPoint slides

Had a nice day at SQL Saturday #336 in Utrecht! The PowerPoint slides of my SSIS Development Best Practices session are available for download. I added some screenshots, text and URLs for additional information (see the notes in the PowerPoint).

Saturday, 1 March 2014

SSIS 2012 with Team Foundation Server - Part II

Case
I have installed Team Explorer and set up Visual Studio to use it. What's next?

Solution
In Part I you read:
A) Install Team Explorer for Visual Studio 2010
B) Install Team Explorer for Visual Studio 2012
C) Set up Visual Studio to use TFS

This second part covers:
D) Adjusting development process

D) Adjusting development process
Because you can now work with multiple developers on the same project, you have to make some arrangements with your fellow developers, like:

1) Get latest version project
Get the latest version of the project on a regular basis. Otherwise you will miss new packages, project connection managers and project parameters. Do this for example each morning or before you start developing. There is also an option in Visual Studio to automatically get the latest version of the solution when opening it.

Get everything when a solution or project is opened.

2) Get latest version package
Get the latest version of a package before editing it. There is also an option in Visual Studio to automatically get the latest version of a package when checking it out.

Get latest version of item on check out.

3) Adding new package to project
When you add a new package to the project, the project itself will be checked out. First rename the new package, save it and then check in the project and the new (empty/clean) package. Otherwise your fellow developers cannot change project properties or add new packages.

Adding new package will check out the project

4) Disable multiple check out
Working together on the same file at the same time is nearly impossible, because it's hard to merge the XML of two versions of a package. Therefore you should disable multiple check out in TFS or check out your package exclusively (not the default in TFS).

In Team-menu click Team Project Settings, Source Control

Uncheck the multiple checkout box

5) Don't check in faulty packages
Try not to check in packages that don't work, especially when you work with the project deployment model, with which you can only deploy the complete project.

Don't check in faulty packages

6) No large/complex packages
Don’t make packages too large/complex. Divide the functionality over multiple smaller packages, because you can’t work with multiple developers on the same large package at the same time.

7) Sensitive data
The default Package Protection Level is EncryptSensitiveWithUserKey. This encrypts passwords and other sensitive data in the package with a key that is based on the Windows user profile of the developer. Because your colleagues have different user profiles, they can't edit or execute packages that you made without re-entering all sensitive package data.
The easiest way to overcome this is to use DontSaveSensitive as Package Protection Level in combination with Package Configurations. Then all the sensitive data will be stored in the configuration table or file and when you open the package all this data will be retrieved from the configuration table or file.
If you're using the Project Deployment Model in combination with sensitive parameters instead of Package Configurations, then the easiest workaround is to use EncryptAllWithPassword or EncryptSensitiveWithPassword with a password that is known within the development team.

8) Development standards
When you're developing with multiple people (or someone else is going to maintain your work) then it's good to have some Development Best Practices like using prefixes for tasks and transformations or using templates. This makes it easier to transfer work and to collaborate as a team.

9) Comments
When you check in a package, it's very useful to add a meaningful description of the change. This makes it easier to track history.

Check in comments

10) Branching, Labeling and building
Besides versioning and checking in/out packages there are more interesting functions in TFS that are probably more common in C# and VB.Net programming, but worth checking out. Here are some interesting links about TFS and SSIS:

SSIS & TFS general:
http://msdn.microsoft.com/en-us/library/dn463982.aspx
Branching:
http://www.mattmasson.com/2012/02/thoughts-on-branching-strategies-for-ssis-projects/
SSIS team development:
http://consultingblogs.emc.com/jamiethomson/archive/2007/08/06/SSIS_3A00_-Team-Development-Experiences.aspx
SSIS release and Source Control:
http://phil-austin.blogspot.com/2012/11/ssis-release-and-source-control.html
Building packages on TFS:
https://blogs.blackmarble.co.uk/blogs/rfennell/post/2013/04/24/Getting-SQL-2012-SSIS-packages-built-on-TFS-20122.aspx

Friday, 24 December 2010

Development Best Practices

Case
As an external employee I see a lot of SSIS packages at various companies made by a whole bunch of different people. Unfortunately some of those people made Quick & Dirty their motto in life, resulting in hard-to-read packages. And that's a waste of time for the companies.

Solution
Companies should require both well performing and well documented packages. Here is a list of some basic development Best Practices to achieve clear and manageable packages.

1) No default names and descriptions
Rename all default component names and give them explanatory descriptions. This will help other developers who edit your packages. It is also very useful when debugging.

No default names and descriptions

2) Annotations
Use annotations. This is very useful if the Control Flow or Data Flow isn't self-describing (for others).

Use annotations

3) Group logical work
Use Sequence containers to organize package structures into logical units of work. This makes it easier to identify what the package does. It also helps to control transactions if they are being implemented. * Update: SSIS 2012 has a grouping feature *

Use Sequence Containers

4) Flow directions
Flows should basically go top-down. This will make your packages more readable.

Design your package Top down

You can use the Auto-format option from SSIS to format your packages

Auto Layout is a good start

5) Disabled Control Flow tasks
Do not use disabled Control Flow tasks in the Quality assurance or Production environment. If you want to conditionally execute a task at runtime use expressions on your precedence constraints. Do not use an expression on the “Disable” property of the task.

Disabled Control Flow Task

6) Spread large number of packages over several Visual Studio projects
You can add more than one project to your Visual Studio solution to spread a large number of packages. Think about a proper layout. For example a datastaging project and a datawarehouse project.

7) Queries in source and lookup components
Don't use overly complex queries. Use a readable layout and add comments to explain parts of the query. For example:

-- This query does something 
SELECT    a.field1
,         a.field2
,         b.field3
,         b.field4
FROM      table1 as a
LEFT JOIN table2 as b
          on a.field5 = b.field6
WHERE     a.field2 = 'x' -- Comment about x
ORDER BY  a.field1

8) Script Coding Conventions
Use coding conventions when scripting a Script Task or Script Component. C# and VB.Net both have their own conventions, which are widely available on the net.

9) Use naming conventions
Give tasks and transformations a prefix. This makes it easier to read the logging.

10) Use templates
You can create templates for SSIS. Things like logging, configurations and connection managers can be added to these templates.

Let me know if you have items that should be in the list of Development Best Practices!

Wednesday, 22 December 2010

Performance Best Practices

Case
A client of mine had some performance issues with a couple of SSIS packages and because they lacked basic SSIS knowledge, they just upgraded their server with more memory. Finally, after 32GB of memory, they stopped upgrading and started reviewing their packages.

Solution
There are a lot of blogs about SSIS Best Practices (for instance: SSIS junkie). Here is my top 10 of easy-to-implement but very effective ones, which I showed them to 'upgrade' their packages instead of the memory.

1) Unnecessary columns
Select only the columns that you need in the pipeline to reduce buffer size and reduce OnWarning events at execution time. SSIS even helps you by showing the unnecessary ones in the Progress/Execution Results Tab: [DTS.Pipeline] Warning: The output column "Address1" (16161) on output "Output0" (16155) and component "CRM clients" (16139) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.

Unnecessary columns from a flat file

2) Use queries instead of tables
Following on the unnecessary columns, always use a SQL statement in an OLE DB Source component or (Fuzzy) Lookup component rather than just selecting a table. Selecting a table is akin to "SELECT *..." which is universally recognised as bad practice.

OLE DB Source, use SQL Command instead of Table

Lookup, use SQL Command instead of Table
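As a small sketch of the difference, assuming a hypothetical dbo.Customer table of which the data flow only needs three columns, the SQL Command of the source could look like this:

```sql
-- Hypothetical table and column names, for illustration only.
-- Selecting dbo.Customer in the table drop-down pulls every column
-- into the pipeline; this query only pulls the three that are used.
SELECT    c.CustomerId
,         c.CustomerName
,         c.City
FROM      dbo.Customer as c
```

The same applies to the reference query of a (Fuzzy) Lookup: select only the key columns and the columns you want to add to the pipeline.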

3) Use caching in your LOOKUP
Make sure that the result of your lookup is unique, otherwise SSIS cannot cache the query and executes it for each record passing the lookup component. SSIS will warn you about this in the Progress/Execution Results Tab: [Lookup Time Dimension [605]] Warning: The component "Lookup Time Dimension" (605) encountered duplicate reference key values when caching reference data. This error occurs in Full Cache mode only. Either remove the duplicate key values, or change the cache mode to PARTIAL or NO_CACHE.

Watch out that you are not grabbing too many resources in the lookup. A couple of million records is probably not a good idea. New in SSIS 2008 is that you can reuse your lookup cache in another lookup.

SSIS 2008: Cache
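To check up front whether a reference query returns duplicate keys, you could run something like the query below. It is only a sketch: dbo.DimTime and its DateValue column are hypothetical names for the reference table and the column the lookup joins on.

```sql
-- Hypothetical reference table and join column.
-- Every row returned is a key value that occurs more than once
-- and would trigger the duplicate reference key warning.
SELECT    d.DateValue
,         COUNT(*) as NumberOfRows
FROM      dbo.DimTime as d
GROUP BY  d.DateValue
HAVING    COUNT(*) > 1
```

If the query returns no rows, the reference set is unique on that key.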

4) Filter in source
Where possible filter your data in the Source Adapter rather than filter the data using a Conditional Split transform component. This will make your data flow perform quicker because the unnecessary records don't go through the pipeline.

Filter in OLE DB Source, filter data in source
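A minimal sketch of filtering in the source, assuming a hypothetical dbo.SalesOrder table of which only the orders from 2016 onwards are needed:

```sql
-- Hypothetical table, columns and date boundary.
-- The WHERE clause replaces a Conditional Split further down the data flow,
-- so older orders never enter the pipeline.
SELECT    o.OrderId
,         o.CustomerId
,         o.OrderDate
,         o.Amount
FROM      dbo.SalesOrder as o
WHERE     o.OrderDate >= '20160101' -- Only orders from 2016 onwards
```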

5) Sort in source
A sort with SQL Server is faster than the sort in SSIS, partly because SSIS does the sort in memory. So it pays to move the sort to a source component (where possible). Note that you have to set IsSorted=TRUE on the source adapter output, but setting this value does not perform a sort operation; it only indicates that the data is sorted. After that, set the SortKeyPosition of all output columns that are sorted.

Advanced Editor for Source, sort data in source
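As a sketch, again with a hypothetical dbo.SalesOrder table, the source query and the matching Advanced Editor settings could look like this:

```sql
-- Hypothetical table and columns.
-- Matching settings in the Advanced Editor of the source (set by hand,
-- SSIS does not derive them from the ORDER BY):
--   output:             IsSorted        = True
--   column CustomerId:  SortKeyPosition = 1
--   column OrderDate:   SortKeyPosition = 2
SELECT    o.CustomerId
,         o.OrderDate
,         o.Amount
FROM      dbo.SalesOrder as o
ORDER BY  o.CustomerId
,         o.OrderDate
```

The SortKeyPosition values have to follow the order of the columns in the ORDER BY clause. If they don't match the real sort order, components that rely on sorted input, like the Merge Join, can produce wrong results without raising an error.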

6) Join in source
Where possible, join data in the Source Adapter rather than using the Merge Join component. SQL Server does it faster than SSIS. But watch out that you are not making too complex queries because that will worsen the readability.

Unnecessary Join and Sorts
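A sketch of a join in the source, assuming two hypothetical tables in the same database. The Merge Join component needs sorted inputs, so this single query replaces two sources, the sorts and the Merge Join itself:

```sql
-- Hypothetical tables and columns.
-- One source query instead of two sources, two sorts and a Merge Join.
SELECT    o.OrderId
,         o.OrderDate
,         o.Amount
,         c.CustomerName
FROM      dbo.SalesOrder as o
LEFT JOIN dbo.Customer as c
          on o.CustomerId = c.CustomerId
```

This only works when both tables can be reached through the same connection. For sources on different servers or platforms the Merge Join (or a Lookup) remains the option.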

7) Group in source
Where possible, aggregate your data in the Source Adapter rather than using the Aggregate component. SQL Server does it faster than SSIS.

Unnecessary Sorts, Join and Aggregate
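A sketch of grouping in the source, with the same hypothetical dbo.SalesOrder table. The GROUP BY replaces an Aggregate component, which is fully blocking:

```sql
-- Hypothetical table and columns.
-- The database returns one row per customer, so the pipeline
-- never has to hold all order rows to calculate the totals.
SELECT    o.CustomerId
,         COUNT(*) as NumberOfOrders
,         SUM(o.Amount) as TotalAmount
FROM      dbo.SalesOrder as o
GROUP BY  o.CustomerId
```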

8) Beware of Non-blocking, Semi-blocking and Fully-blocking components in general
The data flow consists of three types of transformations: Non-blocking, Semi-blocking and Fully-blocking. As the names suggest, use Semi-blocking and Fully-blocking components sparingly to optimize your packages. Jorg Klein has written an interesting article about it with a list of which component is non-, semi- or fully blocking.

A summary of how to recognize these three types:

	Non-blocking	Semi-blocking	Fully-blocking
Synchronous/asynchronous	Synchronous	Asynchronous	Asynchronous
Number of rows in equal to rows out	True	Usually False	Usually False
Collects all input before it can output	False	False	True
New buffer created?	False	True	True
New thread created?	False	Usually True	True

Find more information about (a)synchronous at Microsoft.

9) High Volumes of Data and indexes
Loading high volumes of data on a table with clustered and non-clustered indexes could take a lot of time.
The most important thing to verify is whether all indexes are really used. SQL Server 2005 and 2008 provide information about index usage with two dynamic management objects: sys.dm_db_index_operational_stats and sys.dm_db_index_usage_stats. Drop all rarely used and unused indexes first. Experience teaches that there are often a lot of unnecessary indexes. If you are absolutely sure that all remaining indexes are necessary, you can drop all indexes before loading the data and recreate them afterwards. The performance gain depends on the number of records: the higher the number of records, the more you gain.

Drop and recreate indexes
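Both steps can be sketched in T-SQL. The table dbo.FactSales and the index IX_FactSales_CustomerKey are hypothetical names. The first query lists how often each index of the table was read and updated. The counters are reset when the SQL Server service restarts, so judge them over a representative period.

```sql
-- Step 1: index usage for one (hypothetical) table.
-- Indexes with many user_updates and (almost) no seeks, scans or lookups
-- are candidates for removal.
SELECT    i.name as IndexName
,         i.type_desc as IndexType
,         ISNULL(s.user_seeks, 0) as UserSeeks
,         ISNULL(s.user_scans, 0) as UserScans
,         ISNULL(s.user_lookups, 0) as UserLookups
,         ISNULL(s.user_updates, 0) as UserUpdates
FROM      sys.indexes as i
LEFT JOIN sys.dm_db_index_usage_stats as s
          on  s.object_id = i.object_id
          and s.index_id = i.index_id
          and s.database_id = DB_ID()
WHERE     i.object_id = OBJECT_ID('dbo.FactSales')
ORDER BY  i.index_id;

-- Step 2: drop a non-clustered index before the load...
DROP INDEX IX_FactSales_CustomerKey ON dbo.FactSales;

-- ...(the Data Flow Task runs here)...

-- ...and recreate it after the load.
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
ON dbo.FactSales (CustomerKey);
```

In a package the DROP and CREATE statements would typically go in two Execute SQL Tasks, one before and one after the Data Flow Task.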

10) SQL Server Destination Adapter vs OLE DB Destination Adapter
If your target database is a local SQL Server database, the SQL Server Destination Adapter will perform much better than the OLE DB Destination Adapter. However the SQL Server Destination Adapter works only on a local machine and via Windows security. You have to be absolutely sure that your database stays local in the future, otherwise your mapping will not work when moving the database.

Note: this is not a complete list, but just a top 10 of easy to implement but very effective ones. Tell me if you have items that should be in the top 10 of Performance Best Practices!

Note: Besides the Performance Best Practices there is also a list of Development Best Practices.