Thursday 29 December 2011

ProperCase in SSIS

Case
I have a source where all data is in uppercase and I want to use ProperCase (similar synonyms: UpperCamelCase, PascalCase, TitleCase, StartCase) instead. How do I do that in SSIS?
All uppercase








Solution
A) If your source is a SQL Server table then you could use a User-Defined Function with some TSQL code (example 1, example 2) to proper case your data.

B) You could come up with some clever and complex, but probably unreadable expression.

C) You could use a Script Component with one line of code for the actual proper case.

Let’s elaborate that Script Component solution. VB.net has a standard method for it: StrConv, but it’s also possible in C#.

1) Add Script Component
Go to the Data Flow and add a Script Component (type transformation) after your source. Connect it to your source and give it a suitable name.
Script Component (type transformation)
























2) Input Columns
Edit the Script Component and go to the Input Columns tab. Select the columns that you want to change as ReadWrite.
Input Columns: ReadWrite























3) The Script
Select the programming language (Visual Basic), edit the Script and copy the following VB.net code.
' VB.Net code
' The standard VB.net string function to convert
' a string to ProperCase (also called TitleCase)
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    ' Just one line of code foreach column
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Row.FirstName = StrConv(Row.FirstName, VbStrConv.ProperCase)
        Row.LastName = StrConv(Row.LastName, VbStrConv.ProperCase)
    End Sub

End Class

or use the official C# variant:
// C# code
// The C# variant of Proper Case (also title case)
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Globalization;     // Added
using System.Threading;         // Added

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // C# alternative for VB.Net StrConv uses culterinfo and threading
        // See: http://support.microsoft.com/kb/312890/
        CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
        TextInfo textInfo = cultureInfo.TextInfo;

        // The actual Proper Casing
        Row.FirstName = textInfo.ToTitleCase(Row.FirstName);
        Row.LastName = textInfo.ToTitleCase(Row.LastName);

        // Update: Or try this!
        // Row.FirstName = new System.Globalization.CultureInfo("en").TextInfo.ToTitleCase(Row.FirstName);
        // Row.LastName = new System.Globalization.CultureInfo("en").TextInfo.ToTitleCase(Row.LastName);
    }
}

NOTE: for some reason it doesn't work if all chars are uppercase, but it's also possible to use the VB.net VbStrConv.ProperCase in your C# code by adding a reference to Visual Basic.Net

// C# code
// Using the Visual Basic VbStrConv.ProperCase
// by adding a reference to Visual Basic.
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Microsoft.VisualBasic;         // Added
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        Row.FirstName = Strings.StrConv(Row.FirstName, VbStrConv.ProperCase, 0);
        Row.LastName = Strings.StrConv(Row.LastName, VbStrConv.ProperCase, 0);
    }
}

Here is how you add the Visual Basic Reference.
Adding Visual Basic reference


















4) The result
I added a destination and two Data Viewers for testing purposes.
The Result

Saturday 3 December 2011

Redirect duplicate rows

Case
I have duplicate records in my source which I want to redirect to a separate destination in my data flow task. Most solutions like the Aggregate Transformation or the Sort Transformation with "Remove rows with duplicate sort values"  selected, will remove the records instead of redirecting them.

Example: If there are three identical records (two columns), then two of them should be redirected as duplicate so that I can capture them in an error destination.

Solution
This solution uses a TSQL order by or a SSIS sort in combination with a Script Component to detect and redirect all duplicate records.

If your source is a SQL Server table you could probably come up with a fancy TSQL deduplication script to do the same in less processing time, but this solution is easy and also works for other kind of sources. Even if for example you only want to compare two out of five columns.

1) Source and Sort
Add your source to the Data Flow Task. You need to sort on the columns you want to use for deduplication. If your source is a database table, then you should add a GROUP BY clause to the query else you should add a SORT Transformation right after your source.
Add a sorted source


















2) Add Script Component
Add a Script Component type Transformation after your sorted source and give it a suitable name.
Add Script Component





















3) Edit Script Component: Input Columns
Edit the Script Component and go to the Input Columns tab and select all the columns you want to use for deduplication as ReadOnly (the same fields as you used for sorting).
Select columns.Type doesn't matter.





















4) Edit Script Components: Inputs and Outputs, Edit Output
Go to the Inputs and Outputs tab and change the name of the Output port from "Output 0" to "Unique". Also change the ExclusionGroup to 1.
Rename default output port





















5) Edit Script Components: Inputs and Outputs, New output
Add a new Output and give it the name Duplicate and change the ExlusionGroup to 1. To connect this new output port to the input port, change the SynchronousInputID property and select the input port.
New Output port






















6) The Script
Edit the Script and copy the following code. This script uses reflection to get all selected columns so that you don't have to change the script if you change the input columns. But read the coding comments.
// C# code
// This script automaticly compares the selected columns, but there is one 'bug':
// You have to edit and close this script again if you change input columns.
using System;
using System.Data;
using System.Reflection;                                // Added
using System.Text;                                      // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // Create a variable to store the concatenated values for the previous row
    string previousRow = "";

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Create a variable to store the concatenated values of current row
        StringBuilder currentRow = new StringBuilder();

        foreach (PropertyInfo p in Row.GetType().GetProperties())
        {
            // We can't use the _IsNull columns, so ignore them. Also ignore the new output column
            if ((p.Name.ToLower().EndsWith("_IsNull") == false) && (p.Name.Equals("Duplicate") == false))
            {
                try
                {
                    // Concatenate value as string to variable
                    currentRow.Append(p.GetValue(Row, null).ToString() + "|");
                }
                catch (ArgumentException)
                {
                    // If the value is NULL (empty) then you can't get the value of it
                    currentRow.Append("NULL|");
                }
                catch (Exception ex)
                {
                    // Raise error because something unexpected went wrong
                    bool pbCancel = false;
                    this.ComponentMetaData.FireError(0, "MarkDuplicates", p.Name + ": " + ex.Message, string.Empty, 0, out pbCancel);
                }
            }
        }

        // Check if the current row and previous row are the same
        if (currentRow.ToString().Equals(previousRow))
        {
            // Redirect to duplicate output
            Row.DirectRowToDuplicate();
        }
        else
        {
            // Redirect to unique output
            Row.DirectRowToUnique();
        }

        // Fill previous row with current value for next check
        previousRow = currentRow.ToString();
    }
}

or VB.Net
' VB.Net code
' This script automaticly compares the selected columns, but there is one 'bug':
' You have to edit and close this script again if you change input columns.

Imports System
Imports System.Data
Imports System.Math
Imports System.Reflection                           ' Added
Imports System.Text                                 ' Added
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    ' Create a variable to store the concatenated values for the previous row
    Dim previousRow As String = ""

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Create a variable to store the concatenated values of current row
        Dim currentRow As StringBuilder = New StringBuilder()

        Dim p As PropertyInfo
        For Each p In Row.GetType().GetProperties()
            ' We can't use the _IsNull columns, so ignore them. Also ignore the new output column
            If ((p.Name.ToLower().EndsWith("_IsNull") = False) And (p.Name.Equals("Duplicate") = False)) Then
                Try
                    ' Concatenate value as string to variable
                    currentRow.Append(p.GetValue(Row, Nothing).ToString() + "|")
                Catch ex As ArgumentException
                    ' If the value is NULL (empty) then you can't get the value of it
                    currentRow.Append("NULL|")
                Catch ex As Exception
                    ' Raise error because something unexpected went wrong
                    Dim pbCancel As Boolean
                    Me.ComponentMetaData.FireError(0, "MarkDuplicates", p.Name + ": " + ex.Message, String.Empty, 0, pbCancel)
                End Try
            End If
        Next

        ' Check if the current row and previous row are the same
        If (currentRow.ToString().Equals(previousRow)) Then
            ' Redirect to duplicate output
            Row.DirectRowToDuplicate()
        Else
            ' Redirect to unique output
            Row.DirectRowToUnique()
        End If

        ' Fill previous row with current value for next check
        previousRow = currentRow.ToString()
    End Sub
End Class


6) Destinations
Add two destinations and connect the Data Flow Paths to the destinations. When connecting you will have to select the output port.
Choose right output















7) Testing
I added some dataviewers for testing purposes and selected the first three columns for deduplication.
The Result

























All roads lead to Rome, so let me know what your solution is!
Update: here is an other cool solution that doesn't need sorting.
Related Posts Plugin for WordPress, Blogger...