Friday, 4 March 2011

Regular Expressions in SSIS

Case
I want to use a Regular Expression to filter incorrect emailaddresses in SSIS.

Solution
There are a couple of options:
Let's elaborate the last option and use this file as a source:
"Name";"emailaddress"
"Joost";"joost@blogger.com"
"John";"john@.nl"
"Max";"max@gmail.com"
"William";"william@hotmail,com"
"Bill";"b.gates@microsoft.com"

1) Add source
Add a Flat File Source Component in your dataflow for the file example above.
Flat File mailinglist
















2) Add Script Component
Add a Script Component, type transformation, after your Flat File Source Component.
Script Component (transformation)


















3) Input Columns
 Edit the Script Component and select the email address column as an input column (readonly).
Input Columns tab



















4) Outputs
Change the name of the default output port to CorrectEmail and set the ExclusionGroup to 1. Add an extra output port for the IncorrectEmail and set the ExclusionGroup to 1. Also select the input port as the SynchronousInputID. Now you have two output ports. One for the correct and one for the incorrect email address.
Inputs and Outputs tab













4) The Script
Hit the Edit Script button to open the VSTA editor. SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest of the methods. We are also adding a new method that validates the email address. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.
// C# code
using System;
using System.Data;
using System.Text.RegularExpressions;   // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{

    // Method that will be executed for each row.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Validate the email address
        if (IsCorrectEmail(Row.Email))
        {
            // Redirect the correct email address
            Row.DirectRowToCorrectEmail();
        }
        else
        {
            // Redirect the incorrect email address
            Row.DirectRowToInCorrectEmail();
        }
    }
    
    // A boolean method that validates an email address
    // with a regex pattern.
    public bool IsCorrectEmail(String emailAddress)
    {
        // The pattern for email
        string emailAddressPattern = @"^(([^<>()[\]\\.,;:\s@\""]+"
                        + @"(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@"
                        + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
                        + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
                        + @"[a-zA-Z]{2,}))$";
        // Create a regex object with the pattern
        Regex emailAddressRegex = new Regex(emailAddressPattern);
        // Check if it is match and return that value (boolean)
        return emailAddressRegex.IsMatch(emailAddress);
    }
}

' VB.Net code (SSIS 2008)
Imports System
Imports System.Data
Imports System.Math
Imports System.Text.RegularExpressions    ' Added
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

 _
 _
Public Class ScriptMain
    Inherits UserComponent

    ' Method that will be executed for each row.
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Validate the email address
        If (IsCorrectEmail(Row.Email)) Then
            ' Redirect the correct email address
            Row.DirectRowToCorrectEmail()
        Else
            ' Redirect the incorrect email address
            Row.DirectRowToIncorrectEmail()
        End If
    End Sub

    ' A boolean method that validates an email address
    ' with a regex pattern.
    Public Function IsCorrectEmail(ByVal emailAddress As String) As Boolean
        ' The pattern for email
        Dim emailAddressPattern As String = "^(([^<>()[\]\\.,;:\s@\""]+" + _
        "(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@" + _
        "((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" + _
        "\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+" + _
        "[a-zA-Z]{2,}))$"

        ' Create a regex object with the pattern
        Dim emailAddressRegex As Regex = New Regex(emailAddressPattern)

        ' Check if it is match and return that value (boolean)
        Return emailAddressRegex.IsMatch(emailAddress)
    End Function

End Class


5) The result
For testing purposes I added two Derived Columns with data viewes on the Data Flow Paths.
The result














Jamie Thomson, alias SSIS Junkie, has an other example in VB.net available. And you can also use Regular Expression to clean column values. More about that is described in this article.

8 comments:

  1. Keep in mind I'm not a C# programmer, but your code doesn't work for me. I get too many errors when placing in the code of a script task. Obviously this is most likely because you ommited to explain how this is accomplished in your example. I would like to see you put a little more step by step information instead of assuming that all of us reading your blog know what you know.
    Thanks
    Joe

    ReplyDelete
  2. @Kevin Ward: Thanks for the feedback. Try the example of SSIS MVP Jamie Thomson. May be you find his example easier.

    And an even better option for non-programmers is to use one of the third party/open source components I mentioned. Then you don't need my difficult coding.

    If you do want to use my coding. Let me know what errors you got or where you lost me.

    ReplyDelete
  3. can you plz tell me where to place c# code???

    ReplyDelete
    Replies
    1. Double click the script component and hit the Edit Script-button on the first page.

      Delete
  4. Hi Joost

    Thanks for the script. I created the script, it runs ok, but I do not see my data after the scripting step. I am trying to validate a 5 digits US zip code (12345), I used "^\\d{5}$" as the zipCode pattern.

    Best Regards

    ReplyDelete
  5. For my previous question, I got it to work; the problem is that I forgot to save the script after I ran it, believing that it will be saved automatically.

    I followed the tutorial and got my package to work.


    Thank you.

    ReplyDelete
    Replies
    1. Code in the .cs or .vb file is automatically saved when you close the vsta editor. Only changes in the internal script component project have to be saved by pressing Save All. For example when you add a reference to a dll.

      Delete

Please use the SSIS MSDN forum for general SSIS questions that are not about this post. I'm a regular reader of that forum and will gladly answer those questions over there.

All comments are moderated manually to prevent spam.