Showing posts with label REGEX. Show all posts
Showing posts with label REGEX. Show all posts

Tuesday, 17 April 2012

Custom SSIS Component: Foreach File Enumerator with regex

In some cases the standard wildcard filter in the Foreach File Enumerator isn't flexible enough. Try to get all files ending with a number or all files starting with an 'a' or 'b'. A filter based on a regular expression gives more freedom. See/vote this request at Microsoft Connect.

My .Net colleague helped me to create a File Enumerator where I replaced the standard wildcard filter by a regular expression filter. And we also added a sorting possibility.
Regular Expression Support























This Regex File Enumerator also alows you to select multiple extensions like *.xls + *.csv:
^.*\.(xls|csv)$

Disclaimer
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Downloads:
You can download a SSIS 2008 version on the download page. All my custom 2012 enumerators have an optional regex filter included.

Installation
The installer registers the DLL in the GAC and copies it to the ForEachEnumerators folder of SSIS (example: C:\Program Files\Microsoft SQL Server\100\DTS\ForEachEnumerators\). After that you have to close Visual Studio/BIDS because it caches the GAC on startup.

SSIS
The enumerator is automatically added to the Foreach Loop when you open a SSIS project.
The new enumerator




























Bugs or feature suggestions
Please contact me or leave a comment if you have any suggestions or found a bug in this Custom Enumerator.

Sunday, 15 April 2012

Regex filter for Foreach Loop

Case
The standard Foreach Loop File Enumerator only has a wildcard filter, but in some cases a regular expression filter would be more useful. See/vote this request at Microsoft Connect.

Solution
At the moment I'm working on a custom File Enumerator with regular expression support, but for those who don't want to wait or don't want to use custom components... You could achieve the same result with a Script Task.

*update: Now also available as Custom Foreach Enumerator. *

1) Foreach Loop
For this case I will use a standard Foreach File Enumerator that fills a variable FilePath. The filter is *.* so all files will be returned.
Standard foreach loop


















2) Variables
I will use two extra variables in this example: RegexFilter (string) for storing the regular expression and PassesRegexFilter (boolean) for indicating whether the filename passes the regular expression filter.
Variables













3) Script Task
Add a Script Task in front of the Data Flow Task and give it a suitable name.
Script Task






















4) Add variables
Edit the Script Task and add the FilePath and RegexFilter as ReadOnlyVariables and the PassesRegexFilter as ReadWriteVariable.
Variables






















5) The Script
Copy the following script to the Script Task.
// C# Code for filtering filenames with Regex
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.Text.RegularExpressions;   // Added
using System.IO;                        // Added

namespace ST_02b6595da2274d7182409fb43af929ae.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        public void Main()
        {
            // Get the filename from the complete filepath
            String FileName = Path.GetFileName(Dts.Variables["User::FilePath"].Value.ToString());

            // Create a regex object with the pattern from the SSIS variable
            Regex FileFilterRegex = new Regex(Dts.Variables["User::RegexFilter"].Value.ToString());
            
            // Check if it is match and return that value (boolean) to the SSIS variable
            Dts.Variables["User::PassesRegexFilter"].Value = FileFilterRegex.IsMatch(FileName); 
            
            Dts.TaskResult = (int)ScriptResults.Success;
        }
    }
}
or with VB.Net
' VB.Net Code for filtering filenames with Regex
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.Text.RegularExpressions   ' Added
Imports System.IO                        ' Added

<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
 Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

 Enum ScriptResults
  Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
  Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
 End Enum
 
    Public Sub Main()
        ' Get the filename from the complete filepath
        Dim FileName As String = Path.GetFileName(Dts.Variables("User::FilePath").Value.ToString())

        ' Create a regex object with the pattern from the SSIS variable
        Dim FileFilterRegex As Regex = New Regex(Dts.Variables("User::RegexFilter").Value.ToString())

        ' Check if it is match and return that value (boolean) to the SSIS variable
        Dts.Variables("User::PassesRegexFilter").Value = FileFilterRegex.IsMatch(FileName)

        Dts.TaskResult = ScriptResults.Success
    End Sub

End Class


6) Precedence Constraints Expression
Connect the Script Task to the Data Flow Task and add an expression that does the real filtering:
@[User::PassesRegexFilter] == true
Precedence Constraint with expression





















7) The result
For testing the result, I added an other Script Task below the Data Flow Task that logs all files. My regular expression in this test case: .*[0-9]\.csv$    (all csv files that end with a number in the filename)
The result























Note 1: to gain some performance I should have changed the Foreach Loop File Enumerator wildcard from *.* to *.csv

Note 2: this method could be a bit inefficient if your file collection contains thousands of files and you only need two of them.

Saturday, 5 March 2011

Cleaning with Regular Expressions in SSIS

Case
I have a column with html tags in the value. How can I remove those tags?

Solution
You could try the make an expression in a derived column, but it would probably become an unreadable expression. There are a couple of alternatives like the Konesans RegexClean Transformation. Links to more third party and open source components are availble in a previous article. And if your source is a database then you could use a user defined function, but  in this case we will be using a .Net Regular Expression within a Script Component to remove the html tags.

File example:
"Name";"Message"
"Joost";"Hallo, <strong>this</strong> is bold."
"William";"This is my homepage: <a href='http://microsoft-ssis.blogspot.com/'>http://microsoft-ssis.blogspot.com/</a>."


1) Source
Add a Flat File Source Component for the textfile above.
Flat File Source
















2) Script Component
Add a Script Component type transformation below the Flat File Source and give it a suitable name.
Script Component - transformation






















3) Input Column
Edit the Script Component and select the message column as ReadWrite on the Input Columns tab.
Message as ReadWrite input column



















4) The Script
SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest. We are also adding a new method that removes the html tags. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.
// C# code
using System;
using System.Data;
using System.Text.RegularExpressions;    // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // Method that will be executed for each row.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Replace the value of the message column
        Row.Message = RemoveHtml(Row.Message);
    }

    // A string method that removes html tags
    // with a regex pattern
    public String RemoveHtml(String message)
    {
        // The pattern for a html tag
        String htmlTagPattern = "<(.|\n)+?>";
        // Create a regex object with the pattern 
        Regex objRegExp = new Regex(htmlTagPattern);
        // Replace html tag by an empty string
        message = objRegExp.Replace(message, String.Empty);
        // Return the message without html tags
        return message;
    }
}

or VB.Net

' VB.Net code
Imports System
Imports System.Data
Imports System.Math
Imports System.Text.RegularExpressions      ' Added
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Replace the value of the message column
        Row.Message = RemoveHtml(Row.Message)
    End Sub

    ' A string method that removes html tags
    ' with a regex pattern
    Public Function RemoveHtml(message As [String]) As [String]
        ' The pattern for a html tag
        Dim htmlTagPattern As [String] = "<(.|" & vbLf & ")+?>"
        ' Create a regex object with the pattern 
        Dim objRegExp As New Regex(htmlTagPattern)
        ' Replace html tag by an empty string
        message = objRegExp.Replace(message, [String].Empty)
        ' Return the message without html tags
        Return message
    End Function
End Class


Note: this script uses a very simple regular expression. Search the web for other/better regular expressions.

5) The Result
For testing purposes I added a Derived Columns and two data viewes on the Data Flow Paths.


















SSIS 2008 Package example

Friday, 4 March 2011

Regular Expressions in SSIS

Case
I want to use a Regular Expression to filter incorrect emailaddresses in SSIS.

Solution
There are a couple of options:
Let's elaborate the last option and use this file as a source:
"Name";"emailaddress"
"Joost";"joost@blogger.com"
"John";"john@.nl"
"Max";"max@gmail.com"
"William";"william@hotmail,com"
"Bill";"b.gates@microsoft.com"

1) Add source
Add a Flat File Source Component in your dataflow for the file example above.
Flat File mailinglist
















2) Add Script Component
Add a Script Component, type transformation, after your Flat File Source Component.
Script Component (transformation)


















3) Input Columns
 Edit the Script Component and select the email address column as an input column (readonly).
Input Columns tab



















4) Outputs
Change the name of the default output port to CorrectEmail and set the ExclusionGroup to 1. Add an extra output port for the IncorrectEmail and set the ExclusionGroup to 1. Also select the input port as the SynchronousInputID. Now you have two output ports. One for the correct and one for the incorrect email address.
Inputs and Outputs tab













4) The Script
Hit the Edit Script button to open the VSTA editor. SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest of the methods. We are also adding a new method that validates the email address. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.
// C# code
using System;
using System.Data;
using System.Text.RegularExpressions;   // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{

    // Method that will be executed for each row.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Validate the email address
        if (IsCorrectEmail(Row.Email))
        {
            // Redirect the correct email address
            Row.DirectRowToCorrectEmail();
        }
        else
        {
            // Redirect the incorrect email address
            Row.DirectRowToInCorrectEmail();
        }
    }
    
    // A boolean method that validates an email address
    // with a regex pattern.
    public bool IsCorrectEmail(String emailAddress)
    {
        // The pattern for email
        string emailAddressPattern = @"^(([^<>()[\]\\.,;:\s@\""]+"
                        + @"(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@"
                        + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
                        + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
                        + @"[a-zA-Z]{2,}))$";
        // Create a regex object with the pattern
        Regex emailAddressRegex = new Regex(emailAddressPattern);
        // Check if it is match and return that value (boolean)
        return emailAddressRegex.IsMatch(emailAddress);
    }
}

' VB.Net code (SSIS 2008)
Imports System
Imports System.Data
Imports System.Math
Imports System.Text.RegularExpressions    ' Added
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

 _
 _
Public Class ScriptMain
    Inherits UserComponent

    ' Method that will be executed for each row.
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Validate the email address
        If (IsCorrectEmail(Row.Email)) Then
            ' Redirect the correct email address
            Row.DirectRowToCorrectEmail()
        Else
            ' Redirect the incorrect email address
            Row.DirectRowToIncorrectEmail()
        End If
    End Sub

    ' A boolean method that validates an email address
    ' with a regex pattern.
    Public Function IsCorrectEmail(ByVal emailAddress As String) As Boolean
        ' The pattern for email
        Dim emailAddressPattern As String = "^(([^<>()[\]\\.,;:\s@\""]+" + _
        "(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@" + _
        "((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" + _
        "\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+" + _
        "[a-zA-Z]{2,}))$"

        ' Create a regex object with the pattern
        Dim emailAddressRegex As Regex = New Regex(emailAddressPattern)

        ' Check if it is match and return that value (boolean)
        Return emailAddressRegex.IsMatch(emailAddress)
    End Function

End Class


5) The result
For testing purposes I added two Derived Columns with data viewes on the Data Flow Paths.
The result














Jamie Thomson, alias SSIS Junkie, has an other example in VB.net available. And you can also use Regular Expression to clean column values. More about that is described in this article.
Related Posts Plugin for WordPress, Blogger...