Sunday, 15 April 2012

Regex filter for Foreach Loop

Case
The standard Foreach Loop File Enumerator only has a wildcard filter, but in some cases a regular expression filter would be more useful. See/vote this request at Microsoft Connect.

Solution
At the moment I'm working on a custom File Enumerator with regular expression support, but for those who don't want to wait or don't want to use custom components... You could achieve the same result with a Script Task.

*update: Now also available as Custom Foreach Enumerator. *

1) Foreach Loop
For this case I will use a standard Foreach File Enumerator that fills a variable FilePath. The filter is *.* so all files will be returned.
Standard foreach loop


















2) Variables
I will use two extra variables in this example: RegexFilter (string) for storing the regular expression and PassesRegexFilter (boolean) for indicating whether the filename passes the regular expression filter.
Variables













3) Script Task
Add a Script Task in front of the Data Flow Task and give it a suitable name.
Script Task






















4) Add variables
Edit the Script Task and add the FilePath and RegexFilter as ReadOnlyVariables and the PassesRegexFilter as ReadWriteVariable.
Variables






















5) The Script
Copy the following script to the Script Task.
// C# Code for filtering filenames with Regex
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.Text.RegularExpressions;   // Added
using System.IO;                        // Added

namespace ST_02b6595da2274d7182409fb43af929ae.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        public void Main()
        {
            // Get the filename from the complete filepath
            String FileName = Path.GetFileName(Dts.Variables["User::FilePath"].Value.ToString());

            // Create a regex object with the pattern from the SSIS variable
            Regex FileFilterRegex = new Regex(Dts.Variables["User::RegexFilter"].Value.ToString());
            
            // Check if it is match and return that value (boolean) to the SSIS variable
            Dts.Variables["User::PassesRegexFilter"].Value = FileFilterRegex.IsMatch(FileName); 
            
            Dts.TaskResult = (int)ScriptResults.Success;
        }
    }
}
or with VB.Net
' VB.Net Code for filtering filenames with Regex
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.Text.RegularExpressions   ' Added
Imports System.IO                        ' Added

<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
 Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

 Enum ScriptResults
  Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
  Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
 End Enum
 
    Public Sub Main()
        ' Get the filename from the complete filepath
        Dim FileName As String = Path.GetFileName(Dts.Variables("User::FilePath").Value.ToString())

        ' Create a regex object with the pattern from the SSIS variable
        Dim FileFilterRegex As Regex = New Regex(Dts.Variables("User::RegexFilter").Value.ToString())

        ' Check if it is match and return that value (boolean) to the SSIS variable
        Dts.Variables("User::PassesRegexFilter").Value = FileFilterRegex.IsMatch(FileName)

        Dts.TaskResult = ScriptResults.Success
    End Sub

End Class


6) Precedence Constraints Expression
Connect the Script Task to the Data Flow Task and add an expression that does the real filtering:
@[User::PassesRegexFilter] == true
Precedence Constraint with expression





















7) The result
For testing the result, I added an other Script Task below the Data Flow Task that logs all files. My regular expression in this test case: .*[0-9]\.csv$    (all csv files that end with a number in the filename)
The result























Note 1: to gain some performance I should have changed the Foreach Loop File Enumerator wildcard from *.* to *.csv

Note 2: this method could be a bit inefficient if your file collection contains thousands of files and you only need two of them.

6 comments:

  1. This has helped me significantly, but I'm running into problems with this part:

    " _ _ Partial Public Class ScriptMain Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase "

    SSIS 2005 doesn't acknowledge System.AddIn.AddIn, or VSTARTScriptObjectModelBase. I tried using ScriptObjectModel, but now receive the error "Error 30387: Class 'ScriptTask_ed44248e579b4e0baaac2b3f49154d1d.ScriptMain' must declare a 'Sub New' because its base class 'Microsoft.SqlServer.Dts.Tasks.ScriptTask.ScriptObjectModel' does not have an accessible 'Sub New' that can be called with no arguments."

    Talking about the number of files in the collection, I guess my project is relatively straightforward since there is a beginning portion of the file that remains constant, so I can use that in Foreach Loop setup to only return files that begin with that portion, it's just there are now two separate processes that are based on the ending of the file names.

    Thanks for your help.

    ReplyDelete
    Replies
    1. Code is from SSIS 2008. Just copy the two new imports rows and the first 6 lines (not counting the empty lines) of the main method to your own main method. Let me know if it worked.

      Delete
  2. hi joost!

    just wondering if regex can handle two different file names (i.e., circle files with circle patterns and square files with square patterns) instead of two different extensions in one foreach loop container?

    for example:
    square_a.xlsx
    square_b.xlsx
    square_c.xlsx
    circle_a.xlsx
    circle_b.xlsx

    ReplyDelete
    Replies
    1. Should be no problem at all. There is an or available within regex. The or operator is an pipe |
      You have to google for the right syntax.

      Delete
  3. He there. It would be easier, probably, to retrieve the names of the relevant files through a script task and put them into a variable (object) that would hold the collection and then loop over it in a foreach component. Just a thought... You would then avoid going through potentially unnecessary files. It would have the potential to be extremely fast regardless of the number of files in the relevant folder.

    ReplyDelete

Please use the SSIS MSDN forum for general SSIS questions that are not about this post. I'm a regular reader of that forum and will gladly answer those questions over there.

All comments are moderated manually to prevent spam.