Saturday, 5 March 2011

Cleaning with Regular Expressions in SSIS

Case
I have a column with html tags in the value. How can I remove those tags?

Solution
You could try the make an expression in a derived column, but it would probably become an unreadable expression. There are a couple of alternatives like the Konesans RegexClean Transformation. Links to more third party and open source components are availble in a previous article. And if your source is a database then you could use a user defined function, but  in this case we will be using a .Net Regular Expression within a Script Component to remove the html tags.

File example:
"Name";"Message"
"Joost";"Hallo, <strong>this</strong> is bold."
"William";"This is my homepage: <a href='http://microsoft-ssis.blogspot.com/'>http://microsoft-ssis.blogspot.com/</a>."


1) Source
Add a Flat File Source Component for the textfile above.
Flat File Source
















2) Script Component
Add a Script Component type transformation below the Flat File Source and give it a suitable name.
Script Component - transformation






















3) Input Column
Edit the Script Component and select the message column as ReadWrite on the Input Columns tab.
Message as ReadWrite input column



















4) The Script
SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest. We are also adding a new method that removes the html tags. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.
// C# code
using System;
using System.Data;
using System.Text.RegularExpressions;    // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // Method that will be executed for each row.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Replace the value of the message column
        Row.Message = RemoveHtml(Row.Message);
    }

    // A string method that removes html tags
    // with a regex pattern
    public String RemoveHtml(String message)
    {
        // The pattern for a html tag
        String htmlTagPattern = "<(.|\n)+?>";
        // Create a regex object with the pattern 
        Regex objRegExp = new Regex(htmlTagPattern);
        // Replace html tag by an empty string
        message = objRegExp.Replace(message, String.Empty);
        // Return the message without html tags
        return message;
    }
}

or VB.Net

' VB.Net code
Imports System
Imports System.Data
Imports System.Math
Imports System.Text.RegularExpressions      ' Added
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Replace the value of the message column
        Row.Message = RemoveHtml(Row.Message)
    End Sub

    ' A string method that removes html tags
    ' with a regex pattern
    Public Function RemoveHtml(message As [String]) As [String]
        ' The pattern for a html tag
        Dim htmlTagPattern As [String] = "<(.|" & vbLf & ")+?>"
        ' Create a regex object with the pattern 
        Dim objRegExp As New Regex(htmlTagPattern)
        ' Replace html tag by an empty string
        message = objRegExp.Replace(message, [String].Empty)
        ' Return the message without html tags
        Return message
    End Function
End Class


Note: this script uses a very simple regular expression. Search the web for other/better regular expressions.

5) The Result
For testing purposes I added a Derived Columns and two data viewes on the Data Flow Paths.


















SSIS 2008 Package example

4 comments:

  1. On replicating the steps you mentioned, i am geting blank value in the column.

    However, i check the input value, it is not blank in the column. I even created a new output column, it showed the value in it after regex operation.

    I have made the Read/Write too on the column.

    Can you please help why it is making it blank?

    ReplyDelete
    Replies
    1. Hi Anshul Agarwal,

      It's hard to see where it went wrong, but I will publish an example package tomorrow.

      Kind regards,

      Joost

      Delete
  2. Hi Joost,

    I tried to replicate your example but i got some issues. I'm not able to build the script and second issues is my source is in DT_NTEXT need to convert to DT_WSTR and i tried using derived column and conversion to make it work the size is 8000 for the source column in Sql Server table.What would be the best procedure. I'm new to C# or .Net could you give me some example on that please ..

    ReplyDelete
    Replies
    1. There is a package example at the bottum of the post. If you got errors then please mention then in your message of contact me via the form on the contact me page (upper right)

      Delete

Please use the SSIS MSDN forum for general SSIS questions that are not about this post. I'm a regular reader of that forum and will gladly answer those questions over there.

All comments are moderated manually to prevent spam.