I have a column with html tags in the value. How can I remove those tags?
Solution
You could try the make an expression in a derived column, but it would probably become an unreadable expression. There are a couple of alternatives like the Konesans RegexClean Transformation. Links to more third party and open source components are availble in a previous article. And if your source is a database then you could use a user defined function, but in this case we will be using a .Net Regular Expression within a Script Component to remove the html tags.
File example:
"Name";"Message" "Joost";"Hallo, <strong>this</strong> is bold." "William";"This is my homepage: <a href='http://microsoft-ssis.blogspot.com/'>http://microsoft-ssis.blogspot.com/</a>." |
1) Source
Add a Flat File Source Component for the textfile above.
Flat File Source |
2) Script Component
Add a Script Component type transformation below the Flat File Source and give it a suitable name.
Script Component - transformation |
3) Input Column
Edit the Script Component and select the message column as ReadWrite on the Input Columns tab.
Message as ReadWrite input column |
4) The Script
SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest. We are also adding a new method that removes the html tags. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.
// C# code using System; using System.Data; using System.Text.RegularExpressions; // Added using Microsoft.SqlServer.Dts.Pipeline.Wrapper; using Microsoft.SqlServer.Dts.Runtime.Wrapper; [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute] public class ScriptMain : UserComponent { // Method that will be executed for each row. public override void Input0_ProcessInputRow(Input0Buffer Row) { // Replace the value of the message column Row.Message = RemoveHtml(Row.Message); } // A string method that removes html tags // with a regex pattern public String RemoveHtml(String message) { // The pattern for a html tag String htmlTagPattern = "<(.|\n)+?>"; // Create a regex object with the pattern Regex objRegExp = new Regex(htmlTagPattern); // Replace html tag by an empty string message = objRegExp.Replace(message, String.Empty); // Return the message without html tags return message; } }
or VB.Net
' VB.Net code Imports System Imports System.Data Imports System.Math Imports System.Text.RegularExpressions ' Added Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _ <CLSCompliant(False)> _ Public Class ScriptMain Inherits UserComponent Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer) ' Replace the value of the message column Row.Message = RemoveHtml(Row.Message) End Sub ' A string method that removes html tags ' with a regex pattern Public Function RemoveHtml(message As [String]) As [String] ' The pattern for a html tag Dim htmlTagPattern As [String] = "<(.|" & vbLf & ")+?>" ' Create a regex object with the pattern Dim objRegExp As New Regex(htmlTagPattern) ' Replace html tag by an empty string message = objRegExp.Replace(message, [String].Empty) ' Return the message without html tags Return message End Function End Class
Note: this script uses a very simple regular expression. Search the web for other/better regular expressions.
5) The Result
For testing purposes I added a Derived Columns and two data viewes on the Data Flow Paths.
SSIS 2008 Package example
On replicating the steps you mentioned, i am geting blank value in the column.
ReplyDeleteHowever, i check the input value, it is not blank in the column. I even created a new output column, it showed the value in it after regex operation.
I have made the Read/Write too on the column.
Can you please help why it is making it blank?
Hi Anshul Agarwal,
DeleteIt's hard to see where it went wrong, but I will publish an example package tomorrow.
Kind regards,
Joost
Hi Joost,
ReplyDeleteI tried to replicate your example but i got some issues. I'm not able to build the script and second issues is my source is in DT_NTEXT need to convert to DT_WSTR and i tried using derived column and conversion to make it work the size is 8000 for the source column in Sql Server table.What would be the best procedure. I'm new to C# or .Net could you give me some example on that please ..
There is a package example at the bottum of the post. If you got errors then please mention then in your message of contact me via the form on the contact me page (upper right)
Delete