I want to use a Regular Expression to filter incorrect emailaddresses in SSIS.
Solution
There are a couple of options:
- Microsoft built a SSIS Sample Component for regex, but it's only available for 2005.
- Konesans built a RegexClean Transformation and a Regular Expression Transformation.
- There is also an open source project named RegExtractor SSIS Component.
- And you can even do it with Regular Expressions in T-SQL.
- But you can also do it your self with a Script Component Transformation.
"Name";"emailaddress" "Joost";"joost@blogger.com" "John";"john@.nl" "Max";"max@gmail.com" "William";"william@hotmail,com" "Bill";"b.gates@microsoft.com" |
1) Add source
Add a Flat File Source Component in your dataflow for the file example above.
Flat File mailinglist |
2) Add Script Component
Add a Script Component, type transformation, after your Flat File Source Component.
Script Component (transformation) |
3) Input Columns
Edit the Script Component and select the email address column as an input column (readonly).
Input Columns tab |
4) Outputs
Change the name of the default output port to CorrectEmail and set the ExclusionGroup to 1. Add an extra output port for the IncorrectEmail and set the ExclusionGroup to 1. Also select the input port as the SynchronousInputID. Now you have two output ports. One for the correct and one for the incorrect email address.
Inputs and Outputs tab |
4) The Script
Hit the Edit Script button to open the VSTA editor. SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest of the methods. We are also adding a new method that validates the email address. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.
// C# code using System; using System.Data; using System.Text.RegularExpressions; // Added using Microsoft.SqlServer.Dts.Pipeline.Wrapper; using Microsoft.SqlServer.Dts.Runtime.Wrapper; [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute] public class ScriptMain : UserComponent { // Method that will be executed for each row. public override void Input0_ProcessInputRow(Input0Buffer Row) { // Validate the email address if (IsCorrectEmail(Row.Email)) { // Redirect the correct email address Row.DirectRowToCorrectEmail(); } else { // Redirect the incorrect email address Row.DirectRowToInCorrectEmail(); } } // A boolean method that validates an email address // with a regex pattern. public bool IsCorrectEmail(String emailAddress) { // The pattern for email string emailAddressPattern = @"^(([^<>()[\]\\.,;:\s@\""]+" + @"(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@" + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+" + @"[a-zA-Z]{2,}))$"; // Create a regex object with the pattern Regex emailAddressRegex = new Regex(emailAddressPattern); // Check if it is match and return that value (boolean) return emailAddressRegex.IsMatch(emailAddress); } }
' VB.Net code (SSIS 2008) Imports System Imports System.Data Imports System.Math Imports System.Text.RegularExpressions ' Added Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper_ _ Public Class ScriptMain Inherits UserComponent ' Method that will be executed for each row. Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer) ' Validate the email address If (IsCorrectEmail(Row.Email)) Then ' Redirect the correct email address Row.DirectRowToCorrectEmail() Else ' Redirect the incorrect email address Row.DirectRowToIncorrectEmail() End If End Sub ' A boolean method that validates an email address ' with a regex pattern. Public Function IsCorrectEmail(ByVal emailAddress As String) As Boolean ' The pattern for email Dim emailAddressPattern As String = "^(([^<>()[\]\\.,;:\s@\""]+" + _ "(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@" + _ "((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" + _ "\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+" + _ "[a-zA-Z]{2,}))$" ' Create a regex object with the pattern Dim emailAddressRegex As Regex = New Regex(emailAddressPattern) ' Check if it is match and return that value (boolean) Return emailAddressRegex.IsMatch(emailAddress) End Function End Class
5) The result
For testing purposes I added two Derived Columns with data viewes on the Data Flow Paths.
The result |
Jamie Thomson, alias SSIS Junkie, has an other example in VB.net available. And you can also use Regular Expression to clean column values. More about that is described in this article.
Keep in mind I'm not a C# programmer, but your code doesn't work for me. I get too many errors when placing in the code of a script task. Obviously this is most likely because you ommited to explain how this is accomplished in your example. I would like to see you put a little more step by step information instead of assuming that all of us reading your blog know what you know.
ReplyDeleteThanks
Joe
@Kevin Ward: Thanks for the feedback. Try the example of SSIS MVP Jamie Thomson. May be you find his example easier.
ReplyDeleteAnd an even better option for non-programmers is to use one of the third party/open source components I mentioned. Then you don't need my difficult coding.
If you do want to use my coding. Let me know what errors you got or where you lost me.
Worked great, thank you!
ReplyDeletecan you plz tell me where to place c# code???
ReplyDeleteDouble click the script component and hit the Edit Script-button on the first page.
DeleteHi Joost
ReplyDeleteThanks for the script. I created the script, it runs ok, but I do not see my data after the scripting step. I am trying to validate a 5 digits US zip code (12345), I used "^\\d{5}$" as the zipCode pattern.
Best Regards
For my previous question, I got it to work; the problem is that I forgot to save the script after I ran it, believing that it will be saved automatically.
ReplyDeleteI followed the tutorial and got my package to work.
Thank you.
Code in the .cs or .vb file is automatically saved when you close the vsta editor. Only changes in the internal script component project have to be saved by pressing Save All. For example when you add a reference to a dll.
Delete