Tuesday, 21 February 2012

Replace multiple spaces with one space

Case
I have a column with lots of extra spaces in the tekst which I want to remove:
"This     has  a lot of     spaces     in          it  ."

The standard REPLACE ([MyColumnWithLotsOfSpaces], "  ", " ") function, where I replace a double space by a single space, only replaces one set of double spaces but leaves the rest.
Still double spaces left













Solution A
You could use the triple replace trick to replace duplicate spaces:

1) Identify the pairs of double spaces and replace them with a space and an unlikely character. I used a pipeline char:
REPLACE([MyColumnWithLotsOfSpaces], "  ", " |")

2) Replace the inverted version of the new pairs by nothing:
REPLACE(REPLACE([MyColumnWithLotsOfSpaces], "  ", " |"), "| ", "")

3) Replace the unlikely character by nothing:
REPLACE(REPLACE(REPLACE([MyColumnWithLotsOfSpaces], "  ", " |"), "| ", ""), "|", "")

4) The Result
I added some data viewers for testing purposes:
No doubles spaces after triple replace













Solution B
Or you could use this one-line Script Component solution where you replace the double spaces by one space using a regular expression REPLACE.

1) Script Component
Add a Script Component type Transformation and connect it to your flow.
Script Component, type Transformation





















2) Add ReadWrite input column
Select the column where you want to get rid of the double spaces as ReadWrite.























3) The Script
Edit the script and copy the following code to your Input0_ProcessInputRow method to replace the double spaces:
// C# Code
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Replace double spaces with a regular expression
        Row.MyColumnWithLotsOfSpaces = System.Text.RegularExpressions.Regex.Replace(Row.MyColumnWithLotsOfSpaces, @"\s+", " "); 
    }
}
or VB.Net
' VB.Net Code
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Row.MyColumnWithLotsOfSpaces = System.Text.RegularExpressions.Regex.Replace(Row.MyColumnWithLotsOfSpaces, "\s+", " ")
    End Sub
End Class

4) The Result
For testing purposes I added some data viewers.
The result













Conclusion
The first method is easier for people who don't know .Net, but the second method is saver because you don't have to come up with the unlikely character.

1 comment:

Please use the SSIS MSDN forum for general SSIS questions that are not about this post. I'm a regular reader of that forum and will gladly answer those questions over there.

All comments are moderated manually to prevent spam.