Thursday, 10 February 2011

Script Component as source 1

Case
My CSV file has some rows where not all columns at the end of the row are filled:

"test1";"abc";"xyz";"123"
"test2";"cba";"zyx";"321"
"test3";"abc"
"test4";"efg";"zyx"
"test5";"cba";"zyx";"321"

If you use a flat file connection manager, you will get strange things:
Some rows will 'merge'




















And if the last row is incomplete you will get:








Is there a workaround for this poorly formatted file?

Solution
You can use a Script Component as a source to solve this.
* UPDATE: This has been solved in SQL Server 2012 RC0 *

1) Script Component
Go to your Data Flow and add a Script Component. The Type should be Source.
Script Type Source




















2) Add Columns
Edit the Script Component and go to the Inputs and Outputs tab. Add four string columns at the output section (this example only uses strings).
Add output columns



















3) Connection Manager
Go to the Connection Managers tab and add a connection (1) to avoid a hardcoded connection string in your Script. The Connection Manager Type (2) should be FILE and the Usage type (3) should be Existing file. After that change the default name from "Connection" to "myFile".
Add FILE connection to avoid hardcoded connection strings


























4) The Script
Go to the first tab and hit the Edit Script... button. This script is in the default C# language. SSIS will create three methods: PreExecute, PostExecute and CreateNewOutputRows. Remove the first two, because we don't need them. Replace your method with this one and add the using row at the top.
// C# code
using System;
using System.Data;
using System.IO;    // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void CreateNewOutputRows()
    {
        // Read file (use the FILE connection added in the script component, named myFile)
        using (StreamReader sr = new StreamReader(this.Connections.myFile.ConnectionString, System.Text.Encoding.UTF7)) // Encoding is optional
        {
            String line;
            // Read lines from the file until the end of the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                // Split the line into columns
                string[] columns = line.Split(';');

                // Add one new row
                this.Output0Buffer.AddRow();

                // Fill columns, but check if they exist
                if (columns.Length > 0)
                {
                    // Remove the " at the start and end of the string
                    // with a trim or use a substring.
                    Output0Buffer.MyFirstColumn = columns[0].TrimStart('"').TrimEnd('"');
                }
                if (columns.Length > 1)
                {
                    Output0Buffer.MySecondColumn = columns[1].TrimStart('"').TrimEnd('"');
                }
                if (columns.Length > 2)
                {
                    Output0Buffer.MyThirdColumn = columns[2].TrimStart('"').TrimEnd('"');
                }
                if (columns.Length > 3)
                {
                    Output0Buffer.MyFourthColumn = columns[3].TrimStart('"').TrimEnd('"');
                }
            }
        }
    }
}
or VB.Net
// C# code
' VB.Net code
Imports System
Imports System.Data
Imports System.IO  'Added
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub CreateNewOutputRows()
        ' Read file (use the FILE connection added in the script component, named myFile)
        Using sr As New StreamReader(Me.Connections.myFile.ConnectionString, System.Text.Encoding.UTF7) ' Encoding is optional

            ' Create variable to store a line of text
            Dim line As String = sr.ReadLine

            ' Read lines from the file until the end of the file is reached.
            Do While (Not line Is Nothing)

                ' Split the line into columns
                Dim columns As String() = line.Split(";")

                ' Add one new row
                Me.Output0Buffer.AddRow()

                ' Fill columns, but check if they exist
                If columns.Length > 0 Then
                    ' Remove the " at the start and end of the string
                    ' with a trim or use a substring.
                    Output0Buffer.MyFirstColumn = columns(0).TrimStart("""").TrimEnd("""")
                End If
                If columns.Length > 1 Then
                    Output0Buffer.MySecondColumn = columns(1).TrimStart("""").TrimEnd("""")
                End If
                If columns.Length > 2 Then
                    Output0Buffer.MyThirdColumn = columns(2).TrimStart("""").TrimEnd("""")
                End If
                If columns.Length > 3 Then
                    Output0Buffer.MyFourthColumn = columns(3).TrimStart("""").TrimEnd("""")
                End If

                ' Goto next line
                line = sr.ReadLine
            Loop
        End Using
    End Sub
End Class

5) The result
Add a random transformation component after your source and add a dataviewer between them to see the result.
Notice the NULL values

















Note: You should add some error handling to your script (Try Catch) to avoid unexpected errors (no file, other delimiter, empty rows, etc.).

Note: If you are using this in a foreach loop, you don't need the connection manager. The variable from the foreach loop already contains the filepath.

5 comments:

  1. Good post. I think it's worth mentioning in the article that the code will produce unexpected (and probably unwanted) results when a field value contains the separator character. Perhaps you have some code for this that you can share :)

    ReplyDelete
  2. @Anonymous: Thanks. Sorry I don't have code for that, this is just a very basic example which people can extend for there own situations. Embedded qualifiers and separators are quite difficult, but there is also good news: SQL 2012 has solved this problem!

    ReplyDelete
  3. Hi There
    I am getting this error when I using your code. I am pretty new to C#.

    'Connections' does not contain a definition for 'myFile' and no extension method 'myFile' accepting a first argument of type 'Connections' could be found (are you missing a using directive or an assembly reference?)

    PS : I used same myFile to connect to a CSV file

    Your help is appreciated
    Thanks

    ReplyDelete
    Replies
    1. Make sure the Connection Manager type of myFile is FILE.

      To check whether it is known in your script type "this.Connections." and see which connection managers are available via intellisense.

      Delete
    2. In the Script Transformation Editor window and in the Connection Managers pane rename your Connection managers to "myFile" the default name is "connection", Intellisense will not able to read "myFile" instead "connection" if you do not renamed it . If you do not want to rename it just use "this.Connections.Connection"

      Delete

Please use the SSIS MSDN forum for general SSIS questions that are not about this post. I'm a regular reader of that forum and will gladly answer those questions over there.

All comments are moderated manually to prevent spam.

Related Posts Plugin for WordPress, Blogger...