Case
My CSV file has some rows where not all columns at the end of the row are filled:
"test1";"abc";"xyz";"123"
"test2";"cba";"zyx";"321"
"test3";"abc"
"test4";"efg";"zyx"
"test5";"cba";"zyx";"321"
If you use a flat file connection manager, you will get strange things:
Some rows will 'merge' |
And if the last row is incomplete you will get:
Is there a workaround for this poorly formatted file?
Solution
You can use a Script Component as a source to solve this.
* UPDATE: This has been solved in SQL Server 2012 RC0 *
1) Script Component
Go to your Data Flow and add a Script Component. The Type should be Source.
Script Type Source |
2) Add Columns
Edit the Script Component and go to the Inputs and Outputs tab. Add four string columns at the output section (this example only uses strings).
Add output columns |
3) Connection Manager
Go to the Connection Managers tab and add a connection (1) to avoid a hardcoded connection string in your Script. The Connection Manager Type (2) should be FILE and the Usage type (3) should be Existing file. After that change the default name from "Connection" to "myFile".
Add FILE connection to avoid hardcoded connection strings |
4) The Script
Go to the first tab and hit the Edit Script... button. This script is in the default C# language. SSIS will create three methods: PreExecute, PostExecute and CreateNewOutputRows. Remove the first two, because we don't need them. Replace your method with this one and add the using row at the top.
// C# code using System; using System.Data; using System.IO; // Added using Microsoft.SqlServer.Dts.Pipeline.Wrapper; using Microsoft.SqlServer.Dts.Runtime.Wrapper; [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute] public class ScriptMain : UserComponent { public override void CreateNewOutputRows() { // Read file (use the FILE connection added in the script component, named myFile) using (StreamReader sr = new StreamReader(this.Connections.myFile.ConnectionString, System.Text.Encoding.UTF7)) // Encoding is optional { String line; // Read lines from the file until the end of the file is reached. while ((line = sr.ReadLine()) != null) { // Split the line into columns string[] columns = line.Split(';'); // Add one new row this.Output0Buffer.AddRow(); // Fill columns, but check if they exist if (columns.Length > 0) { // Remove the " at the start and end of the string // with a trim or use a substring. Output0Buffer.MyFirstColumn = columns[0].TrimStart('"').TrimEnd('"'); } if (columns.Length > 1) { Output0Buffer.MySecondColumn = columns[1].TrimStart('"').TrimEnd('"'); } if (columns.Length > 2) { Output0Buffer.MyThirdColumn = columns[2].TrimStart('"').TrimEnd('"'); } if (columns.Length > 3) { Output0Buffer.MyFourthColumn = columns[3].TrimStart('"').TrimEnd('"'); } } } } }or VB.Net
// C# code ' VB.Net code Imports System Imports System.Data Imports System.IO 'Added Imports System.Math Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _ <CLSCompliant(False)> _ Public Class ScriptMain Inherits UserComponent Public Overrides Sub CreateNewOutputRows() ' Read file (use the FILE connection added in the script component, named myFile) Using sr As New StreamReader(Me.Connections.myFile.ConnectionString, System.Text.Encoding.UTF7) ' Encoding is optional ' Create variable to store a line of text Dim line As String = sr.ReadLine ' Read lines from the file until the end of the file is reached. Do While (Not line Is Nothing) ' Split the line into columns Dim columns As String() = line.Split(";") ' Add one new row Me.Output0Buffer.AddRow() ' Fill columns, but check if they exist If columns.Length > 0 Then ' Remove the " at the start and end of the string ' with a trim or use a substring. Output0Buffer.MyFirstColumn = columns(0).TrimStart("""").TrimEnd("""") End If If columns.Length > 1 Then Output0Buffer.MySecondColumn = columns(1).TrimStart("""").TrimEnd("""") End If If columns.Length > 2 Then Output0Buffer.MyThirdColumn = columns(2).TrimStart("""").TrimEnd("""") End If If columns.Length > 3 Then Output0Buffer.MyFourthColumn = columns(3).TrimStart("""").TrimEnd("""") End If ' Goto next line line = sr.ReadLine Loop End Using End Sub End Class
5) The result
Add a random transformation component after your source and add a dataviewer between them to see the result.
Notice the NULL values |
Note: You should add some error handling to your script (Try Catch) to avoid unexpected errors (no file, other delimiter, empty rows, etc.).
Note: If you are using this in a foreach loop, you don't need the connection manager. The variable from the foreach loop already contains the filepath.
Good post. I think it's worth mentioning in the article that the code will produce unexpected (and probably unwanted) results when a field value contains the separator character. Perhaps you have some code for this that you can share :)
ReplyDelete@Anonymous: Thanks. Sorry I don't have code for that, this is just a very basic example which people can extend for there own situations. Embedded qualifiers and separators are quite difficult, but there is also good news: SQL 2012 has solved this problem!
ReplyDeleteHi There
ReplyDeleteI am getting this error when I using your code. I am pretty new to C#.
'Connections' does not contain a definition for 'myFile' and no extension method 'myFile' accepting a first argument of type 'Connections' could be found (are you missing a using directive or an assembly reference?)
PS : I used same myFile to connect to a CSV file
Your help is appreciated
Thanks
Make sure the Connection Manager type of myFile is FILE.
DeleteTo check whether it is known in your script type "this.Connections." and see which connection managers are available via intellisense.
In the Script Transformation Editor window and in the Connection Managers pane rename your Connection managers to "myFile" the default name is "connection", Intellisense will not able to read "myFile" instead "connection" if you do not renamed it . If you do not want to rename it just use "this.Connections.Connection"
Delete