Sunday, 29 October 2017

Calculating Hash values in SSIS

Case
I want to calculate a hash value for a couple of columns in SSIS. In T-SQL you can use HASHBYTES, but that doesn't work for other sources like flat files and for SQL 2012/2014 the input is limited to only 8000 bytes. Is there an alternative for HASHBYTES?
Calculating a hash value over multiple columns















Solution
There are several alternatives for the T-SQL HASHBYTES. First of all there are various custom components available for SSIS like the SSIS Multiple Hash on codeplex, but if you don't want to (or cannot) use custom components, you can accomplish the same result with a little .NET scripting. If you really want to stick to T-SQL, then you can also first stage your files in a table and then calculate the hash with T-SQL afterwards. This blog will show you the scripting solution.

But first, why do you need a hash? When you want to keep track of history with a Persistent stage, Data Vault or Data Warehouse you want to know whether the record from the stage layer is different then the one you have in your historical data layer. You could check each column one by one, but when you have a whole bunch of columns that could be a lot of work and a bit slow.

A hash in ETL is used to generate a single, corresponding (but shorter) value for a whole bunch of columns. It is stored in the stage table as a new column. If one character changes in one of those columns then the hash value will also be different. When comparing the two records (one from the stage layer and one from the historical layer) you now only have to compare the hash value. If it did not change you know you don't have to process the record in your historical layer. Since you only want to calculate the hash once (in the stage package) you will also store it in the historical layer.

Now it is time to explain the scripting solution

1) Starting point
The starting point of this example is a Data Flow Task with a Flat File source component.
Flat File Source












2) Script Component - Input Columns
Add a new Script Component (transformation) to the Data Flow Task. Give it a suitable name and connect it to your flow. Then edit it and select all columns you want to hash on the Input Columns pane. Since we are not changing the existing column you can keep the default Usage Type 'ReadOnly'.
Script Component Input Columns

















Which columns do you want to hash? Three most chosen options:
  1. If you do not know the Primary Key: select all columns to calculate the hash.
  2. If you do know the Primary Key: select all columns except the Primary Key to calculate the hash.
  3. If the Primary Key consists of multiple columns you could even calculate a separate hash for the key columns only.
3) Script Component - Output Column
We need to store the calculated hash in a new column. Go to the Inputs and Outputs pane and add a new column in Output 0. The data type is string and the size depends on which hash algoritme you want to use. For this example we use the MD5 algoritme which returns a 128 bits hash. When you convert that to an ASCII string it would be a 32 character string (that only contains hexadecimal digits).
Script Component Inputs and Outputs























4) Script Component - The script preparation
Now we are almost ready to add the actual script. Go to the Script pane. Select your scripting language. This example will be in C#. Then hit the Edit Script... button to start the Vsta environment. This is a new instance of Visual Studio and will take a few moments to start.
Edit Script...























Optional:
I always start by removing all unnecessary methods and comments to keep the code clean. For this example we do not need the PreExecute and PostExecute methods and I do not want to keep the default help comments.
Clean up before start



















5) Script Component - The code
First we need to add two extra usings to shorten the code. Unfold the Namespaces region at the top and add the following usings:
using System.Security.Cryptography;
using System.Text;

Then Locate the method called Input0_ProcessInputRow and add a new GetMd5Hash method below this existing method (below its closing }). The new method is copied from this MSDN page. I only changed the encoding to Unicode (see note 1):
static string GetMd5Hash(MD5 md5Hash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = md5Hash.ComputeHash(Encoding.Unicode.GetBytes(input));

    // Create a new Stringbuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}


And at last change the code of the existing method Input0_ProcessInputRow to:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string Separator = "|";
    string RowData = "";

    using (MD5 md5Hash = MD5.Create())
    {
        Row.Hash = GetMd5Hash(md5Hash, RowData = (Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator + (Row.FirstName_IsNull ? "" : Row.FirstName.ToString()) + Separator + (Row.MiddleName_IsNull ? "" : Row.MiddleName.ToString()) + Separator + (Row.LastName_IsNull ? "" : Row.LastName.ToString()));
    }
}


The code above first concatenates all columns with a separator between them (see note 2) and it checks whether the value isn't NULL because we cannot add NULL to a string (see note 3). You will see that it repeats this piece of code for each column before calling the hash method:
(Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator
For the first record in our example it will hash the following text: Mr.|Syed|E|Abbas
And for the third row that contains a null value it will hash the this text: Ms.|Kim||Abercrombie

6) Testing the code
After closing the Vsta editor and clicking OK in the Script Component to close it, add a dummy Derived Column behind it and add a Data Viewer to see the result.
Similar result to T-SQL HASHBYTES


















Note 1:
When you want the exact same result as with T-SQL HASHBYTES then you have to make sure you use the same encoding. Otherwise you get a different hash. In the method GetMd5Hash on the first line of code you see Encoding.Unicode.GetBytes(. There are more options besides Unicode. For example: ASCII, UTF7, UTF8, UTF32, etc. etc. However, as long as you don't have to compare hashes generated by to different methods (T-SQL and .Net) it doesn't matter. In this stackoverflow post you find more examples.

Note 2:
The column separator is added to prevent unwanted matches. If you have these two records with two columns:
Column1 Column2
123 456
12 3456
Without the separator these two will both get concatenated to 123456 and therefor generate the same hash. With the separator you will have two different values to hash: 123|456 and 12|3456. Choose your separator wisely. The number 3 would not be a wise choice in this case.

Note 3:
In the code you see that the columns are checked for null values because you cannot add null to a string. The null values are replace with an empty string. However this shows a bit of an imperfection of this method, because a string with a null value isn't the same as an empty string. To overcome this you could use a different string that is likely to occur in your text. For Numeric and Date data types you could just add an empty string, something like:
(Row.MyNumberColumn_IsNull ? "" : Row.MyNumberColumn.ToString()) + Separator
(Row.MyDateColumn_IsNull ? "" : Row.MyDateColumn.ToString()) + Separator


Note 4:
md5 only uses 128 bits and there are better, saver (, but also a bit slower) methods to calculate hashes:
SHA and SHA1 - 160 bits
SHA2_256 - 256 bits
SHA2_512 - 512 bits

Safer? As long as you don't use it to hash passwords you are still OK with md5.
Better? In rare cases two different strings could return the same md5 hash, but you have a higher chance to win the galaxy lottery.

Rather use SHA2_512? Just use this code instead:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string Separator = "|";
    string RowData = "";

    using (SHA512 shaHash = new SHA512Managed())
    {
        Row.hash2 = GetShaHash(shaHash, RowData = (Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator + (Row.FirstName_IsNull ? "" : Row.FirstName.ToString()) + Separator + (Row.MiddleName_IsNull ? "" : Row.MiddleName.ToString()) + Separator + (Row.LastName_IsNull ? "" : Row.LastName.ToString()));
    }
}

static string GetShaHash(SHA512 shaHash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = shaHash.ComputeHash(Encoding.Unicode.GetBytes(input));

    // Create a new Stringbuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}

Too much columns => too much coding?
In my book Extending SSIS with .NET Scripting you will find a script component example that just loops through all columns to generates the hash. No money to buy it? I used this code as the base for that script.
An other alternative is to generate the Script Component and its code with BIML. Here is an example of a Script Component in BIML, but getting the hash to work is a bit of a challenge but doable.