Saturday, 7 June 2014

Is the Row Sampling Transformation fully blocking?

Case
Is the Row Sampling Transformation a (fully) blocking component or a non-blocking component? Various blogs and forums disagree on each other. Is it possible to see it in SSIS?

Solution
By adding some logging to the package you can see the Pipeline Execution Trees of your Data Flow. This will tell you whether a transformation creates a new buffer. When it creates a new buffer then the component is a-synchronously / blocking.


1) Add logging
Open your package and go to the SSIS menu and choose Logging...
SSIS Menu - Logging...
















2) SQL logging
For this example I selected SQL Server logging, pressed the add button and then selected a Connection Manager.
SQL Server logging






















Next check the Data Flow Task (1) and the enable the SQL Server Log (2) for it. After that go to the Details tab (3).
Enable logging for the Data Flow Task





















In the details tab scroll down and select PipelineExecutionTrees. Then click on the advanced button.
Logging Details





















In the advanced section make sure that the MessageText is checked. In this column you will find the execution trees text. After that click OK to close the logging window.
Check MessageText
















3) Run and check log
Run the package and then check the log. In the column Message you can find the Execution Trees.
SQL Server Logging. Check message column.











Now check these examples and see which is blocking. Multiple paths means that there is a transformation that creates new buffers. The last example is the Row Sampling:
Plain Data Flow: 1 path - No blocking

Data Flow with Sort: 2 paths - Sort is blocking

Data Flow with Percentage Sampling: 1 path - No blocking

Data Flow with Row Sampling: 3 paths - Row Sampling is blocking














































4) Partial Blocking or Fully Blocking?
Unfortunately you can't see in the log whether the task is partially blocking (like union or merge join) or fully blocking (like sort and aggregate), but you can see it in the Data Flow Task when running your package in Visual Studio (BIDS/SSDT). The Row Sampling Transformation is fully blocking because it apparently need all data before it sends data to its output. So try not to use it unnecessarily.

Row Sampling is Fully Blocking































The Row Sampling isn't just doing a TOP X, but it spreads the sampled rows over all buffers. Because you don't know the number of records or buffers that is coming you have to have all rows before you can pick randomly X records from the whole set.

An alternative (with less random rows) could be to use a Script Component that adds a row number and then use a Conditional Split to select the first X rows (and perhaps combine it with a modulo expression like: Rownumber % 3 == 0 && Rownumber <= 3000). The Conditional Split is a non-blocking component. Note: this isn't necessarily faster. Check it first for your case!

So why is the Percentage Sampling not blocking? It just takes the same percentage of each buffer and can therefore be synchronous.


Confirmed by Mister SSIS himself!

No comments:

Post a Comment

Please use the SSIS MSDN forum for general SSIS questions that are not about this post. I'm a regular reader of that forum and will gladly answer those questions over there.

All comments are moderated manually to prevent spam.