
Reiss McSporran

Azure Synapse Series: Hash Distribution and Shuffle

For this post I’m going to presume you’ve already taken a look at distributing your data using a hash column, and you’re not experiencing the performance you’re expecting. (If you’re not already aware of what this is, take a look at the following link to learn the basics of what a distributed table is and why you need it in Azure Synapse. I’ll be here when you get back.)

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

So now that we’re on the same page: one of the most common issues encountered is understanding what Hash Distribution actually means and how it functions. In basic terms, the column you choose to distribute by is run through a deterministic hash function, which produces the same output for any identical input values. Rows whose hash column(s) match are therefore placed on the same compute node. Because matching data is already co-located, it doesn’t need to be passed between Compute Nodes during a join, which massively increases query performance.
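For illustration, creating two tables that are hash distributed on the same join column might look like this (a minimal sketch; the table and column names are hypothetical):

CREATE TABLE dbo.FactSales
(
    CustomerKey varchar(50) NOT NULL,
    SaleAmount  decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (CustomerKey), -- rows with the same CustomerKey land in the same distribution
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.DimCustomer
(
    CustomerKey  varchar(50) NOT NULL,
    CustomerName varchar(100) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (CustomerKey), -- same key and same data type as the fact table
    CLUSTERED COLUMNSTORE INDEX
);

Because both tables are distributed on CustomerKey with the same data type, a join between them on that column should be able to run entirely within each distribution.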

Great! So, using the knowledge above, I’ve created two tables with hash distribution, and the column I’m hashing by holds the same values in each table. Let’s run an estimated execution plan to see how the joins work!

Shuffle 100%?

Oh dear. Shuffle. On the surface, shuffle is not necessarily a bad thing and may even be expected, but not in this scenario. So what is shuffle and why can it be bad?

Shuffle occurs when part of a distributed table has to be moved to a different node during query execution. To do this, a hash value is computed from the join columns, the node that owns that hash value is located, and the row is sent to that node for processing. If the tables you’re joining have different hash distributions you’d expect this to happen (hence why it’s not always a bad thing!), but in this case I’ve used the same values to hash both of my tables, so what’s gone wrong?

Two words: Data Types

In my query below I have created two variables of the same data type (varchar(50)), given them the same text value (‘Test Hash Text’), and then run the SHA2_256 hashing function over them to see the results. Running this query against your own database will give you the same result every time when using the same hashing function.

-- Two variables of the same data type and value produce the same hash
declare @varchar1 varchar(50), @varchar2 varchar(50)
set @varchar1 = 'Test Hash Text'
set @varchar2 = 'Test Hash Text'

select @varchar1 as [Value], hashbytes('SHA2_256', @varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes('SHA2_256', @varchar2) as [Hash]

As you can see here, the resulting Hash is the same for both text values.

Matching Hash = Good

But what if I change the data type? Here I’ve added a third variable with an nvarchar(50) data type instead, but with the same text value.

-- The nvarchar variable holds the same text, but hashes differently
declare @varchar1 varchar(50), @varchar2 varchar(50), @nvarchar nvarchar(50)
set @varchar1 = 'Test Hash Text'
set @varchar2 = 'Test Hash Text'
set @nvarchar = 'Test Hash Text'

select @varchar1 as [Value], hashbytes('SHA2_256', @varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes('SHA2_256', @varchar2) as [Hash]
union all
select @nvarchar as [Value], hashbytes('SHA2_256', @nvarchar) as [Hash]

As you can see here, the resulting Hash is different for the nvarchar value even though the text is the same. This is because nvarchar stores Unicode text at two bytes per character, so the underlying bytes being hashed are different.

Different Hash = Bad

So what does this mean for our Synapse Distribution? Even if you’re distributing your tables on the same values, a data type mismatch means you’ll end up with a different Hash and, almost guaranteed, a different Node for your data. (Because there are only 60 distributions, some rows may land in the same place by chance, but the majority of your data won’t.)

To fix this, create a new column in your Synapse table with the data type you want to use consistently across every table that shares this column, and Hash Distribute by that new column. The easiest way to do this is the Create Table As Select (CTAS) statement, which creates a new table containing all of the data with the new data type. Add the new column in the SELECT statement with a CAST/CONVERT to the correct data type and use that for Hash Distribution. Additional reading on CTAS is available here:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse
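As a sketch of that fix, assuming a hypothetical fact table whose key is currently nvarchar(50) and needs to become varchar(50) to match the other table:

CREATE TABLE dbo.FactSales_New
WITH
(
    DISTRIBUTION = HASH (CustomerKey), -- distribute on the newly converted column
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT
    CAST(CustomerKey AS varchar(50)) AS CustomerKey, -- align the data type across tables
    SaleAmount
FROM dbo.FactSales;

-- Swap the new table in place of the old one
RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;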

So, now that I’ve put this in place, what does my execution plan look like?

No Shuffle!

That’s much better! No shuffle where there shouldn’t be one, and queries against this large data set now take less than 5% of the time they took before.
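If you want to verify this at runtime rather than from the estimated plan, you can look for data movement steps in the DMVs; a sketch (filter this down to the request ID of the query you’re investigating):

-- Shuffle steps recorded for recent requests
SELECT request_id, step_index, operation_type, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE operation_type = 'ShuffleMoveOperation'
ORDER BY total_elapsed_time DESC;

If the fix has worked, the joins in question should no longer produce ShuffleMoveOperation steps.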

Any questions? Let me know in the comments below or contact me on Twitter: http://www.twitter.com/PurpleFrogReiss

ADF: Where have my connections gone?

If you’ve been doing any development work in ADF this week you might have noticed that “Connections” has moved. But where has it gone?

The old location

When you click onto “Connections” now you’ll receive the following message:

Management Hub?

Clicking this button takes you to the new Management Hub, a new fourth icon that sits alongside the existing three.

New Manage (Management Hub)

Within this area you now have access to your Linked Services, Integration Runtimes and Triggers. These all have the same options as before; they’ve just moved!

Moved options and New options

You do now have access to more detailed Git configuration options, as well as the ability to change the Parameterization templates used for your ARM templates. The latter is only accessible once you have a Git repository set up, with the Parameter file inside it.

You can currently still access Triggers from the main Author window, but I would recommend getting used to finding them in the Management Hub to keep everything together.

Release details of the Management Hub from Microsoft can be found here: https://docs.microsoft.com/en-us/azure/data-factory/author-management-hub

Azure Data Factory (ADF) Parent Activities don’t trigger ON ERROR OUTPUT when Child Activities Error… What happens?

So, that title is a big question a client of ours came across, and it required some additional investigation… What does happen, and how do we get around it?

In the below example we’ll use a simple IF Activity as our Parent Activity. To explain briefly how it functions: you provide it with a query that evaluates to either True or False, which passes control down to one set of Activities if True and a different set of Activities if False. We then have three Outputs: On Success, On Failure and On Completion. This looks like:

(Image 1) Pipeline with If Activity and three outputs

The IF Activity therefore does one of three things, depending on how the query evaluates:

  1. If the query evaluates correctly, then:
    • If True, pass to the True Activities
    • If False, pass to the False Activities
  2. Error, if the query doesn’t evaluate to True or False.

Following that, we have the below options:

  • When the If True Activities (1) complete with Success, trigger the On Success Output and On Completion Output
  • When the If False Activities (1) complete with Success, trigger the On Success Output and On Completion Output
  • When the evaluation (2) completes with Error, trigger the On Error Output and On Completion Output

The issue occurs with the below options:
What happens when the If True Activities complete with Error?
What happens when the If False Activities complete with Error?

Logically you would expect that to cause the IF Activity to error as well. Right? WRONG!

Although it shows as if it has an error in the pipeline view:

(Image 2) If Condition Error, without passing down On Success or On Failure?

The details show “Activity failed because an inner activity failed.”:

(Image 3) If Condition Error Message

This means that the IF Activity itself has still started down its success pathway and is now just surfacing the error from further down. This causes the IF Activity to complete with neither Success nor Error! (You can see this in Image 2 above.) You can also see that the On Completion Output has triggered successfully.

This means that the Error Output of the If Activity can only be used to handle errors in the evaluation of the If query itself (when it can’t resolve to True or False, or can’t execute for another reason).

If you want to monitor the steps inside the IF Activity, you’ll need to put those output pathways inside the True and/or False activities themselves and handle this at a deeper level.

I have launched a new Twitter account, http://www.twitter.com/PurpleFrogReiss, so please follow me there for updates whenever I post a new blog!

System.OutOfMemoryException when deploying an SSIS .ispac

If you’ve tried deploying a .ispac using the Integration Services Deployment Wizard, you may have come across a System.OutOfMemoryException error:

By default SQL Server launches the 32-bit version of the Deployment Wizard, which is why you get the Out of Memory error. If you instead launch the Wizard from the 64-bit location (the default is C:\Program Files\Microsoft SQL Server\130\DTS\Binn\ISDeploymentWizard.exe), you should be able to deploy with no problem. (Please note, the “130” part of the above path may differ depending on the version of SQL Server you have installed; the wizard may reside in one of the other numbered folders in that location.)

Using the same server and same project, the 32-bit deployment took over an hour and then failed with the error message, while the 64-bit took under a minute and deployed without issue!
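As an aside, if you’d rather avoid the wizard entirely, the SSIS catalog can also deploy a project through T-SQL (a sketch; the file path, folder and project names are hypothetical, and the folder must already exist in SSISDB):

-- Run against the SSISDB database
DECLARE @ProjectBinary varbinary(max);

-- Read the .ispac file in as a single binary blob
SELECT @ProjectBinary = BulkColumn
FROM OPENROWSET(BULK 'C:\Deploy\MyProject.ispac', SINGLE_BLOB) AS ispac;

DECLARE @OperationId bigint;
EXEC catalog.deploy_project
    @folder_name = N'MyFolder',
    @project_name = N'MyProject',
    @project_stream = @ProjectBinary,
    @operation_id = @OperationId OUTPUT;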
