0845 643 64 63

Monthly Archives: March 2021

Azure Synapse Series: Hash Distribution and Shuffle

For this post I’m going to presume you’ve already taken a look at distributing your data using a hash column, and you’re not experiencing the performance you’re expecting. (If you’re not already aware of what this is, take a look at the following link to learn the basics of what a distributed table is and why you need it in Azure Synapse. I’ll be here when you get back.)

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

So now that we’re on the same page, one of the common issues encountered is trying to understand what Hash Distribution actually means and how it functions? In basic terms the column you choose to distribute by gets converted into a hash using a deterministic hash function, which creates the same value for any identical values passed. This places different rows of data on the same compute node, where the column(s) you have used to hash by match. This increases query performance and stops data being passed between Compute Nodes, massively increasing the performance of queries.

Great! So using the knowledge above I’ve created two tables, using a hash distribution and the data that I’m hashing by is the same value in each table. Let’s run an estimated execution plan to see how the joins work!

Shuffle 100%?

Oh dear. Shuffle. On the surface, shuffle is not necessarily a bad thing and may even be expected, but not in this scenario. So what is shuffle and why can it be bad?

Shuffle occurs when a part of a distributed table is moved to a different node during query execution. To do this a hash value is computed using the join columns, the node is then found that has that hash value and the row is then sent to that node for processing. If the tables you’re joining on have different hash distributions, you’d expect this to happen, (hence why it’s not always a bad thing!), but in this case I’ve used the same value to hash both of my tables, so what’s gone wrong?

Two words: Data Types

In my query below I have created two variables of the same data type (varchar(50)), given them the same text value, (Test Hash Text), and then run the hashing function of SHA2_256 over them to see the results. Running this query against your database will provide you with the same result every time when using the same hashing function.

declare @varchar1 varchar(50), @varchar2 varchar(50)
set @varchar1 = ‘Test Hash Text’
set @varchar2 = ‘Test Hash Text’

select @varchar1 as [Value], hashbytes(‘SHA2_256’,@varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes(‘SHA2_256’,@varchar2) as [Hash]

As you can see here, the resulting Hash is the same for both text values.

Matching Hash = Good

But what if I change the data type? Here I’ve added a third variable with a nvarchar(50) data type instead, but with the same text value.

declare @varchar1 varchar(50), @varchar2 varchar(50), @nvarchar nvarchar(50)
set @varchar1 = ‘Test Hash Text’
set @varchar2 = ‘Test Hash Text’
set @nvarchar = ‘Test Hash Text’

select @varchar1 as [Value], hashbytes(‘SHA2_256’,@varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes(‘SHA2_256’,@varchar2) as [Hash]
union all
select @nvarchar as [Value], hashbytes(‘SHA2_256’,@nvarchar) as [Hash]

As you can see here, the resulting Hash is different for the different data type even though the value is the same.

Different Hash = Bad

So what does this mean for our Synapse Distribution? Even if you’re distributing your tables on the same values, if you have a data type mismatch, you’re going to end up with a different Hash and an almost guaranteed different Node for your data. (Due to randomness, you may end up on the same Node by chance as there are a limited number of 60 Nodes, but not for the majority of your data).

To fix this, create a new computed column in your table in Synapse that has the same data type that you want to use across all tables using this same column, and Hash Distribute by that new column. The easiest way to do this is using the Create Table as Select (CTAS) command to create the new table with all of the data and a new data type. Add the new column in the select statement with a CAST/CONVERT to correct data type and use that for Hash Distribution. Additional reading on CTAS available here:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse

So I’ve now put this in place and what does my execution plan look like?

No Shuffle!

That’s much better! No shuffle where there shouldn’t be one and query times against this large amount of data take <5% of the time they took before.

Any questions? Let me know in the comments below or contact me on twitter: http://www.twitter.com/PurpleFrogReiss

Power BI Sentinel
The Frog Blog

Team Purple Frog specialise in designing and implementing Microsoft Data Analytics solutions, including Data Warehouses, Cubes, SQL Server, SSIS, ADF, SSAS, Power BI, MDX, DAX, Machine Learning and more.

This is a collection of thoughts, ramblings and ideas that we think would be useful to share.

Authors:

Alex Whittles
(MVP)
Reiss McSporran
Jeet Kainth
Jon Fletcher
Nick Edwards
Joe Billingham

Data Platform MVP

Power BI Sentinel
Frog Blog Out
twitter
rssicon