
Azure Synapse Series: What is Workload Management (part 2)

So in part 1 we found out what Workload Management was and why we wanted to use it. (If you haven’t read that part, please click here and read that post before you carry on, as we’re jumping in at the deep end!) In this part we’ll be learning how to configure it.

Continuing from the end of the previous post, you have two main options for configuring this: assign a user or role to one of the existing system-defined workload groups, or create a custom workload group and assign a user or role to that instead.

First let’s look at adding a user to an existing workload group. Choose the Workload Group you’d like to add to and click the number in the Classifiers column.

Click the Blue “1”
Then “Add classifier”
This is the complicated part!

Name – Give the Classifier a name (This can be anything)

Member – Select the user/role that you’d like this to apply to. (Run SELECT USER_NAME() against your SQL Pool to find the actual username you’re authenticated with. If you have sa access, this may be replaced with “sa”.)

Request importance – This has five levels that can be used to prioritise conflicting queries. (Think about how important it is for different roles within your environment to access data quickly, and who should go first when there’s a conflict.)

Label – When you run a query you can add OPTION (LABEL = 'Your_Label_Name') to the end of it, and that label will be applied to the query. If the label matches one of your classifiers then the query will use that classifier’s workload group. (Think about having an ‘ETL’ label or an ‘ML’ label, or something similar.)

Start and End Time – The time window that the Classifier will be active for. (You may only want to grant increased resource usage during particular times. Think about an overnight process needing a large amount of resource, compared to the same user/role only needing a reduced amount during the day due to conflicting priorities.)

Session Context – When you make a connection to your SQL Pool, rather than having to apply a label to every query, you can instead apply a Session Context that will last for the whole connection. This is a session-bound variable rather than the usual batch-bound variable. You set it by running EXEC sys.sp_set_session_context @key = 'wlm_context', @value = 'Your_Session_Name' at the start of your connection. The session name you enter will then be matched against this Classifier.
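All of the fields above also map onto the CREATE WORKLOAD CLASSIFIER T-SQL statement, so you can script a classifier rather than build it in the portal. The sketch below is a hedged example only: the classifier name, role name, label, context value, times and the table queried are all hypothetical and would need to match your own environment.

-- Hypothetical classifier: route queries from the ETLRole role into xlargerc,
-- but only when they carry the 'ETL' label or the 'etl_batch' session context, overnight (UTC).
CREATE WORKLOAD CLASSIFIER ETLOvernightClassifier
WITH
(
    WORKLOAD_GROUP = 'xlargerc',      -- the workload group chosen above
    MEMBERNAME     = 'ETLRole',       -- user or role (check with SELECT USER_NAME())
    WLM_LABEL      = 'ETL',           -- matches OPTION (LABEL = 'ETL')
    WLM_CONTEXT    = 'etl_batch',     -- matches the wlm_context session value
    START_TIME     = '22:00',
    END_TIME       = '06:00',
    IMPORTANCE     = ABOVE_NORMAL
);

-- A query that matches on the label...
SELECT COUNT(*) FROM dbo.MyLargeTable OPTION (LABEL = 'ETL');

-- ...or set the session context once and every query on that connection matches.
EXEC sys.sp_set_session_context @key = 'wlm_context', @value = 'etl_batch';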

These can all be combined in a variety of ways, but each of them has a specific weighting:

Weights:

Member (user): 64
Member (role): 32
Label: 16
Session Context: 8
Start and End Time: 4

These add together, so if you have a user in two different classifiers, one with a label (64 + 16) and the other with a context and time (64 + 8 + 4), then the one with the label would be chosen as it has the higher total, 80 to 76.

Secondly, creating your own custom Workload Group allows you to be more precise with the allocation of system resources. To create one, click the “New workload group” button. There are three suggestions and a custom option; the three suggestions apply some values as a guide, whilst the Custom option just provides default values.

Suggestions and Custom
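If you’d rather script this than click through the portal, the same thing can be done with CREATE WORKLOAD GROUP. This is a hedged sketch only; the group name and percentages are made-up values that you would tune for your own workload.

-- Hypothetical custom group: reserves 30% of system resources for this group,
-- caps it at 60%, and guarantees each request at least 10%.
CREATE WORKLOAD GROUP wgDataLoads
WITH
(
    MIN_PERCENTAGE_RESOURCE            = 30,
    CAP_PERCENTAGE_RESOURCE            = 60,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 10,
    IMPORTANCE                         = NORMAL
);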

After this is created, you add Classifiers in exactly the same way as above.

This is a very complex thing to get right, so here are my final thoughts:

Think about system usage, volume of users vs performance of queries. You can’t assign every user 100% of resources. You’ll cause waits all over the place. How many users are querying at once? How complex are the queries?

Create Custom Workload Groups and Assign Classifiers. The system-defined groups are a great starting point to get away from smallrc, but not great for flexibility.

Assign roles to workload groups with different priorities. How important is it that the report loads quickly for an end user vs a manager vs the board..?

Assign times for specific workloads, overnight ELT, large ML query etc. 100% is sometimes okay… maybe… Scale Up, use 100% resource for intensive Transformation or ML query, Scale Down for everyday usage?

Test, monitor, amend, monitor, amend, monitor, amend… Constant Development Cycle. Listen to users, find out what is going wrong, monitor the usage stats during heavy load on the server, change the configuration to improve things.
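For the monitoring part, the request DMV shows which workload group and classifier picked up each query and how much resource it was granted. A minimal sketch, assuming the workload management columns on sys.dm_pdw_exec_requests (check the column list against your own pool):

-- Recent requests with their workload group, classifier, importance and resource grant.
SELECT  request_id,
        submit_time,
        total_elapsed_time,
        [label],
        group_name,
        classifier_name,
        importance,
        resource_allocation_percentage,
        command
FROM    sys.dm_pdw_exec_requests
WHERE   submit_time > DATEADD(HOUR, -1, GETUTCDATE())
ORDER BY submit_time DESC;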

Additional reading can be done on the Microsoft website: Here

Any questions? Let me know in the comments below or contact me on twitter: @PurpleFrogReiss

Azure Synapse Series: What is Workload Management (part 1)

So you’ve started using Synapse SQL Pools, you’ve scaled up to improve performance, but your queries aren’t going as fast as you’d hoped. You take a look at resource utilisation and you see that as you scale up, your total resource utilisation per query goes down. But didn’t we scale up to allow our query to use more resource and run quicker? What’s going on here? A little thing called Workload Management.

What is Workload Management? At a very high level, Workload Management allows you to control how your workload uses system resources. The main problem here is defaults!

If you navigate to the Workload Management tab of your SQL Pool and click the “User-defined workload groups : All” button, you’ll be able to select the “System-defined workload groups” option.

This will allow you to see the pre-created groups and the relevant stats for what resources they’ll utilise.

Great! So what’s the problem? smallrc is the problem. smallrc is the default Workload Group used for all queries against your SQL Pool. If you look at the chart above, you’ll see that the minimum and maximum resources allocated per query in the Effective Values part of the table show 25.00, which is 25% of the total resources available. This means a single query will never be able to utilise the full performance of the SQL Pool. When we scale, this gets worse; the number in brackets shows the new value. For now we’ll focus on smallrc and xlargerc.

Scaled to DW500 only allows 5% of resources for smallrc queries.
Scaled to DW3000 only allows 3% of resources for smallrc queries.
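You don’t have to take these numbers from the portal either; the same settings are exposed through a DMV. A small sketch, assuming you have permission to query the system views:

-- Configured resource percentages for every workload group, system-defined and user-defined.
SELECT  name,
        importance,
        min_percentage_resource,
        cap_percentage_resource,
        request_min_resource_grant_percent,
        request_max_resource_grant_percent
FROM    sys.workload_management_workload_groups
ORDER BY name;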

To test that this is doing what we expect, I started with my SQL Pool at DW100, then scaled to DW500 and then DW3000. At each level I ran a single query (without storing query results) under the xlargerc and then the smallrc Workload Groups. The results below show the total utilised resources for each of these queries. As you scale up, you can see the disparity in performance can be quite alarming.

xlargerc then smallrc for each DW100, then DW500, then DW3000.

Each of these queries returned as large a resultset as it could over a set period of time, so the larger the returned resultset, the better the performance. smallrc increased slightly with each scaling, whereas xlargerc increased massively. The graph above and the resulting data show that scaling without changing your Workload Management settings will not give you the performance you’re hoping for.

So now we’ve proven the premise and shown the results, how do we actually get this to work? You have two main options: assign a user or role to one of the existing system-defined workload groups (which is what I did for the tests above), or create a custom workload group and assign a user or role to that instead.

Part 2 of this blog post goes into more detail on how to configure workload management and the details of the different options and is available: Here. Additional reading can be done on the Microsoft website: Here

Any questions? Let me know in the comments below or contact me on twitter: @PurpleFrogReiss

Azure Synapse Series: Hash Distribution and Shuffle

For this post I’m going to presume you’ve already taken a look at distributing your data using a hash column, and you’re not experiencing the performance you’re expecting. (If you’re not already aware of what this is, take a look at the following link to learn the basics of what a distributed table is and why you need it in Azure Synapse. I’ll be here when you get back.)

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

So now that we’re on the same page, one of the common issues encountered is understanding what Hash Distribution actually means and how it functions. In basic terms, the column you choose to distribute by gets converted into a hash using a deterministic hash function, which produces the same value for any identical values passed in. Rows whose hashed column values match are therefore placed on the same compute node, which stops data having to be moved between Compute Nodes during a join and massively improves query performance.
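To make that concrete, here’s a hypothetical pair of tables along the lines of what I’m describing, both hash distributed on the same customer column. The table names, column names and types are made up for the sketch.

-- Both tables are hash distributed on CustomerId, so rows for the same customer
-- should land on the same distribution in each table.
CREATE TABLE dbo.FactSales
(
    SalesId     bigint        NOT NULL,
    CustomerId  varchar(50)   NOT NULL,   -- distribution column
    SalesAmount decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.DimCustomer
(
    CustomerId   nvarchar(50) NOT NULL,   -- distribution column (note the data type, it will matter shortly)
    CustomerName varchar(200) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);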

Great! So using the knowledge above, I’ve created two tables using hash distribution, and the column I’m hashing by contains the same values in each table. Let’s run an estimated execution plan to see how the joins work!

Shuffle 100%?

Oh dear. Shuffle. On the surface, shuffle is not necessarily a bad thing and may even be expected, but not in this scenario. So what is shuffle and why can it be bad?

Shuffle occurs when part of a distributed table is moved to a different node during query execution. To do this a hash value is computed using the join columns, the node holding that hash value is found, and the row is sent to that node for processing. If the tables you’re joining have different hash distributions you’d expect this to happen (hence why it’s not always a bad thing!), but in this case I’ve used the same value to hash both of my tables, so what’s gone wrong?
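If you’d rather not rely on the graphical estimated plan, the EXPLAIN statement returns the distributed plan as XML, and shuffles appear in it as SHUFFLE_MOVE operations. A quick sketch using the hypothetical tables from earlier:

-- Returns the distributed execution plan as XML; look for SHUFFLE_MOVE
-- (or BROADCAST_MOVE) operations in the output.
EXPLAIN
SELECT  s.SalesId, c.CustomerName
FROM    dbo.FactSales s
JOIN    dbo.DimCustomer c ON s.CustomerId = c.CustomerId;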

Two words: Data Types

In the query below I have created two variables of the same data type (varchar(50)), given them the same text value ('Test Hash Text'), and then run the SHA2_256 hashing function over them to see the results. Running this query against your database will give the same result every time when using the same hashing function.

declare @varchar1 varchar(50), @varchar2 varchar(50)
set @varchar1 = 'Test Hash Text'
set @varchar2 = 'Test Hash Text'

select @varchar1 as [Value], hashbytes('SHA2_256',@varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes('SHA2_256',@varchar2) as [Hash]

As you can see here, the resulting Hash is the same for both text values.

Matching Hash = Good

But what if I change the data type? Here I’ve added a third variable with an nvarchar(50) data type instead, but with the same text value.

declare @varchar1 varchar(50), @varchar2 varchar(50), @nvarchar nvarchar(50)
set @varchar1 = 'Test Hash Text'
set @varchar2 = 'Test Hash Text'
set @nvarchar = 'Test Hash Text'

select @varchar1 as [Value], hashbytes('SHA2_256',@varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes('SHA2_256',@varchar2) as [Hash]
union all
select @nvarchar as [Value], hashbytes('SHA2_256',@nvarchar) as [Hash]

As you can see here, the resulting Hash is different for the different data type even though the value is the same.

Different Hash = Bad

So what does this mean for our Synapse distribution? Even if you’re distributing your tables on the same values, if you have a data type mismatch you’re going to end up with different hashes and almost certainly a different node for your data. (Because the hash only maps to 60 distributions, some rows may still end up in the same place by chance, but the majority of your data won’t.)

To fix this, add a new column to your table in Synapse with the data type that you want to use for this column across all of your tables, and Hash Distribute by that new column. The easiest way to do this is with the Create Table As Select (CTAS) command, creating a new table with all of the data and the new data type. Add the new column in the SELECT statement with a CAST/CONVERT to the correct data type and use it for the Hash Distribution, as sketched below. Additional reading on CTAS is available here:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse
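Here’s a hedged sketch of that fix using the hypothetical tables from earlier (adjust the names, types and column list for your own schema): CTAS a new copy of the table with the column cast to the matching type, distribute on it, then rename the tables to swap them over.

-- Rebuild DimCustomer with CustomerId cast to varchar(50), so it hashes the same
-- way as FactSales.CustomerId, and distribute on that column.
CREATE TABLE dbo.DimCustomer_New
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  CAST(CustomerId AS varchar(50)) AS CustomerId,
        CustomerName
FROM    dbo.DimCustomer;

-- Swap the new table in (test first, and keep the old copy until you're happy).
RENAME OBJECT dbo.DimCustomer     TO DimCustomer_Old;
RENAME OBJECT dbo.DimCustomer_New TO DimCustomer;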

So I’ve now put this in place and what does my execution plan look like?

No Shuffle!

That’s much better! No shuffle where there shouldn’t be one and query times against this large amount of data take <5% of the time they took before.

Any questions? Let me know in the comments below or contact me on twitter: http://www.twitter.com/PurpleFrogReiss

Azure Analysis Services Cube Backup

This blog is a quick guide on how to back up an Azure Analysis Services cube using PowerShell. The PowerShell is used in a runbook inside an Automation account in Azure to automate the backup process. A pre-requisite for this is to configure the backup storage settings, which specify where the backups are stored.

Two parameters are passed into the runbook (shown below), where AnalysisServerName is the server name for the analysis services cube and AutomationCredentialName is the name of the credential for the Automation account in Azure.

A variable is used to store the current date, obtained using the Get-Date cmdlet; this is used as part of the backup name.

The Get-AzureAutomationCredential command is used to get the credentials for the Automation account; in this case the credential name is one of the parameters passed into the runbook. The result is saved to a variable that will be used as a parameter for the backup command.

And finally, the backup command completes the backup of the analysis services cube to a file. Several parameters are used in this command: -BackupFile is the name given to the backup file, -Name is the name of the cube being backed up, -Server is the server name (passed in as a parameter), -Credential is the credentials obtained in the previous step, and -ApplyCompression enables the backup with compression.
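Putting those pieces together, the runbook looks something like the sketch below. Treat it as a hedged reconstruction rather than the exact original: the cube name is a placeholder, Backup-ASDatabase is assumed as the backup cmdlet (its parameters match the ones described above), and I’ve used the runbook-internal Get-AutomationPSCredential activity to retrieve the stored credential in place of Get-AzureAutomationCredential, as it returns a PSCredential that can be passed straight to -Credential.

param
(
    [Parameter(Mandatory = $true)] [string] $AnalysisServerName,
    [Parameter(Mandatory = $true)] [string] $AutomationCredentialName
)

# Current date, used as part of the backup file name.
$BackupDate = Get-Date -Format "yyyyMMdd"

# Retrieve the credential stored against the Automation account.
$Credential = Get-AutomationPSCredential -Name $AutomationCredentialName

# Back up the cube to a compressed .abf file in the configured backup storage.
# 'MyCube' is a placeholder for the name of the model being backed up.
Backup-ASDatabase `
    -BackupFile "MyCube_$BackupDate.abf" `
    -Name "MyCube" `
    -Server $AnalysisServerName `
    -Credential $Credential `
    -ApplyCompression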

I hope you have found this useful!

SSAS Tabular Deployment Wizard fails Newtonsoft.Json error

The Scenario

Deploying an Analysis Services Tabular model to SSAS Azure using the Analysis Services Deployment Wizard. Both Visual Studio 2017 & SQL Server 2017 installed on the client.

Trying to click on the ellipsis to change the data source connection string or impersonation information results in a Newtonsoft.Json error:

“Could not load file or assembly ‘Newtonsoft.Json, Version 6.0.0.0, Culture=neutral, ……”


The Solution

If you look at the folder containing Newtonsoft.Json.dll (“C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\ManagementStudio\Extensions\Application” – substitute 140 for your version of SQL Server) and right click > Properties on the Newtonsoft.Json.dll file, you’ll probably find that it’s version 10.0.3 or later, but the error message shows that the deployment wizard is looking for version 6.0.x. We therefore have to tell the SSAS Deployment Wizard to use the newer version instead.

Use Notepad(++) to open the Microsoft.AnalysisServices.Deployment.exe.config file, which you should find in “C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\ManagementStudio” (substitute 140 as above, for your correct version).

Find the section for Newtonsoft.Json

And make a couple of changes (backup the file first!!!)

  • Add in a bindingRedirect line, redirecting 6.0.0.0 to 10.0.0.0
  • Change the codeBase version number to match the actual file version, 10.0.0.0 (see the sketch below)
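After those two edits the Newtonsoft.Json section should look something like the sketch below. This is a hedged example: the publicKeyToken shown is the standard one for Newtonsoft.Json, and the codeBase href should point at wherever the dll actually lives on your machine.

<dependentAssembly>
  <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
  <!-- Redirect requests for the old 6.0.0.0 assembly to the 10.0.0.0 version actually installed -->
  <bindingRedirect oldVersion="6.0.0.0" newVersion="10.0.0.0" />
  <codeBase version="10.0.0.0" href="Extensions\Application\Newtonsoft.Json.dll" />
</dependentAssembly>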

Save and close the file, relaunch the deployment wizard, and you should find it all works ok.

Frog-Blog Out


What is U-SQL?

By now you may have heard about U-SQL, the new kid on the query language block. But what is U-SQL? Where is it? What’s it for?

I was lucky enough to be at the 2015 MVP Summit in Redmond, where one of the sessions was hosted by Michael Rys (@MikeDoesBigData), talking about U-SQL. As its creator, there’s no-one better to learn the ropes from. I was pretty much blown away by what it can do and the ease of access, so I’ve spent the last couple of months playing around with it and doing a few talks at conferences about it.

But there’s a lot of confusion over what it is, so here’s my high level summary of what it is and where it fits, so you can judge whether it’s relevant to you.

What Does U-SQL Look Like?

It’s a hybrid of T-SQL and C#. Think of it as writing a SQL query, but being able to embed the power of C# and the .NET framework.

So you could write a query that looks like this:

@CustomersClean =
  SELECT *
     ,((CustomerGender=="M" || CustomerGender=="F") ? CustomerGender : "-") AS CustomerGender_Clean
     ,MyUSQLApp.Udfs.ValidateDate(CustomerDoB) AS CustomerDoB_Clean
  FROM @Customers;

So the basic SELECT * FROM Table syntax is familiar to anyone used to T-SQL, but with the addition of some C# to clean up the Gender and DoB columns. For the DoB, we’re calling a user-defined C# function (ValidateDate) that sits in an assembly behind the query.
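For context, the @Customers rowset in that snippet would typically come from an EXTRACT over a file sitting in Data Lake Storage, with the cleaned result written back out by an OUTPUT statement. A hedged sketch, with made-up file paths and column names:

@Customers =
    EXTRACT CustomerId int,
            CustomerGender string,
            CustomerDoB string
    FROM "/input/customers.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// ... the @CustomersClean query from above goes here ...

OUTPUT @CustomersClean
TO "/output/customers_clean.csv"
USING Outputters.Csv(outputHeader: true);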

How do I use U-SQL?

U-SQL is a part of Azure Data Lake. Data Lake is a core component in the Cortana Analytics Suite, Microsoft’s cloud/Azure data platform. Think of Data Lake Storage as a massively scalable and high performance staging and permanent storage area for all of your structured, semi-structured and unstructured data, based on the Hadoop File System (HDFS). It can store both tables of data and files (csv, text, images, whatever you want to store).

Once your data is held in Data Lake Storage, you can use Data Lake Analytics to do cool stuff with it. This could include ‘Big Data’ processing and analytics, i.e. running large and complex parallel analytics queries over your data, whether that’s MB, GB, TB or even PB. Or it could be transforming and manipulating your data into different structures.

Either way, Data Lake Analytics uses U-SQL as its language. At the moment this is the only place that U-SQL exists, but given its power and flexibility I’m hoping it will also have a bright future outside of Azure Data Lake.

You can write and execute U-SQL directly in the Azure Portal, or for a much richer development environment you can install some plugins into Visual Studio 2013 or 2015. I recommend the latter.

Download the Azure SDK for .Net (2.7.2+) here

Download the Data Lake Tools for Visual Studio 2013 or 2015 here

So it’s a Big Data tool?

Yes, and no.

Yes it has the scalability, performance and functionality for Big Data, and this is what it’s designed to do. When you write a simple U-SQL script, the Data Lake Analytics engine takes the query and parallelises it, scaling it up to as many nodes as you want. So it’s scalable. Very scalable. It’s being used now for complex big data analytics jobs with TB of data, and can scale way beyond that.

But it’s more than this; I also see it as a data preparation and transformation tool. If you want an Azure based BI solution, maybe using Azure SQL DW, or even a smaller scale system using Azure SQL DB, then until now we haven’t had a PaaS ETL tool in Azure. But now we do! This provides a very powerful and comprehensive platform for all your data transform needs.

What about Azure Data Factory (ADF)? – No, ADF isn’t an ETL tool; it’s an orchestration and coordination tool, and isn’t realistic as a data transformation tool.

What about ELT? – Yes, that’s fair, to a point. But in my experience the ELT approach actually needs to be ETLT to work properly. The second ‘T’ is taken care of inside the SQL engine, but a lot of source data has to be pre-processed before it can even be landed in a SQL Server table, and U-SQL gives us power there that T-SQL cannot.

Summary

I’m not going to go into any more detail here with how to use or write U-SQL, there will be more posts to come on that.

But I hope this provides some clarity on what it is and where it fits in our ever expanding data platform world.

Data Lake went into public preview in October 2015, so it’s ready and waiting for you to have a go!


Frog-blog-out

