0845 643 64 63

Azure

Azure Synapse Series: What is Workload Management (part1)

So you’ve started using Synapse SQL Pools, you’ve scaled up to improve performance, but your queries aren’t going as fast as you’d hoped. You take a look at resource utilisation and you see that as you scale up, your total resource utilisation per query goes down. But didn’t we scale up to allow our query to use more resource and run quicker? What’s going on here? A little thing called Workload Management.

What is Workload Management? At a very high level, Workload Management allows you to control how your workload uses system resources. The main problem here is defaults!

If you navigate to the Workload Management tab of your SQL Pool and click the “User-defined workload groups : All” button you’ll be able to select the “System-defined workload groups” option.

This will allow you to see the pre-created groups and the relevant stats for what resources they’ll utilise.

Great! So what’s the problem? smallrc is the problem. smallrc is the default Workload Group used for all queries against your SQL Pool. If you look at the chart above, you’ll see that the minimum and maximum resources allocated per query in the Effective Values part of the table show 25.00. This is 25% of the total resources available. This means a single query will never be able to utilise the full performance of the SQL Pool. When we scale, this gets worse. The number in brackets will be the new value. For now we’ll focus on smallrc and xlargerc.

Scaled to DW500 only allows 5% resources for smallrc queries
Scaled to DW3000 only allows 3% resources for smallrc queries.

To test this is doing what we expect, I started with my SQL Pool at DW100, then DW500 then DW3000. I ran a single query without storing query results, under each of the xlargerc and then smallrc Workload Groups. The results below show the total utilised resources for each of these queries. As you scale up you can see the disparity in performance can be quite alarming.

xlargerc then smallrc for each DW100, then DW500, then DW3000.

The performance of these queries returned a resultset over a set period of time, the larger the returned resultset, the better the performance. smallrc increased slightly with each scaling, whereas the xlargerc increased massively. This graph above and the resulting data I received, show that scaling without changing Workload Management settings will not give you the performance you’re hoping for.

So now we’ve proven the premise and shown the results, how do we actually get this to work? You have two main options, first is to assign a user or role to one of the existing system defined workload groups, (which is what I did for the tests above), second is to create a custom workload group and assign a user or role to this instead.

Part 2 of this blog post will go into more detail on how to configure workload management and the details of the different options. For now additional reading can be done on the Microsoft website: Here

Any questions? Let me know in the comments below or contact me on twitter: @PurpleFrogReiss

Azure Synapse Series: Hash Distribution and Shuffle

For this post I’m going to presume you’ve already taken a look at distributing your data using a hash column, and you’re not experiencing the performance you’re expecting. (If you’re not already aware of what this is, take a look at the following link to learn the basics of what a distributed table is and why you need it in Azure Synapse. I’ll be here when you get back.)

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

So now that we’re on the same page, one of the common issues encountered is trying to understand what Hash Distribution actually means and how it functions? In basic terms the column you choose to distribute by gets converted into a hash using a deterministic hash function, which creates the same value for any identical values passed. This places different rows of data on the same compute node, where the column(s) you have used to hash by match. This increases query performance and stops data being passed between Compute Nodes, massively increasing the performance of queries.

Great! So using the knowledge above I’ve created two tables, using a hash distribution and the data that I’m hashing by is the same value in each table. Let’s run an estimated execution plan to see how the joins work!

Shuffle 100%?

Oh dear. Shuffle. On the surface, shuffle is not necessarily a bad thing and may even be expected, but not in this scenario. So what is shuffle and why can it be bad?

Shuffle occurs when a part of a distributed table is moved to a different node during query execution. To do this a hash value is computed using the join columns, the node is then found that has that hash value and the row is then sent to that node for processing. If the tables you’re joining on have different hash distributions, you’d expect this to happen, (hence why it’s not always a bad thing!), but in this case I’ve used the same value to hash both of my tables, so what’s gone wrong?

Two words: Data Types

In my query below I have created two variables of the same data type (varchar(50)), given them the same text value, (Test Hash Text), and then run the hashing function of SHA2_256 over them to see the results. Running this query against your database will provide you with the same result every time when using the same hashing function.

declare @varchar1 varchar(50), @varchar2 varchar(50)
set @varchar1 = ‘Test Hash Text’
set @varchar2 = ‘Test Hash Text’

select @varchar1 as [Value], hashbytes(‘SHA2_256’,@varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes(‘SHA2_256’,@varchar2) as [Hash]

As you can see here, the resulting Hash is the same for both text values.

Matching Hash = Good

But what if I change the data type? Here I’ve added a third variable with a nvarchar(50) data type instead, but with the same text value.

declare @varchar1 varchar(50), @varchar2 varchar(50), @nvarchar nvarchar(50)
set @varchar1 = ‘Test Hash Text’
set @varchar2 = ‘Test Hash Text’
set @nvarchar = ‘Test Hash Text’

select @varchar1 as [Value], hashbytes(‘SHA2_256’,@varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes(‘SHA2_256’,@varchar2) as [Hash]
union all
select @nvarchar as [Value], hashbytes(‘SHA2_256’,@nvarchar) as [Hash]

As you can see here, the resulting Hash is different for the different data type even though the value is the same.

Different Hash = Bad

So what does this mean for our Synapse Distribution? Even if you’re distributing your tables on the same values, if you have a data type mismatch, you’re going to end up with a different Hash and an almost guaranteed different Node for your data. (Due to randomness, you may end up on the same Node by chance as there are a limited number of 60 Nodes, but not for the majority of your data).

To fix this, create a new computed column in your table in Synapse that has the same data type that you want to use across all tables using this same column, and Hash Distribute by that new column. The easiest way to do this is using the Create Table as Select (CTAS) command to create the new table with all of the data and a new data type. Add the new column in the select statement with a CAST/CONVERT to correct data type and use that for Hash Distribution. Additional reading on CTAS available here:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse

So I’ve now put this in place and what does my execution plan look like?

No Shuffle!

That’s much better! No shuffle where there shouldn’t be one and query times against this large amount of data take <5% of the time they took before.

Any questions? Let me know in the comments below or contact me on twitter: http://www.twitter.com/PurpleFrogReiss

Azure Analysis Services Cube Backup

This blog is a quick guide on how to back up an Azure Analysis Services cube using PowerShell. The PowerShell is used in a runbook inside an Automation account in Azure to automate the backup process. A pre-requisite for this is to configure the backup storage settings, this details the location of the backups.

Two parameters are passed into the runbook (shown below), where AnalysisServerName is the server name for the analysis services cube and AutomationCredentialName is the name of the credential for the Automation account in Azure.

The following variable is used to store the current date, obtained using the Get-Date cmdlet. This will be used as part of the backup name.

The command Get-AzureAutomationCredential is used to get the credentials for the Automation account provided, in this case it is one of the parameters provided for the runbook. The result is saved to a variable that will be used as a parameter for the backup command.

And finally, the backup command completes the backup of the analysis services cube to a file. Several parameters are used in this command: -BackupFile is the name given to the backup file, -Name is the name of the cube being backed up, -Server is the server name (passed in as a parameter), -Credential is the credentials obtained in the previous step, and -ApplyCompression enables the backup with compression.

I hope you have found this useful!

ADF Switch Activity – A neat solution for nested IFs.

Whilst carrying out some work for a client using Azure Data Factory I was presented with the challenge of triggering different activities depending on the result of a stored procedure.

In fact the challenge posed was to…

  • Execute ‘Copy A’ activity if the result of a stored procedure returned (A),
  • Execute ‘Copy B’ activity if the result of a stored procedure returned (B),
  • Execute ‘Copy C’ activity if the result of a stored procedure returned (C)
  • Finally execute an error stored procedure if the result of the stored procedure returned (D).

Prior to the switch statement I could achieve this using 4 ‘IF’ activities connected to a lookup activity as shown in the snip below using my ‘Wait’ example pipeline.

However a neater solution is to use the ‘Switch’ activity to do this work instead. I’ll now jump straight into a worked example to show you how I achieved this.

I created a demo.RandomNumberGenerator stored procedure in our Purple Frog Demo database which declares a variable and sets it equal to a random number between 1 and 4. The stored procedure then uses this variable in a case statement to return a string. So if the @randomnumber variable returns 1 the case statement returns ‘Wait 1 Minute’.

I then used a lookup activity in ADF to return the result of this stored procedure which I then used to ‘feed’ my ‘Switch’ activity.

When I preview this lookup activity it just returns the result of the stored procedure. When this is run time and time again it just returns a random ‘ADFAction’ as named in my case statement generated by my rand() SQL function. So in the scenario above it returned an action of ‘Wait 1 Minute’. If I were to preview the stored procedure again it might return ‘Wait 4 Minutes’ instead.

I can then connect my switch activity to this lookup activity using the ‘success’ green output connector.

I now need to add some dynamic content to refer to the output of my lookup activity. I don’t need to create dynamic nested IF content to say if my stored procedure equals ‘Wait 1 Minute’ then do this, if my stored procedure equals ‘Wait 2 Minutes’ then do this… all I need to do is refer to the output of the connected lookup activity ‘Random Number Lookup Activity’.

So in my example this is simply just @activity(‘Random Number Lookup Activity’).output.firstrow.adfaction. ADF will then match the output of the stored procedure to the case statements I provide it. The default activity is just a set of activities that are executed when the expression evaluation isn’t satisfied. I.e. the output of my the stored procedure doesn’t match any of the case statements I provide it.

In the snip below I have provided 4 case statements which match all of the possible outputs from my stored procedure. Essentially there isn’t any need for default activity as my stored procedure will always return 1 of 4 results but it’s useful to know for future reference.

Within each case statement there is an activity for which I would like ADF to perform if it matches the result of my stored procedure. In my example these are just ‘Wait’ activities. So if the stored procedure returns ‘Wait 1 Minute’ I’ve put a ‘Wait’ activity within the case statement to wait 1 minute (or 60 seconds).

When I first ran this pipeline in debug mode I could see that the output of the stored procedure was ‘Wait 2 Minutes’. The switch activity then matched this to the case statements I provided and performed the activity within the ‘Wait 2 Minutes’ case statement and triggered my ‘Wait 2 Minutes’ activity.

When I debugged another run my stored procedure returned ‘Wait 4 Minutes’. The switch activity then executed my ‘Wait 4 Minutes’ activity. Perfect!

So it’s pretty much as simple that!

I hope this helps all you ADF’ers out there!

ADF: Where have my connections gone?

If you’ve been doing any development work in ADF this week you might have noticed that “Connections” has moved. But where has it gone?

The old location

When you click onto “Connections” now you’ll receive the following message:

Management Hub?

Clicking this button takes you to the new Management Hub, the new fourth icon which goes alongside the existing three.

New Manage (Management Hub)

Within this area you now have access to your Linked Services, Integration Runtimes and Triggers. These all have the same options as before, they’ve just moved!

Moved options and New options

You do now have access to more detailed Git configuration options and also the ability to change the Parameterization templates for your ARM templates. This is only accessible when you have a Git repository set up and the Parameter file within it.

You can currently still access Triggers from the main Author window, but I would recommend getting used to finding them in the Management Hub to keep everything together.

Release details of the Management Hub from Microsoft can be found here: https://docs.microsoft.com/en-us/azure/data-factory/author-management-hub

Azure Data Factory (ADF) Parent Activities don’t trigger ON ERROR OUTPUT when Child Activities Error… What happens?

So that title is a big question that a client of ours came across that required some additional investigation… What does happen and how do we get around it?

In the below example we’ll use a simple IF Activity as our Parent Activity. To explain briefly how it functions, you provide it with a query that evaluates to either True of False. This then passes down to a set of Activities If True and a different set of Activities If False. We then have three Outputs, On Success, On Failure and On Completion. This looks like:

(Image 1) Pipeline with If Activity and three outputs

The IF Activity therefore has logically four different things it does given the query criteria:

  1. If the query evaluates correctly, then:
    • If True, Pass to True Activities
    • If False, Pass to False Activities
  2. Error Activity if the query doesn’t evaluate to True or False.

Following that, we have the below options:

  • When 1. If True Activities complete with Success, Trigger On Success Output and On Completion Output
  • When 1. If False Activities complete with Success, Trigger On Success Output and On Completion Output
  • When 2. completes with Error, Trigger On Error Output and On Completion Output

The issue occurs with the below options:
What happens when the If True Activities complete with Error?
What happens when the If False Activities complete with Error?

Logically you would expect that to cause the IF Activity to error as well. Right? WRONG!

Although it shows as if it has an error in the pipeline view:

(Image 2) If Condition Error, without passing down On Success or On Failure?

The details show “Activity failed because an inner activity failed.”:

(Image 3) If Condition Error Message

This means that the IF Activity itself has still started down it’s success pathway and is now just showing the error from further down. This causes the IF Activity to complete with neither Success or Error! (You can see this in Image 2 above). You can also see that the On Completion Output has triggered successfully.

This means that the Error Output of the If Activity can only be used to handle Errors on the Evaluation of the If Query itself. (When it can’t get to true or false, or can’t execute for another reason).

If you want to monitor the steps inside the IF Activity, you’ll need to put those output pathways inside the True and/or False activities themselves and handle this at a deeper level.

I have launched a new twitter account http://www.twitter.com/PurpleFrogReiss so please follow me there for updates on when I post new blog posts!

SSAS Tabular Deployment Wizard fails Newtonsoft.Json error

The Scenario

Deploying an Analysis Services Tabular model to SSAS Azure using the Analysis Services Deployment Wizard. Both Visual Studio 2017 & SQL Server 2017 installed on the client.

Try and click on the ellipses to change the data source connection string or impersonation information results in a Newtonsoft.json error:

“Could not load file or assembly ‘Newtonsoft.Json, Version 6.0.0.0, Culture=neutral, ……”

 

The Solution

If you look at the folder containing Newtonsoft.Json.dll (2C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\ManagementStudio\Extensions\Application” – substitute 140 for your version of SQL Server) and then right click properties on the Newtonsoft.Json.dll file you’ll probably find that it’s version 10.0.3 or later, but the error message shows that the deployment wizard is looking for version 6.0.x. We therefore have to tell the SSAS Deployment Wizard to use a newer version instead.

Use Notepad(++) to open the Microsoft.AnalysisServices.Deployment.exe.config file, which you should find in “C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\ManagementStudio” (substitute 140 as above, for your correct version).

Find the section for Newtonsoft.Json

And make a couple of changes (backup the file first!!!)

  • Add in a bindingRedirect line, redirecting 6.0.0.0 to 10.0.0.0
  • Change the codeBase version number to match the actual file version, 10.0.0.0

Save and close the file, relaunch the deployment wizard, and you should find it all works ok.

Frog-Blog Out

 

Lookups and If Conditions in Azure Data Factory v2 (ADFv2)

Azure Data Factory v2 (ADFv2) has some significant improvements over v1, and we now consider ADF as a viable platform for most of our cloud based projects. But things aren’t always as straightforward as they could be. I’m sure this will improve over time, but don’t let that stop you from getting started now.

This post provides a walk through of using the ‘Lookup’ and ‘If Condition’ activities to do some basic conditional logic depending on the results of a database query.

Assumptions: You already have an ADF pipeline created. If you want to hook into SSIS then you’ll also need the SSIS Integration Runtime set up – although this is not relevant just for the if condition.

Scene setting: We want to execute an SSIS package, but only if the package is enabled in a config table, and in a ‘Waiting’ state. We can determine this state by calling a stored proc in a database.

Solution: Use a ‘Lookup’ activity to call the stored proc, with the single row output saved to a Dataset. Then an ‘If Condition’ activity to check the results of an expression, followed by whatever activities you want to call when the expression equates to True or False. the overall pipeline will look something like this.

Now we’ll look at some of the configuration of the activities. Before we create the pipeline we need to add a Dataset (actually a data source), pointing to the Azure SQL DB that holds our stored proc. Click the + to add a Factory Resource, and select Dataset, followed by Azure SQL Database.

In the ‘Connection’ tab, click ‘+ New’ to create a Linked Service, and enter the connection string details to your Azure SQL DB. You don’t need to select a table, as this dataset will just be for calling a stored proc, so leave it unselected. I’ve called my dataset ‘ComponentState’.

Now add a pipeline by clicking the + in the Factory Resources pane but this time selecting ‘Pipeline’.

Firstly add a Lookup activity, and set its Source Dataset to the one you’ve just created, and then select the stored proc from the dropdown list. Or you could write a query here instead. If you’ve selected a Stored Proc, you can click on ‘Import Parameter’ to pre-populate the parameters into the section below.

Note that the ‘First row only’ checkbox is selected by default. This is correct if you only want a single result row. If you want the full resultset then untick this. (Note that the remaining steps here would require different expressions).

Then add the ‘If Condition’ activity, and link the two by dragging the green box from Lookup onto the If Condition activity:


In the Settings tab, we need to specify the condition expression to use. We want to compare the ‘ComponentState’ field of the proc to the string ‘Waiting’. So we use the equals() function, although we could also use ==.

The expression is:

@{equals(activity(‘Lookup1′).output.firstRow.ComponentState,’Waiting’)}

The @{ } denotes an expression that will be evaluated. Anything outside these brackets will not get evaluated, just left as a string. The first parameter to the equals function is “activity(‘Lookup1’).output.firstRow.ComponentState“, or the ComponentState field of the first row of the output of the Lookup1 activity. We then compare this to the hard coded string ‘Waiting’.

This will either return True or False, the outcomes of which we can configure in the ‘Activities’ tab,

In the ‘If true Activities’ section, just add the ADFv2 activities that you want to happen if your expression evaluates to True, and the same for False. Note that when you edit this, you get a complete canvas to add as many activities you want, I’ve just added one here, to call an SSIS package.

Hope that’s been useful.

<Frog-Blog Out>

Power BI Sentinel
The Frog Blog

Team Purple Frog specialise in designing and implementing Microsoft Data Analytics solutions, including Data Warehouses, Cubes, SQL Server, SSIS, ADF, SSAS, Power BI, MDX, DAX, Machine Learning and more.

This is a collection of thoughts, ramblings and ideas that we think would be useful to share.

Authors:

Alex Whittles
(MVP)
Reiss McSporran
Jeet Kainth
Jon Fletcher
Nick Edwards
Joe Billingham
Microsoft Gold Partner

Data Platform MVP

Power BI Sentinel
Frog Blog Out
twitter
rssicon