
How to Trigger a Power BI Dataset Refresh via Azure Synapse Analytics in 3 Simple Steps.

In this blog post we’ll explore how to trigger a Power BI dataset refresh via Azure Synapse Analytics in 3 simple steps:

Step 1:

Create a new security group called “AzureSynapsePowerBIIntegration” in your Azure Active Directory.

Next, add the name of your Azure Synapse workspace (i.e. its system-assigned managed identity) as a member of your new security group. This will be your service principal.

Step 2:

Log in to powerbi.com and, in the top right-hand corner, locate “Settings” and then “Admin portal”.

Under “Tenant settings” locate “Developer settings” and then “Allow service principals to use Power BI APIs”.

Set this setting to “Enabled” using the toggle. Then, under the heading “Apply to:”, select “Specific security groups (Recommended)”, add the newly created security group “AzureSynapsePowerBIIntegration” and click “Apply”.

All that now remains in the Power BI service is to grant the security group “Member” access to the workspace containing the dataset you want to refresh.

Step 3:

Login to your Azure Synapse Workspace.

Create a new pipeline called “Power BI Dataset Refresh” and then add a new Web activity called “Refresh Dataset”.

Create two new pipeline parameters, “WorkspaceGuidId” and “DatasetGuidId”, both of type String.

Locate the WorkspaceGuidId (also known as the GroupId) and the DatasetGuidId from the URL of the dataset you want to refresh on powerbi.com.

The URL will look like this:

https://app.powerbi.com/groups/{groupId}/datasets/{datasetId}/details

Assign the workspace/group GUID and the dataset GUID to your newly created parameters.

Next, within the settings of the “Web” activity, paste the following dynamic content into the URL section.

@concat('https://api.powerbi.com/v1.0/myorg/groups/',pipeline().parameters.WorkspaceGuidId,'/datasets/',pipeline().parameters.DatasetGuidId,'/refreshes')

And set the other options as per below:

Method:

POST

Body:

{}

Authentication:

System Assigned Managed Identity

Resource:

https://analysis.windows.net/powerbi/api

And that’s it…

All that is left to do now is debug / trigger the pipeline and see the dataset refresh in powerbi.com!

Some final points:

  • If the dataset resides on a non-Premium capacity, you can only schedule up to 8 refreshes per day.
  • If the dataset resides on a Premium capacity, you can schedule up to 48 refreshes per day.
  • Datasets on a Premium capacity have no limit on the number of refreshes triggered via the API.
  • The API refresh request only works when the published dataset is in Import mode rather than DirectQuery.
  • You can switch the method to a “GET” request instead to retrieve the refresh history of the dataset. The output could be saved to a variable and then written back to a database via a stored procedure; a minimal sketch of such a procedure follows this list.
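
To illustrate that last point, the logging procedure could be as simple as the sketch below. The table and procedure names are hypothetical, and the raw JSON returned by the GET request is stored as-is for later analysis.

CREATE TABLE dbo.PowerBIRefreshHistoryLog
(
    DatasetGuidId       varchar(50)   NOT NULL,
    RefreshHistoryJson  nvarchar(max) NOT NULL,
    LoggedAt            datetime2     NOT NULL
);
GO

CREATE PROCEDURE dbo.LogPowerBIRefreshHistory
    @DatasetGuidId      varchar(50),
    @RefreshHistoryJson nvarchar(max)
AS
BEGIN
    -- Store the raw refresh history JSON against the dataset it relates to
    INSERT INTO dbo.PowerBIRefreshHistoryLog (DatasetGuidId, RefreshHistoryJson, LoggedAt)
    VALUES (@DatasetGuidId, @RefreshHistoryJson, GETDATE());
END;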

ADF Dataflow CTE workaround

At the time of writing, it is not possible to write a query using a CTE in the source of a dataflow. However, there are a few options to deal with this limitation:

  • re-write the query using subqueries instead of CTEs
  • use a stored procedure that contains the query and reference the stored proc in the source of the dataflow
  • write the query as a view and reference the view in the source of the dataflow (this is my preferred method and the one I will demo here)

I will use the following query purely as an example to demo this:

This query produces the following output:

If I write this query directly inside the source of the data flow as shown below, when trying to import the schema or preview the data I get the error message: Incorrect syntax near the keyword 'WITH'.

However, if I create a view for this query and reference the view in the dataflow instead, this works and I can preview the data:
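
For example, a view along these lines does the job (the query and object names here are purely illustrative, standing in for whichever CTE query you need):

CREATE VIEW dbo.vw_LatestCustomerSale
AS
WITH cte_RankedSales AS
(
    -- Rank each customer's sales so we can pick out the most recent one
    SELECT CustomerId,
           SalesDate,
           SalesAmount,
           ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY SalesDate DESC) AS rn
    FROM dbo.FactSales
)
SELECT CustomerId, SalesDate, SalesAmount
FROM cte_RankedSales
WHERE rn = 1;

The dataflow source then simply points at dbo.vw_LatestCustomerSale rather than containing the CTE itself.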

Note that in the source you can also write a query referencing the view (shown below); this is useful if you require additional logic on top of the view.

As mentioned earlier, you can also write the query in a stored procedure and reference it in a similar way to the above. You should now be able to use this simple approach to work with CTEs in the source of a dataflow.

How to parameterise the Execute Pipeline activity in Azure Synapse Analytics.

Unfortunately, as of April 2022 there is not an option to parameterise or add dynamic content to an “Execute Pipeline” activity to invoke a pipeline run.

However, with the use of a Microsoft API there is a method we can use to overcome this.

In this blog post we’ll use the “Pipeline – Create Pipeline Run” API documented below:

https://docs.microsoft.com/en-us/rest/api/synapse/data-plane/pipeline/create-pipeline-run

Using this API will allow us to dynamically pass in a pipeline name and trigger it. It will also allow us to dynamically pass in parameters to the called pipelines.

For this example, we’ll create 3 simple pipelines:

  1. “Wait 5 Seconds”
  2. “Wait 10 Seconds”
  3. “Wait 15 Seconds”

Plus a “Control Pipeline” that we’ll get onto later.

Each of the “Wait” pipelines contains a single “Wait” activity and a parameter called “WaitDuration” with a “Type” of “Int”. The idea is that we will dynamically pass in a wait duration from an array.

The “wait time in seconds” option in the settings tab of the “Wait” activity is then parameterised to refer to the parameter as shown below.
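
Since the parameter is called “WaitDuration”, the dynamic content for the “Wait time in seconds” setting is simply:

@pipeline().parameters.WaitDuration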

Next, we’ll build the “Control Pipeline” based on a “For Each” activity which will loop around the array displayed in notepad++ below. This array references the Pipeline Name that we want to dynamically trigger plus the Wait Duration we want it to wait for.
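
The array itself is just a JSON array of objects, something along these lines (the property names must match those referenced later in the Web activity):

[
    {"PipelineName":"Wait 5 Seconds","WaitDuration":5},
    {"PipelineName":"Wait 10 Seconds","WaitDuration":10},
    {"PipelineName":"Wait 15 Seconds","WaitDuration":15}
]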

To use this array we’ll create a “ParameterArray” parameter with “Type” equal to “Array” in the “Control Pipeline”. We’ll set the default value equal to the array from above.

Next, within the settings tab of the “ForEach” activity we have the option of ticking the sequential option and listing the items we want to loop over.

Here we’ve got the “Sequential” option ticked. This just means that it will loop through each item in our array, starting with the very first item. For the “Items” option we’ll add the dynamic content which refers to our “ParameterArray” parameter.
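
That “Items” dynamic content is simply a reference to the parameter:

@pipeline().parameters.ParameterArray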

Next, within the “ForEach” activity we’ll create a “Web” activity with the following options.

URL:

https://{Your Synapse Workspace Name}.dev.azuresynapse.net/pipelines/@{item().PipelineName}/createRun?api-version=2020-12-01

For example:

https://mytestsynapseworkspace.dev.azuresynapse.net/pipelines/@{item().PipelineName}/createRun?api-version=2020-12-01

*Notice here that I’ve added @{item().PipelineName} into the string which is our pipeline name from our “ParameterArray” parameter. This is what will make our Web API POST request dynamic.

Method:

POST

Headers:

Name : Content-Type

Value : application/json

Body:

{"WaitDuration":@{item().WaitDuration}}

*Also notice here that I’ve added @{item().WaitDuration} into the string, which is the wait duration from our “ParameterArray” parameter for each item. This allows us to feed each of the pipelines with its wait duration.

Integration runtime:

AutoResolveIntegrationRuntime

Authentication:

System Assigned Managed Identity

Resource:

https://dev.azuresynapse.net/

Once we’ve entered all of these details we should have a “Web” activity which looks similar to the image below:

Now let’s trigger our “Control Pipeline”:

As we can see, each of our 3 pipelines has been triggered just by triggering the “Control Pipeline”. We can also see that the respective parameters have been passed in successfully to each of the triggered pipelines.

Wait 5 Seconds:

Wait 10 Seconds:

Wait 15 Seconds:

Note, if you’re using Azure DevOps Git CI/CD integration you’ll need to make sure that the pipelines you want to execute already exist in live mode for the API request to recognise them. If not, you’ll get an error code suggesting that the pipeline doesn’t exist.

Also note that there may be some Azure permissions that you’ll need to configure with your Azure Administrator in the Azure portal to allow this to work.

In summary, this method is a trigger-and-forget approach; the ForEach activity in this blog makes no attempt to wait for the called pipeline to succeed, fail or time out. However, with the use of the “Pipeline Runs – Get” API and an “Until” activity you can easily make this work, as sketched below.
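
The createRun response contains a runId, so a second Web activity placed inside the “Until” activity could poll the run status with a GET request to a URL along these lines (the activity name “Create Pipeline Run” is illustrative and should match whatever you called the first Web activity), looping until the returned status property is no longer “InProgress” or “Queued”:

https://{Your Synapse Workspace Name}.dev.azuresynapse.net/pipelineruns/@{activity('Create Pipeline Run').output.runId}?api-version=2020-12-01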

Execute SQL Task in ADF

If you’re new to ADF or Synapse pipelines and looking for a way to execute SQL commands in pipelines, unfortunately there is no equivalent to the Execute SQL Task in SSIS, but I have found some alternatives which work well for me.

For something as simple as a SELECT statement where you want to return results to be used later, you can use a Lookup activity:
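
For example, the query inside the Lookup activity might be something as simple as this (the table and column names are illustrative, matching the DateKey reference used below):

SELECT MAX(DateKey) AS DateKey
FROM dbo.DimDate;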

You can then reference the output of this activity in other activities using dynamic content such as @activity('LookUp').output.firstRow.DateKey.

Although you can also use the Lookup activity for DML commands, it is not designed for this and so I’ve found another way to do this. First, create the following stored procedure in the database:
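
A minimal version of such a procedure might look like this (the procedure name is illustrative; the @sql parameter matches the description below):

CREATE PROCEDURE dbo.ExecuteSQL
    @sql nvarchar(max)
AS
BEGIN
    -- Run whatever SQL command has been passed in
    EXEC sp_executesql @sql;
END;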

This stored procedure will execute the SQL command passed in via the @sql parameter. Now you can use the Stored procedure activity to point to this newly created stored procedure and pass in a SQL command as shown here:

This activity will now execute the SQL command stated in the parameter value. Using this approach helps keep things consistent across your pipelines for SQL commands as they all use the Stored procedure activity, and the only difference is the SQL command used in the parameters section.

You can also add dynamic content to dynamically build up the SQL command that is passed into the stored procedure; this allows you to use the same configuration for multiple Stored procedure activities but with different outcomes. For example, if you had pipeline variables, the above TRUNCATE TABLE command could be replaced with @concat('TRUNCATE TABLE ',variables('SchemaName'),'.',variables('TableName')) and this would truncate a different table per pipeline based on the variable values.

Note: there is a limit on the number of characters the parameter value can take, so if you have a large SQL command it is best to create a stored procedure specifically for that command and to reference that stored procedure in the activity.

Pretty simple in the end, I hope this helps!

Azure Synapse Series: What is Workload Management (Part 2)

So in part 1 we found out what Workload Management was and why we wanted to use it. (If you haven’t read that part, please click here and read that post before you carry on, as we’re jumping in at the deep end!) In this part we’ll be learning how to configure it.

Continuing from the end of the previous post, you have two main options for configuring this. The first is to assign a user or role to one of the existing system-defined workload groups; the second is to create a custom workload group and assign a user or role to that instead.

First let’s look at adding a user to an existing workload group. Choose the Workload Group you’d like to add to and click the number in the Classifiers column.

Click the Blue “1”
Then “Add classifier”
This is the complicated part!

Name – Give the Classifier a name (This can be anything)

Member – Select the user/role that you’d like this to apply to. (Run SELECT USER_NAME() against your SQL Pool to find the actual username you are authenticated with. If you have sa access, this may be replaced with “sa”.)

Request importance – This offers 5 levels that can be used to prioritise conflicting queries. (Think about how important it is for different roles within your environment to access data quickly, and who should go first when there are conflicts).

Label – When you run a query you can add OPTION (LABEL = 'Your_Label_Name') to the end of it, and whatever label name you put in will be applied to that query. If the label matches any of the classifiers then it will use that resource class. (Think about having an ‘ETL’ label or ‘ML’ or something similar).

Start and End Time – The times between which the Classifier will be active. (You may only want to grant increased resource usage during particular times. Think about an overnight process needing a large amount of resource, compared to the same user/role only needing a reduced amount during the day due to conflicting priorities.)

Session Context – When you make a connection to your SQL Pool, rather than having to apply a label to every query, you can instead apply a Session Context that will last for that entire connection. This is a session-bound value rather than the usual batch-bound label. You can create it by running EXEC sys.sp_set_session_context @key = 'wlm_context', @value = 'Your_Session_Name' at the start of your connection. The session name you enter will then apply to this Classifier.
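
For reference, a classifier combining these options can also be created with T-SQL rather than through the portal; the sketch below uses illustrative names and values:

CREATE WORKLOAD CLASSIFIER [ETL_Overnight]
WITH
(
    WORKLOAD_GROUP = 'xlargerc',          -- the workload group to classify into
    MEMBERNAME     = 'ETLUser',           -- the user or role this applies to
    WLM_LABEL      = 'ETL',               -- matches OPTION (LABEL = 'ETL') on queries
    WLM_CONTEXT    = 'Your_Session_Name', -- matches the session context value
    START_TIME     = '22:00',
    END_TIME       = '06:00',
    IMPORTANCE     = above_normal
);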

These can all be combined in a variety of ways, but each of them has a specific weighting:

Weights: user (64), role (32), label (16), session context (8), start/end time (4).

These add together, so if you have a user in two different classifiers, one with a label (64 + 16) and the other with a context and time (64 + 8 + 4), then the one with the label would be chosen as it has the higher total: 80 to 76.

Secondly, creating your own custom Workload Group allows you to be more precise with the allocation of system resources. To create one, click the “New workload group” button. There are three suggestions and a custom option; the three suggestions apply some values as a guide, whilst the Custom option just provides default values.

Suggestions and Custom

After this is created, you add Classifiers in exactly the same way as above.
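
Custom workload groups can likewise be created in T-SQL; a sketch with illustrative values:

CREATE WORKLOAD GROUP [wgDataLoads]
WITH
(
    MIN_PERCENTAGE_RESOURCE            = 25,  -- resources reserved for this group
    CAP_PERCENTAGE_RESOURCE            = 50,  -- maximum resources the group can use
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5,   -- minimum resources per query
    IMPORTANCE                         = normal
);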

This is a very complex thing to get right, so here are my final thoughts:

Think about system usage, volume of users vs performance of queries. You can’t assign every user 100% of resources. You’ll cause waits all over the place. How many users are querying at once? How complex are the queries?

Create Custom Workload Groups and Assign Classifiers. The system-defined groups are a great starting point to get away from smallrc, but not great for flexibility.

Assign roles to workload groups with different priorities. How important is it that the report loads quickly for an end user vs a manager vs the board..?

Assign times for specific workloads, overnight ELT, large ML query etc. 100% is sometimes okay… maybe… Scale Up, use 100% resource for intensive Transformation or ML query, Scale Down for everyday usage?

Test, monitor, amend, monitor, amend, monitor, amend… Constant Development Cycle. Listen to users, find out what is going wrong, monitor the usage stats during heavy load on the server, change the configuration to improve things.

Additional reading can be done on the Microsoft website: Here

Any questions? Let me know in the comments below or contact me on twitter: @PurpleFrogReiss

Azure Synapse Series: What is Workload Management (Part 1)

So you’ve started using Synapse SQL Pools, you’ve scaled up to improve performance, but your queries aren’t going as fast as you’d hoped. You take a look at resource utilisation and you see that as you scale up, your total resource utilisation per query goes down. But didn’t we scale up to allow our query to use more resource and run quicker? What’s going on here? A little thing called Workload Management.

What is Workload Management? At a very high level, Workload Management allows you to control how your workload uses system resources. The main problem here is defaults!

If you navigate to the Workload Management tab of your SQL Pool and click the “User-defined workload groups : All” button you’ll be able to select the “System-defined workload groups” option.

This will allow you to see the pre-created groups and the relevant stats for what resources they’ll utilise.
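
If you prefer T-SQL, the configured groups can also be listed by querying the catalog view directly against the SQL pool:

SELECT name,
       importance,
       min_percentage_resource,
       cap_percentage_resource,
       request_min_resource_grant_percent,
       request_max_resource_grant_percent
FROM sys.workload_management_workload_groups;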

Great! So what’s the problem? smallrc is the problem. smallrc is the default Workload Group used for all queries against your SQL Pool. If you look at the chart above, you’ll see that the minimum and maximum resources allocated per query in the Effective Values part of the table show 25.00. This is 25% of the total resources available. This means a single query will never be able to utilise the full performance of the SQL Pool. When we scale, this gets worse. The number in brackets will be the new value. For now we’ll focus on smallrc and xlargerc.

Scaled to DW500 only allows 5% resources for smallrc queries
Scaled to DW3000 only allows 3% resources for smallrc queries.

To test that this is doing what we expect, I started with my SQL Pool at DW100, then DW500, then DW3000. I ran a single query, without storing query results, under each of the xlargerc and then smallrc Workload Groups. The results below show the total utilised resources for each of these queries. As you scale up you can see the disparity in performance can be quite alarming.

xlargerc then smallrc for each DW100, then DW500, then DW3000.

Each query returned a resultset over a set period of time; the larger the returned resultset, the better the performance. smallrc increased slightly with each scaling, whereas xlargerc increased massively. The graph above and the resulting data I received show that scaling without changing Workload Management settings will not give you the performance you’re hoping for.

So now we’ve proven the premise and shown the results, how do we actually get this to work? You have two main options: the first is to assign a user or role to one of the existing system-defined workload groups (which is what I did for the tests above); the second is to create a custom workload group and assign a user or role to that instead.

Part 2 of this blog post goes into more detail on how to configure workload management and the details of the different options and is available: Here. Additional reading can be done on the Microsoft website: Here

Any questions? Let me know in the comments below or contact me on twitter: @PurpleFrogReiss

Azure Synapse Series: Hash Distribution and Shuffle

For this post I’m going to presume you’ve already taken a look at distributing your data using a hash column, and you’re not experiencing the performance you’re expecting. (If you’re not already aware of what this is, take a look at the following link to learn the basics of what a distributed table is and why you need it in Azure Synapse. I’ll be here when you get back.)

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

So now that we’re on the same page, one of the common issues encountered is understanding what Hash Distribution actually means and how it functions. In basic terms, the column you choose to distribute by gets converted into a hash using a deterministic hash function, which creates the same value for any identical values passed in. Rows whose values in the hashed column(s) match are therefore placed on the same compute node, which stops data being passed between compute nodes during joins and massively increases query performance.

Great! So using the knowledge above I’ve created two tables, using a hash distribution and the data that I’m hashing by is the same value in each table. Let’s run an estimated execution plan to see how the joins work!

Shuffle 100%?

Oh dear. Shuffle. On the surface, shuffle is not necessarily a bad thing and may even be expected, but not in this scenario. So what is shuffle and why can it be bad?

Shuffle occurs when part of a distributed table is moved to a different node during query execution. To do this a hash value is computed using the join columns, the node that owns that hash value is found, and the row is then sent to that node for processing. If the tables you’re joining on have different hash distributions you’d expect this to happen (hence why it’s not always a bad thing!), but in this case I’ve used the same value to hash both of my tables, so what’s gone wrong?

Two words: Data Types

In my query below I have created two variables of the same data type (varchar(50)), given them the same text value (‘Test Hash Text’), and then run the SHA2_256 hashing function over them to see the results. Running this query against your database will provide the same result every time when using the same hashing function.

declare @varchar1 varchar(50), @varchar2 varchar(50)
set @varchar1 = 'Test Hash Text'
set @varchar2 = 'Test Hash Text'

select @varchar1 as [Value], hashbytes('SHA2_256',@varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes('SHA2_256',@varchar2) as [Hash]

As you can see here, the resulting Hash is the same for both text values.

Matching Hash = Good

But what if I change the data type? Here I’ve added a third variable with a nvarchar(50) data type instead, but with the same text value.

declare @varchar1 varchar(50), @varchar2 varchar(50), @nvarchar nvarchar(50)
set @varchar1 = 'Test Hash Text'
set @varchar2 = 'Test Hash Text'
set @nvarchar = 'Test Hash Text'

select @varchar1 as [Value], hashbytes('SHA2_256',@varchar1) as [Hash]
union all
select @varchar2 as [Value], hashbytes('SHA2_256',@varchar2) as [Hash]
union all
select @nvarchar as [Value], hashbytes('SHA2_256',@nvarchar) as [Hash]

As you can see here, the resulting Hash is different for the different data type even though the value is the same.

Different Hash = Bad

So what does this mean for our Synapse distribution? Even if you’re distributing your tables on the same values, if you have a data type mismatch you’re going to end up with a different hash and an almost guaranteed different distribution for your data. (By chance some rows may still land in the same place, as there are only 60 distributions, but not the majority of your data.)

To fix this, create a new column in your table in Synapse with the data type you want to use across all tables for this column, and hash distribute by that new column. The easiest way to do this is using the Create Table As Select (CTAS) command to create a new table with all of the data plus the new column. Add the new column in the select statement with a CAST/CONVERT to the correct data type and use it for the hash distribution. Additional reading on CTAS is available here:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse
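
A sketch of what that CTAS might look like is below; the table and column names are illustrative, and the key points are the CAST in the select list and the HASH distribution on the new column:

CREATE TABLE dbo.FactSales_New
WITH
(
    DISTRIBUTION = HASH(CustomerKeyNew),  -- distribute on the consistently-typed column
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT fs.*,
       CAST(fs.CustomerKey AS varchar(50)) AS CustomerKeyNew
FROM dbo.FactSales AS fs;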

So I’ve now put this in place and what does my execution plan look like?

No Shuffle!

That’s much better! No shuffle where there shouldn’t be one and query times against this large amount of data take <5% of the time they took before.

Any questions? Let me know in the comments below or contact me on twitter: http://www.twitter.com/PurpleFrogReiss
