In this blog post we’ll explore how to trigger a Power BI Dataset via Azure Synapse Analytics in 3 simple steps:
Create a new security group called “AzureSynapsePowerBIIntegration” in your Azure Active Directory.
Next add the name of your Azure Synapse Workspace as a member of your new security group. This will be your service principal.
Log in to powerbi.com and in the top right-hand corner locate “Settings” and then “Admin portal”.
Under “Tenant settings” locate “Developer settings” and then “Allow service principals to use Power BI APIs”.
Set this service to “Enabled” using the toggle. Next under the heading “Apply to:” select “Specific security groups (Recommended)”. Next add the newly created security group “AzureSynapsePowerBIIntegration” and click apply.
Now all that remains in the Power BI service is to assign the security group “Member” access to the workspace containing the dataset you want to refresh.
Log in to your Azure Synapse Workspace.
Create a new pipeline called “Power BI Dataset Refresh” and then create a new web activity called “Refresh Dataset”.
Create two new parameters “WorkspaceGuidId” and “DatasetGuidId” both as strings.
Locate the WorkspaceGuidId (also known as the GroupId) and the DatasetGuidId from the URL of the dataset you want to refresh via powerbi.com
The URL will look something like the below, with the two GUIDs being the values you need: https://app.powerbi.com/groups/{WorkspaceGuidId}/datasets/{DatasetGuidId}/details
Assign the Workspace/GroupId guid and the DatasetId guid to your newly created parameters.
Next within the settings of the “Web” activity paste the following dynamic content into the URL section.
And set the other options as per below:
System Assigned Managed Identity
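The original screenshots aren’t reproduced here, but as a sketch (based on the Power BI REST API’s refreshes endpoint; the exact expression in your pipeline may differ slightly), the “Refresh Dataset” Web activity settings typically look like this:

```
URL:            @concat('https://api.powerbi.com/v1.0/myorg/groups/',
                        pipeline().parameters.WorkspaceGuidId,
                        '/datasets/',
                        pipeline().parameters.DatasetGuidId,
                        '/refreshes')
Method:         POST
Body:           {}
Authentication: System Assigned Managed Identity
Resource:       https://analysis.windows.net/powerbi/api
```

The managed identity is the Synapse workspace name you added to the security group earlier, which is why no credentials need to be stored in the pipeline.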
And that’s it…
All that is left to do now is debug / trigger the pipeline and see the dataset refresh in powerbi.com!
Some final points:
- If the dataset resides on a non-premium capacity you can only schedule up to 8 refreshes per day.
- If the dataset resides on a premium capacity you can schedule up to 48 refreshes per day.
- Datasets on a premium capacity have no limit on the number of API refreshes.
- The API refresh request only works when the published dataset is set to import mode rather than DirectQuery.
- You can switch the method to a “GET” request instead and retrieve the refresh history of the dataset. The output of this could be saved as a variable and then subsequently written back to a database via a stored procedure.
It is always good practice to do as much data preparation as close to the sources as you can before importing or connecting them to your Power BI reports, but what if there are circumstances where this isn’t possible?
I had an issue recently where a third-party application had been updated and both the new and legacy versions were being used side-by-side. Logging data from both versions was being written to two separate Azure SQL databases.
The customer needed a Power BI report showing both old and new logging data sets as a single source. If both databases had been on the same SQL Server instance, I could have written a view with a cross-database join, imported that into Power BI and thought no more about it. However, the two sources were separate Azure SQL Databases, which don’t support cross-database joins, so there was no easy way to join the tables.
This is where the Merge and Append functions in Power Query come in.
The first step to solve my issue was to create a view in each of the Azure databases, identical in structure (you’ll see why later on), and import these into Power BI.
Now that these data sources (referred to as ‘Queries’ by Power BI) have been imported into the data model, we have two options for combining them: ‘Append’ and ‘Merge’.
Although I didn’t use the ‘Merge’ function, I have included some information about it here as it is still relevant. ‘Merge’ is useful when you have columns from one source that you would like to add to another. The simplest way to think about it is that it works in the same way as a JOIN in SQL; in fact, when you enter the ‘Merge’ wizard there is a ‘Join Kind’ option:
This is how the ‘Merge’ function works:
To solve my issue (explained above), I used the ‘Append’ function. While still combining two sources, it concatenates one query onto another; its SQL equivalent would be a UNION ALL. ‘Append’ gives you the option to combine as many tables as you wish, regardless of structure. If a column exists in one query but not another, the missing values will be filled with NULLs, which is why it was important to create the two identical views at the very beginning.
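Under the hood, the ‘Append’ step that the wizard generates in Power Query’s Advanced Editor boils down to a single Table.Combine call. A minimal sketch (the query names here are illustrative, not the ones from the customer project):

```powerquery
let
    // Concatenates the two queries row-wise, the Power Query
    // equivalent of a SQL UNION ALL; columns are matched by name
    Source = Table.Combine({LegacyLogging, NewLogging})
in
    Source
```

Because the two underlying views are identical in structure, every column matches by name and no NULL padding is needed.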
This is how the ‘Append’ function works:
When I appended the two sources, I chose the option to create ‘as New’ so that I could hide the original two queries, leaving a nice new table (as seen below) which I could rename, tidy up and make a bit more user friendly, ready for the report consumer to use.
As previously mentioned, data preparation should be done as close to the source as possible, but in situations where this is difficult or simply isn’t possible, it’s important to know about these Power BI functions and how you can use them to your advantage.
More information can be found here:
Append queries – Power Query | Microsoft Docs
Merge queries overview – Power Query | Microsoft Docs
Have you ever been faced with the SSIS error “Row yielded no match during lookup”? If so, this blog is for you!
A customer of ours recently faced the same issue in one of their SSIS packages and asked us to investigate this for them. Initial investigations on their side highlighted that when they replicated the lookup component using a standard join in T-SQL (similar to the image below) it returned the expected results.
So why was SSIS reporting an error and ultimately causing the package to fail?
To demonstrate why they faced this problem, I will provide a worked example. It should be noted that the customer example we were presented with was many thousands of rows, so the problem was much less obvious than the one I have created today.
To begin with, I have created 3 tables in my demo database using a subset of Adventure Works data.
“example.Product” which contains 10 rows and 4 columns. Essentially this could be thought of as my Product Dimension.
“example.Sales” which contains 40 rows and 4 columns. Essentially this could be thought of as my Staging Fact Table.
And finally, “example.FactSales” which contains 40 rows and 4 columns and is the result of the lookup component and ultimately would form the basis of my Sales Fact Table.
Here you will notice that “ProductNumber” has been replaced with “ProductKey”. Behind the scenes in my example SSIS package I am looking up “ProductNumber” from the “example.Product” table to return “ProductKey” in my Sales Fact Table.
*Note: I have specified for the component to fail if it finds any non-matching rows.
Below is the data flow task within my example SSIS package.
This package executes without failure and populates my “example.FactSales” table. However, I will now replicate the issue our customer in question faced.
To begin I executed the below T-SQL update statement against my “example.Product” table.
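The screenshot of that statement isn’t reproduced here, but based on the description that follows it was equivalent to this (table and column names as per the worked example):

```sql
-- Lower-case the product number for a single row
UPDATE example.Product
SET    ProductNumber = 'ar-5381'
WHERE  ProductKey = 1;
```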
This changed the “ProductNumber” from “AR-5381” to “ar-5381” where “ProductKey” was equal to 1. When performing a simple “SELECT TOP 100 * FROM example.Product” we can now see that “ProductNumber” is lower case where “ProductKey” = 1 and the rest of the Product Numbers remain unchanged.
Now when I replicate the existing SSIS package using a standard T-SQL join as before there appears to be no issue – 40 rows are still returned using an inner join.
However, if I now execute the SSIS package the error “Row yielded no match during lookup” is subsequently returned.
So why is this happening?
Quite simply, SSIS is case sensitive, accent sensitive and kana sensitive (at least in full cache mode), whereas SQL Server, by default, is not.
Now this is a fairly obvious example, but with more complex lookups, which may involve multiple columns, it becomes a little trickier to spot!
To solve this customer example, I used the EXACT function in MS Excel to spot the exact row in the lookup table which was returning the error. The EXACT function compares two text strings and returns “TRUE” if they are exactly the same and “FALSE” if they are not.
Alternatively, you could use SQL Server itself to replicate the lookup behaviour of SSIS. Taking the T-SQL script from before, you can alter the existing inner join to a left join and collate using “Latin1_General_BIN” to identify the potential root cause of the lookup failure.
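As a sketch of that diagnostic query (column names assumed from the worked example above), forcing a binary collation on the join makes SQL Server behave like the SSIS lookup, and the left join exposes the offending rows:

```sql
-- Rows where the default case-insensitive match succeeds but the
-- binary (SSIS-like) match fails will show a NULL ProductKey
SELECT s.ProductNumber,
       p.ProductKey
FROM   example.Sales s
LEFT JOIN example.Product p
       ON s.ProductNumber = p.ProductNumber COLLATE Latin1_General_BIN
WHERE  p.ProductKey IS NULL;
```

In the demo data this returns the sales rows for “AR-5381”, since the product table now holds “ar-5381”.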
This highlights the importance of cleaning and standardising your lookup columns before using an SSIS lookup data flow component. Here at Purple Frog we tend to use a scalar function to create clean derived columns which are subsequently used for lookup components.
I hope this helps you solve any “Row yielded no match during lookup” errors with your SSIS solution!
So in part 1 we found out what Workload Management was and why we wanted to use it. (If you haven’t read that part, please click here and read that post before you carry on, as we’re jumping in at the deep end!) In this part we’ll be learning how to configure it.
Continuing from the end of the previous post, you have two main options to configure this: the first is to assign a user or role to one of the existing system-defined workload groups; the second is to create a custom workload group and assign a user or role to that instead.
First let’s look at adding a user to an existing workload group. Choose the Workload Group you’d like to add to and click the number in the Classifiers column.
- Name – Give the Classifier a name (this can be anything).
- Member – Select the user/role that you’d like this to apply to. (Run “SELECT USER_NAME()” against your SQL Pool to find the actual username authenticated with. If you have sa access, this may be replaced with “sa”.)
- Request importance – This holds 5 levels that can be used to prioritise conflicting queries. (Think about how important it is for different roles within your environment to access data quickly, and who should go first when there are conflicts.)
- Label – When you run a query you can add “OPTION (LABEL = ‘Your_Label_Name’)” to the end of it, and whatever Label Name you put in will be applied to that query. If the label matches any of the classifiers then that resource class will be used. (Think about having an ‘ETL’ label, an ‘ML’ label, or something similar.)
- Start and End Time – The time that the Classifier will be active for. (You may only want to grant increased resource usage during particular times. Think about an overnight process needing a large amount of resource, compared to the same user/role only needing a reduced amount during the day due to conflicting priorities.)
- Session Context – When you make a connection to your SQL Pool, rather than having to apply a label to every query, you can instead apply a Session Context that will last for the entire connection. This is a session-bound variable rather than the usual batch-bound variable. You can create it by running “EXEC sys.sp_set_session_context @key = ‘wlm_context’, @value = ‘Your_Session_Name’” at the start of your connection. The session name you enter will then apply to this Classifier.
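Everything the portal form captures can also be done in T-SQL with CREATE WORKLOAD CLASSIFIER. A sketch (the classifier name, user and times here are illustrative):

```sql
-- Route the ETL login to xlargerc overnight, with high importance,
-- but only for queries labelled with OPTION (LABEL = 'ETL')
CREATE WORKLOAD CLASSIFIER [wcNightlyETL]
WITH (
    WORKLOAD_GROUP  = 'xlargerc',
    MEMBERSHIP_NAME = 'etl_user',
    WLM_LABEL       = 'ETL',
    START_TIME      = '22:00',
    END_TIME        = '06:00',
    IMPORTANCE      = HIGH
);
```

This is handy if you want your workload management configuration in source control rather than clicked together in the portal.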
These can all be combined a variety of ways, but each of them have specific weighting:
These add together. So if you have a user in two different classifiers, one with a label (64 + 16 = 80) and the other with a context and time (64 + 8 + 4 = 76), then the one with the label would be chosen as it has the higher total.
Secondly, creating your own custom Workload Group allows you to be more precise with the allocation of system resources. To create one of these click on the “New workload group” button. There are three suggestions and a custom option; the three suggestions apply some values as a guide, whilst the Custom option just provides default values.
After this is created, you add Classifiers exactly the same as above.
This is a very complex thing to get right, so here are my final thoughts:
- Think about system usage: volume of users vs performance of queries. You can’t assign every user 100% of resources or you’ll cause waits all over the place. How many users are querying at once? How complex are the queries?
- Create custom Workload Groups and assign Classifiers. The system-defined groups are a great starting point to get away from smallrc, but not great for flexibility.
- Assign roles to workload groups with different priorities. How important is it that the report loads quickly for an end user vs a manager vs the board?
- Assign times for specific workloads: overnight ELT, a large ML query, etc. 100% is sometimes okay… maybe… Scale up and use 100% of resources for an intensive transformation or ML query, then scale down for everyday usage.
- Test, monitor, amend, monitor, amend… it’s a constant development cycle. Listen to users, find out what is going wrong, monitor the usage stats during heavy load on the server, and change the configuration to improve things.
Additional reading can be done on the Microsoft website: Here
Any questions? Let me know in the comments below or contact me on twitter: @PurpleFrogReiss
So you’ve started using Synapse SQL Pools, you’ve scaled up to improve performance, but your queries aren’t going as fast as you’d hoped. You take a look at resource utilisation and you see that as you scale up, your total resource utilisation per query goes down. But didn’t we scale up to allow our query to use more resource and run quicker? What’s going on here? A little thing called Workload Management.
What is Workload Management? At a very high level, Workload Management allows you to control how your workload uses system resources. The main problem here is defaults!
If you navigate to the Workload Management tab of your SQL Pool and click the “User-defined workload groups : All” button you’ll be able to select the “System-defined workload groups” option.
This will allow you to see the pre-created groups and the relevant stats for what resources they’ll utilise.
Great! So what’s the problem? smallrc is the problem. smallrc is the default Workload Group used for all queries against your SQL Pool. If you look at the chart above, you’ll see that the minimum and maximum resources allocated per query in the Effective Values part of the table show 25.00. This is 25% of the total resources available. This means a single query will never be able to utilise the full performance of the SQL Pool. When we scale, this gets worse. The number in brackets will be the new value. For now we’ll focus on smallrc and xlargerc.
To test this is doing what we expect, I started with my SQL Pool at DW100, then DW500 then DW3000. I ran a single query without storing query results, under each of the xlargerc and then smallrc Workload Groups. The results below show the total utilised resources for each of these queries. As you scale up you can see the disparity in performance can be quite alarming.
Each of these queries returned a resultset over a set period of time; the larger the returned resultset, the better the performance. smallrc increased slightly with each scaling step, whereas xlargerc increased massively. The graph above, and the resulting data I received, show that scaling without changing Workload Management settings will not give you the performance you’re hoping for.
So now we’ve proven the premise and shown the results, how do we actually get this to work? You have two main options, first is to assign a user or role to one of the existing system defined workload groups, (which is what I did for the tests above), second is to create a custom workload group and assign a user or role to this instead.
Part 2 of this blog post goes into more detail on how to configure workload management and the details of the different options and is available: Here. Additional reading can be done on the Microsoft website: Here
Any questions? Let me know in the comments below or contact me on twitter: @PurpleFrogReiss
In Power BI Power Query there is an option to enable or disable whether a table is loaded into the report. The option ‘Enable load’ can be found by right clicking on the table. Typically, by default, the load is already enabled.
There is also an option ‘Include in report refresh’ which lets a user stop a table from refreshing when they refresh the entire report. This may be useful for static tables, or for large tables that take a long time to refresh, when a user wants to concentrate on how other tables are refreshing.
Once a user disables the ‘Enable load’ option, the table name turns italic which is an easy way for users to determine whether a table will be loaded or not.
After applying these changes, no data has been loaded into the report.
To re-enable the load, jump back into Power Query, right click on the table and ‘Enable load’.
Finally, some scenarios where it might be useful to disable loading a table:
– Disable loading tables in Power Query that were only ever stepping stones to create other tables
– See how removing a table affects your report before deleting it
– Removing a table that might be required again in the future
Whilst carrying out some work for a client using Azure Data Factory I was presented with the challenge of triggering different activities depending on the result of a stored procedure.
In fact the challenge posed was to…
- Execute ‘Copy A’ activity if the result of a stored procedure returned (A),
- Execute ‘Copy B’ activity if the result of a stored procedure returned (B),
- Execute ‘Copy C’ activity if the result of a stored procedure returned (C)
- Finally execute an error stored procedure if the result of the stored procedure returned (D).
Prior to the ‘Switch’ activity being available, I could achieve this using 4 ‘IF’ activities connected to a lookup activity, as shown in the snip below using my ‘Wait’ example pipeline.
However, a neater solution is to use the ‘Switch’ activity to do this work instead. I’ll now jump straight into a worked example to show you how I achieved this.
I created a demo.RandomNumberGenerator stored procedure in our Purple Frog Demo database which declares a variable and sets it equal to a random number between 1 and 4. The stored procedure then uses this variable in a case statement to return a string. So if the @randomnumber variable returns 1 the case statement returns ‘Wait 1 Minute’.
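The stored procedure itself isn’t shown in the original post, but from the description it was something along these lines (a reconstruction, not the exact demo code):

```sql
CREATE PROCEDURE demo.RandomNumberGenerator
AS
BEGIN
    -- Random integer between 1 and 4
    DECLARE @randomnumber INT = FLOOR(RAND() * 4) + 1;

    -- Map the number to the action name the ADF Switch will match on
    SELECT CASE @randomnumber
               WHEN 1 THEN 'Wait 1 Minute'
               WHEN 2 THEN 'Wait 2 Minutes'
               WHEN 3 THEN 'Wait 3 Minutes'
               WHEN 4 THEN 'Wait 4 Minutes'
           END AS ADFAction;
END
```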
I then used a lookup activity in ADF to return the result of this stored procedure which I then used to ‘feed’ my ‘Switch’ activity.
When I preview this lookup activity it just returns the result of the stored procedure. When run time and time again it returns a random ‘ADFAction’, as named in my case statement and generated by the rand() SQL function. So in the scenario above it returned an action of ‘Wait 1 Minute’. If I were to preview the stored procedure again it might return ‘Wait 4 Minutes’ instead.
I can then connect my switch activity to this lookup activity using the ‘success’ green output connector.
I now need to add some dynamic content to refer to the output of my lookup activity. I don’t need to create dynamic nested IF content to say if my stored procedure equals ‘Wait 1 Minute’ then do this, if my stored procedure equals ‘Wait 2 Minutes’ then do this… all I need to do is refer to the output of the connected lookup activity ‘Random Number Lookup Activity’.
So in my example this is simply @activity(‘Random Number Lookup Activity’).output.firstrow.adfaction. ADF will then match the output of the stored procedure to the case statements I provide it. The default activity is just a set of activities that are executed when the expression evaluation isn’t satisfied, i.e. the output of the stored procedure doesn’t match any of the case statements I provide it.
In the snip below I have provided 4 case statements which match all of the possible outputs from my stored procedure. Essentially there isn’t any need for a default activity, as my stored procedure will always return 1 of 4 results, but it’s useful to know about for future reference.
Within each case statement there is an activity for which I would like ADF to perform if it matches the result of my stored procedure. In my example these are just ‘Wait’ activities. So if the stored procedure returns ‘Wait 1 Minute’ I’ve put a ‘Wait’ activity within the case statement to wait 1 minute (or 60 seconds).
When I first ran this pipeline in debug mode I could see that the output of the stored procedure was ‘Wait 2 Minutes’. The switch activity then matched this to the case statements I provided and performed the activity within the ‘Wait 2 Minutes’ case statement and triggered my ‘Wait 2 Minutes’ activity.
When I debugged another run my stored procedure returned ‘Wait 4 Minutes’. The switch activity then executed my ‘Wait 4 Minutes’ activity. Perfect!
So it’s pretty much as simple as that!
I hope this helps all you ADF’ers out there!
I have created a Power BI report which provides detail on the state of processing in a Tabular Cube.
The report uses the cube’s dynamic management views to provide information about the cube’s partitions.
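The report’s source queries aren’t listed in the post, but the partition information comes from DMV queries along these lines (a sketch; the report may select more columns than this):

```sql
-- Partition name, processing state and last refresh time
-- for every partition in the deployed tabular model
SELECT [Name],
       [State],
       [RefreshedTime]
FROM $SYSTEM.TMSCHEMA_PARTITIONS
```

These DMV queries are run against the tabular instance as “native queries”, which is why Power BI asks you to approve them on first refresh.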
To use the tabular cube processing report, you need to insert the following information:
- Database (Cube Name)
Once entered, and assuming the connection is fine, you need to accept some native queries. These statements are SELECT statements and will not alter the cube. That should give you a report similar to the one below; I have used an Adventure Works tabular model as an example.
This report is inspired by one I found for general tabular cube documentation by Data Savvy, which I highly recommend. That report includes information on tables, columns, relationships, measures and security:
There is now a new tenant setting for Power BI admins to prevent users from creating classic workspaces in Power BI! This includes from places like MS Teams! Hurrah!
If you’re a Power BI Admin you can now enable ‘Block classic workspace creation’.
To do this go to the powerbi.com portal and click the ‘settings’ icon in the top right hand corner of the screen.
Click on ‘Admin portal’ and then locate and click ‘Tenant settings’.
Next under the ‘Workspace settings’ click ‘Block classic workspace creation’.
Now change the toggle from ‘Disabled’ to ‘Enabled’. This will now block the creation of new classic workspaces in your Power BI portal from places like MS Teams.
This will also automatically remove classic workspaces from your Power BI workspace list, but only if they were created by Teams, and only if they’ve never been accessed. If they have been accessed then they’ll be left there for you to deal with manually.
Now you’re probably thinking, well, I’ll just delete any old remaining workspaces which have been accessed but are no longer being used! Woah, hold on… If you do this you’ll end up deleting the underlying Office 365 modern group (and the associated Team!), which may still be in use.
A better way of doing this is to upgrade the workspace to the new experience and then delete it.
To do this, click on the ellipsis next to the filters button within the workspace you want to delete and click ‘Edit workspace’. Next go to the advanced section and under ‘Upgrade this workspace (preview)’ select ‘Upgrade now’.
A small window will appear in the middle of your screen.
Check the ‘I’m ready to upgrade this workspace’ option and then click ‘Upgrade’.
Now, once the workspace has been upgraded to the new experience, you can click on the ‘Settings’ icon and click ‘Delete workspace’. This will remove the workspace but leave the Office 365 modern group untouched. Simple!
As always, a big thanks to the guys at Guy in a Cube (Twitter: https://twitter.com/GuyInACube) on YouTube for sharing this knowledge!
Check out their video here: https://www.youtube.com/watch?v=T2PAL4D2SvU.
Here’s the link to the Microsoft Power BI team’s announcement on this new feature: https://powerbi.microsoft.com/en-us/blog/announcing-you-can-now-block-classic-workspace-creation/
I have recently completed my MCSE in Data Management and Analytics, and I wanted to share my experience of working towards and passing the exams that have led to me getting this MCSE. In this post I will cover some of the challenges I faced and provide some tips and advice, hopefully others following the same certification path, or other paths, will find this useful.
I am a business intelligence developer at Purple Frog with several years of experience working with data. I started off as a data analyst, then moved into reporting, and more recently have been working on ETL, data warehousing and cubes. I have been working with SQL throughout my various roles, and therefore the best place for me to start my certification path was with the exam 70-761: Querying Data with Transact-SQL. The path I’ve taken up to this point is:
MCSA: SQL 2016 Database Development
– 70-761: Querying Data with Transact-SQL
– 70-762: Developing SQL Databases
MCSE: Data Management & Analytics
– 70-767: Implementing a SQL Data Warehouse (from the Elective Exam Pool)
The learning material
Although the first exam (70-761) fitted in quite well with my SQL background (I probably knew about 75% of the material beforehand), there was still some work required to get to the stage where I felt confident of passing the exam. For me, the best and primary resource for learning the material has been the Exam Ref books, for example “Exam Ref 70-761 Querying Data with Transact-SQL”. These books are structured so that the content is split into the sections covered by the exam: each book contains a chapter for every skill measured.
The one downside to the Exam Ref books is that at times they can feel quite wordy if you’re relying on the book alone, so what I found really useful was to supplement them with videos and demos on the topics where I needed a greater understanding. In addition, practice exercises helped me to further understand the different concepts, as I was able to try what I had learnt and see where I was going wrong.
The final resource that I found useful was Microsoft Docs (https://docs.microsoft.com/en-us/), this is a really good reference point for key facts, for example I found the page on CDC really useful for my latest exam (70-767).
There are the obvious tips such as sleep early the night before, get to the exam centre with time to spare and so on, but I wanted to share some of the exam techniques I found useful while doing these exams.
My top tip is to check that you have covered, and are comfortable with, all the skills measured in the exam. The skills measured for each exam can be found in the “Browse Certifications and Exams” section on the Microsoft Learning website (example shown below for exam 70-761). The skills are also stated in the Exam Ref books; as mentioned before, the chapters in each book are based on the skills measured in the exam.
What’s useful about the skills measured shown above is that they include the weighting of questions per skill in the exam, so you can work out whether a weaker area needs extra focus because it makes up a big part of the exam.
Time shouldn’t be an issue in the exam if you’ve prepared well, however some questions are not worded in the best way and can catch you out so do take the time to read each question properly, and do keep an eye on the time remaining after every 5-10 questions.
You have the option to flag questions and review them again later (note that some questions cannot be flagged); make use of these flags for questions you are unsure of. This can be particularly useful if you’ve flagged a question and then a later question gives you a clue or reminds you of the answer. Alternatively, you should be provided with a pen and a wipeable board to make notes on, so note down the question number and topic so that you can come back to it later.
I am currently studying towards the exam 70-768: Developing SQL Data Models, this will help develop my understanding and knowledge of data modelling and working with cubes and will also help me get the certification for MCSA: SQL 2016 BI Development. With these current certifications being retired in the near future the next plan is to work towards the certification Microsoft Certified: Azure Data Engineer Associate.
I hope you have found this blog useful and that you can use some of the tips mentioned in your study plans, all the best!