I’m delighted to have been selected to speak at the Data Platform Summit (DPS), a virtual conference from 2nd to 4th December 2020, with Data, Analytics and Machine Learning content around the clock for a global audience.
I’ll be presenting a session on “Machine Learning in Power BI”, exploring various methods of accessing Machine Learning functionality within the Power BI world, including text analytics, prediction models and more.
There’s a ton of other great sessions also lined up, sign up to reserve your place here!
There is now a new tenant setting for Power BI admins to prevent users from creating classic workspaces in Power BI! This includes from places like MS Teams! Hurrah!
If you’re a Power BI Admin you can now enable ‘Block classic workspace creation’.
To do this go to the powerbi.com portal and click the ‘settings’ icon in the top right hand corner of the screen.
Click on ‘Admin portal’ and then locate and click ‘Tenant settings’.
Next under the ‘Workspace settings’ click ‘Block classic workspace creation’.
Now change the toggle from ‘Disabled’ to ‘Enabled’. This will now block the creation of new classic workspaces in your Power BI portal from places like MS Teams.
This will also automatically remove classic workspaces from your Power BI workspace list, but only if they were created by Teams, and only if they’ve never been accessed. If they have been accessed then they’ll be left there for you to deal with manually.
Now you’re probably thinking well I’ll just delete any old remaining workspaces which have been accessed but are no longer being used! Woah hold on…If you do this you’ll end up deleting the underlying office 365 modern group (and the associated Team!) which still may be used.
A better way of doing this is to upgrade the workspace to the new experience and then delete it.
To do this, click on the ellipsis next to the filters button within the workspace you want to delete and click ‘Edit workspace’. Next go to the advanced section and under ‘Upgrade this workspace (preview)’ select ‘Upgrade now’.
A small window will appear in the middle of your screen.
Check the ‘I’m ready to upgrade this workspace’ option and then click ‘Upgrade’.
Now, once the workspace has been upgraded to the new experience you can then click on ‘Settings’ icon and click ‘Delete Workspace’. This will now leave the office 365 modern group untouched but remove it from your workspace. Simple!
As always a big thanks to the guys from at Guy In A Cube (Twitter: https://twitter.com/GuyInACube) on You Tube for sharing this knowledge!
Check out their video here: https://www.youtube.com/watch?v=T2PAL4D2SvU.
Here’s the link to the Microsoft Power BI team’s announcement on this new feature: https://powerbi.microsoft.com/en-us/blog/announcing-you-can-now-block-classic-workspace-creation/
There is a very serious limitation in the behaviour of calculation groups when using the SELECTEDMEASURE or ISSELECTEDMEASURE functions, and we recommend not using them. Why? If a user creates their own custom calculations within their Power BI report (or composite model) then the value of SELECTEDMEASURE changes, breaking your calculation group logic.
Let me explain with an example:
In a tabular cube we may look to use calculation groups to apply additional filters when calculating measures. In our example we will be using a retail scenario, with budgets as well as ‘Like for Like’ (LFL) budgets. For those not in retail, LFL represents stores that were also open this time last year. We start with the measure “Budget” with the following definition:
We also have a hidden measure called “Budget LFL” with the following definition:
The measure “Budget LFL” is a variation of the “Budget” measure with additional filters applied at source. We also have a calculation group called “LFL Filter”, this is used to return the hidden “Budget LFL” measure based on the option selected in this filter and has the following definition:
This functionality is demonstrated below where the two visuals from Power BI show the same measure with and without the LFL Filter applied:
A problem arises when you try to create a custom measure (in the cube or within a Power BI report) and filter by the calculation group, for example we create a measure called “CustomBudget” which is a copy of the “Budget” measure and has the definition:
Adding this custom measure to the visual shown earlier we can see that the calculation group “LFL Filter” has no affect on the custom measure:
This is because the SELECTEDMEASURE() is now [CustomBudget] and not [Budget], therefore the logic in the calculation group doesn’t recognise the selected measure, and therefore doesn’t switch to the LFL measure.
To get around this we move the bulk of the logic from the calculation group to the measure itself. The measure now changes its behaviour by looking at the selected value of the LFL filter, instead of the calculation group managing it centrally:
We refresh Power BI and find that the results are now as expected, the original and custom measures now match when using the calculation group as a filter:
In this blog post we’ll take a quick look at using ConcatenateX function to view a concatenated string of dates where the max daily sales occurred for a given month.
I came across this function whilst going through the excellent “Mastering DAX 2nd Edition Video Course” by the guys from SQLBI.com. So credit to Marco and Alberto for sharing this.
So how does it work? If we had a list of dates ranging from 01/01/2020 to 31/12/2020 and we wanted to see which days we achieved maximum sales for each given month in a year we could use the ConcatenateX function to return these dates in a single row per month.
As we can see in the screenshot below, the left hand table shows the month of June where we achieved maximum sales for in June on both 18/06/2020 and 25/06/2020 of 99. In the table to the right we can see those two dates presented on a single row for the month of June in the column “What were the max days?”. This was column was created using the ConcatenateX function!
So let us first look at what the maximum daily sales were per month. To do this we’ll use the MAXX function to create “Max Daily Sales”. This returns the maximum daily sales rate achieved for each given month as a single value. So for the month of June this would be 99. The problem with this is we are not sure which days these max sales were achieved on without drilling down into the data. Was this just one day or was it multiple days? All we can see is a figure of 99.
So let us create a new measure to work out on which days this figure of 99 occurred on.
The variable at point (a) returns a table with a single column which lists all of the unique dates in our Sales_2020 table.
The variable at point (b) returns the max sales for the given filter context in this case month.
The variable at point (c) uses the filter function to filter out only the days where the max sales were achieved by setting the total quantity sold measure to the max daily sales variable. For example in June we achieved max daily sales of 99 on 18/06/2020 and 25/06/2020. Therefore the variable at point (c) would filter out the ListOfDays table variable to just 18/06/2020 and 25/06/2020 only.
If we just had one max sales day per month we could simply return MaxDaysOnly. However we may have multiple days per month where max sales were achieved. Hence we use the ConcatenateX function to create a string of dates.
The variable at point (d) creates a string of concatenated dates separated by the delimiter “,” which can be used against a single row in a table.
Wrapping it all up returns us this table below, which shows us which days max sales were achieved per given month! Pretty cool eh?
Check out https://www.sqlbi.com/articles/mastering-dax-video-course-2nd-edition/ for more information regarding the online DAX course as well as https://twitter.com/marcorus and https://twitter.com/ferrarialberto on Twitter!
A common request that is raised by clients is how to sort a table in Power BI by multiple columns, in the same way you can in Excel.
For a while, there was no way (at least no easy way) to do this until the Power BI March 2020 update.
I learnt this tip from the following YouTube video:
Full credit to Dhruvin Shah, check his video out.
Below I have a Power BI table displaying fruit sales, currently unsorted.
To sort the table by Fruit click on the column header Fruit.
The table is now sorted by Fruit in alphabetical order.
To add a secondary sort on Sales, hold the Shift key and click on the column header Sales.
The table is now sorted by
– Fruit name in alphabetical order
– Sales in descending order
Some extras to note:
– There is no limit on the number of columns that can be used to sort a table. Just hold the shift key and keep choosing columns.
– This feature is not available for matrices.
– To switch the sorting from ascending to descending or vice-versa continue to hold shift and click on the column header again.
When querying data from Azure Databricks (spark) into Power BI you may encounter an error: “ODBC:ERROR [HY000] [Microsoft][Hardy] (100) The host and port specified for the connection do not seem to belong to a Spark server. Please check your configuration.“
This is usually caused by trying to connect to a ‘Standard’ Databricks instance, but Power BI (and ODBC in general) can only connect to Databricks using a ‘Premium’ pricing tier. You will need to upgrade this to be able to access Databricks/Spark from Power BI.
Another common and similar error when Power BI is querying data from Azure Databricks (spark) is: “ODBC:ERROR [HY000] [Microsoft][DriverSupport] (1170) Unexpected response received from the server. Please ensure the server host and port specified for the connection are correct.“
The most likely cause of this error is an invalid server path, as you have to modify the path that Databricks gives you before using it.
In Databricks, open the cluster, and in Advanced Options click on the JDBC/ODBC tab, and copy the JDBC URL. It will look something like this:
All of the sections in red need removing or changing, so it ends up like this:
The username needs to be “token”, and the password needs to be a token, generated from the Databricks portal. Click on the user/silhouette icon on the top right of the screen, User Settings, Generate New Token.
Also when using Databricks, watch out for the type of connection you choose in Power BI, it should be ‘Spark’ not ‘Azure HDInsight Spark’.
Hope this helps!
By Nick Edwards
In this blog post we’ll take a quick look at creating a self-generating calendar table using DAX.
Dates are important if we want to perform time intelligence reporting on our data i.e. yearly sales, monthly sales, weekly sales, year to date sales or previous year sales.
We’ll be using the calendar function to create our date table, but there are other methods to do this such as CALENDARAUTO or GENERATESERIES.
Here is the syntax we’ll be using to generate our date table.
The calendar function returns a single column called “Date” which generates a continuous series of dates from the specified start date to the specified end date. So if we specified the start date to be the 01/01/2020 and the end date to be 31/12/2020 the function would generate 366 rows of distinct date data.
We can then use the add columns function to expand our calendar table further to add specific columns we wish to slice our data by i.e. year, month, quarter, week number, day…
Here is the syntax we will be using to expand our calender table.
Let’s use an example to further explore how these functions work in practice using some sample adventure works sales data. Here we use the “Get Data” icon to directly query our adventure works database and bring over a sample of sales data, named “AW_Sales”.
Now, to create our calendar table we need to click “New Table” in the modeling tab and enter the following function.
Note: I like to use variables in my date table just to keep the DAX looking clean and fuss free, but this isn’t necessary.
For the start date parameter we have used the FIRSTDATE() and for the end Date parameter we have used the LASTDATE() function. This is so we can extract the first and last “OrderDate” from our AW_Sales table. We could have also used the MIN() and MAX() function to deliver the same results using the newly created “Date” column. As we can see this has generated a sequential date list from the 01/01/2012 to the 31/12/2013 with 731 distinct rows of data.
Now we want to expand our calendar table using the newly created “Date” column with new columns which can slice and dice our data. To do this I use the ADDCOLUMNS function. Here I have added Year, Quarter, Year Month, Month Number, Month Name, Day Of Year, Day Of Month, Day Of Week and Day Name as columns.
Once we’ve done this, we will mark our newly created table as a date table to allow Power BI to recognize date hierarchies and time intelligence functions.
We can now view our newly generated calender table in the data view. Instantly when any new data enters our model from “AW_Sales”, the calendar table will expand accordingly due to the last date function used above.
Now it’s just a case of creating a one to many relationship between our new calendar table and our “AW_Sales” table. We’ll create a one to many relationship between “Calendar[Date]” and “AW_Sales[OrderDate]” as shown below.
Congratulations we have now created a fully fledged calendar table that can slice and dice our “AW_Sales” tale by any of the columns we have created in our calendar table, as shown in the example below.
First question, why bother with layouts?
Using layouts in Power BI allows a user to make their visuals stand out better, the page looks professional and more appealing to its audience.
Second question, why PowerPoint?
The default page size in Power BI desktop is 16:9, (this trick doesn’t work for other Power BI page sizes), which is identical to a PowerPoint slide.
Therefore whatever is designed in PowerPoint will fit onto a Power BI page perfectly. Also PowerPoint is very easy to use; most people are familiar with it.
You’ll need a Power BI dashboard that you want to add a layout to. I’ll be using a dashboard displaying data on the Premier League top 50 goals scorers.
To start the creation of a layout, open PowerPoint with a blank side and add a rectangle.
Next, add glow to the rectangle, glow is found in the drawing ribbon under shape effects.
Repeat the process of adding rectangles with glow and lay them out so the visuals will fit.
If there is an important visual that needs to be a certain size you can find the size of the visual in Power BI in pixels and tailor the rectangle size to your visual.
Unfortunately PowerPoint displays sizes in cm so the numbers will need converting.
I use the following – https://www.unitconverters.net/typography/pixel-x-to-centimeter.htm
Once happy with the layout, save the slide as a PNG file. Upload this as the background in Power BI by going to format then page background, then selecting your created PNG file.
There you go!! A Power BI dashboard with its own custom layout.
Being able to hook Power BI directly into Azure Data Lake Storage (ADLS) is a very powerful tool (and it will be even more so when you can link to ADLS files that are in a different Azure account!! – not yet available as at January 2017). However there is a problem, Data Lake is designed to scale to petabytes of data whereas Power BI has a 10GB limit. Yes this is compressed, so we’d expect around 100GB of raw data, however to load this you need 100GB+ of RAM available on your PC, so it’s hard to actually reach the limit with a single dataset.
There’s obviously a disconnect in scalability here. In some datasets we can just use U-SQL to aggregate the data and pre-summarise by the list of fields that we actually want to analyse, and this is fine for additive transactional data. However if we need a many to many link or the granular details of individual rows of data then there’s an issue, how to we get this data into Power BI?
The answer is sampling, we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, it depends how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain a level of accuracy of data in your reports.
Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region, you’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.
Option 2: Take a random % sample
Now we’re talking. This option will take, for example 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?
— a) Use ROW_NUMBER() and Modulus
One option would be to include a ROW_NUMBER() windowing function in a U-SQL query that allocates each row a unique number.
ROW_NUMBER() OVER (ORDER BY id) AS rn
We then apply a modulus function to the result, and only take those rows that return a 0
WHERE rn % 100 == 0;
This filters to only 1 in every 100 rows.
This method works in T-SQL, and just as well in U-SQL.
— b) U-SQL SAMPLE
However, there is an easier way. U-SQL contains the ‘SAMPLE’ clause that automates this process. Thanks to Paul (T|B) for spotting this beauty.
SELECT xx FROM xx [SAMPLE [ANY (number of rows) | UNIFORM (percentage of rows)]]
There are two sampling options here, ANY and UNIFORM.
After not being able to find anything on the tinterwebs about them I ran some tests to see what they did and how well do these methods work compared to each other. The following code runs some U-SQL over a simple two column csv file containing an arbitrary id and a name. The 640MB file contains 400 names, each repeated a number of times to build 40m rows. Names were repeated using a normal frequency distribution pattern to make the data more representative of real world data.
To assess the output we can look at the distribution of the sampled data to see how closely it correlates to the distribution of the original dataset.
The U-SQL code looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
DECLARE @INPUT_FILE string = @"/AWPF_Demo/Data/names.csv" ; //40601850 ROWS DECLARE @OUTPUT_FILE string = @"/AWPF_Demo/Data/names_results.csv" ; @rawdata = EXTRACT id INT, name string FROM @INPUT_FILE USING Extractors.Text(); //--Allocate each row a row number, and take 1 in every 100 @sequenceddata = SELECT * , ROW_NUMBER() OVER (ORDER BY id) AS rn FROM @rawdata; @sampleddata1 = SELECT id, name FROM @sequenceddata WHERE rn % 100 == 0; //--Use the SAMPLE ANY clause in U-SQL @sampleddata2 = SELECT id, name FROM @rawdata SAMPLE ANY(406018); //--manually calculated as 1% of the input row count //--Use the SAMPLE UNIFORM clause in U-SQL @sampleddata3 = SELECT id, name FROM @rawdata SAMPLE UNIFORM(0.01); //--1% //--Find the name distribution of the original data, and sampled datasets @outputbaseline = SELECT name, COUNT(*) AS Qty FROM @rawdata GROUP BY name; @outputdata1 = SELECT name, COUNT(*) AS Qty FROM @sampleddata1 GROUP BY name; @outputdata2 = SELECT name, COUNT(*) AS Qty FROM @sampleddata2 GROUP BY name; @outputdata3 = SELECT name, COUNT(*) AS Qty FROM @sampleddata3 GROUP BY name; //--Join all datasets together for analysis @Output = SELECT b.name , b.Qty AS QtyOrig , o1.Qty AS QtyMod , o2.Qty AS QtyANY , o3.Qty AS QtyUNIFORM FROM @outputbaseline AS b LEFT JOIN @outputdata1 AS o1 ON o1.name==b.name LEFT JOIN @outputdata2 AS o2 ON o2.name==b.name LEFT JOIN @outputdata3 AS o3 ON o3.name==b.name; //--Output the data OUTPUT @Output TO @OUTPUT_FILE ORDER BY QtyOrig DESC USING Outputters.Text(quoting:TRUE);
So what happens when we take the resulting data and plot the sampled distributions against each other?
- The Blue line is the number of times each name appears in the original dataset (on the right axis).
- The Orange line is the distribution from the ROW_NUMBER() and Modulus.
- The Yellow line is using U-SQL’s SAMPLE UNIFORM.
- The Grey line is using U-SQL’s SAMPLE ANY.
As you can see, the SAMPLE ANY is a terrible option to maintain data accuracy. In effect it looks like it just takes the top x rows from the file and discards the rest, which I explained earlier is a bad idea.
However the ROW_NUMBER/Mod and the SAMPLE UNIFORM approaches are both staggeringly accurate to the original, with variances +/-2% for each name. This isn’t any good for exact numerical calculations (total sales £ for example), but for looking at trends over very large datasets this sampling approach is a good option.
So, should you use ROW_NUMBER/Mod or SAMPLE UNIFORM? Obviously SAMPLE UNIORM is simpler code, but how do they perform compared with each other?
- The ROW_NUMBER/Mod approach, using the above dataset used a single vertex, with a total compute time of 29s, read 640MB and wrote 5KB.
- The SAMPLE ANY approach used two vertices, with a combined compute time of 2s, read 34MB and wrote 5KB.
- The SAMPLE UNIFORM approach used four vertices, with a combined compute time of 26s, read 766MB and wrote 5KB.
So the SAMPLE ANY, although poor for data consistency allows a much faster execution by only reading a small section of the data.
The ROW_NUMBER/Mod and SAMPLE UNIFORM approaches are very comparable in terms of performance, so it wouldn’t surprise me if they were doing something similar under the hood. However out of simplicity I’d recommend the SAMPLE UNIFORM method.