

How to block the creation of a classic workspace in powerbi.com when a new MS Team is created!

There is now a new tenant setting that lets Power BI admins prevent users from creating classic workspaces in Power BI, including from places like MS Teams! Hurrah!

If you’re a Power BI Admin you can now enable ‘Block classic workspace creation’.

To do this, go to the powerbi.com portal and click the ‘Settings’ icon in the top right-hand corner of the screen.

Click on ‘Admin portal’ and then locate and click ‘Tenant settings’.

Next, under ‘Workspace settings’, click ‘Block classic workspace creation’.

Now change the toggle from ‘Disabled’ to ‘Enabled’. This will now block the creation of new classic workspaces in your Power BI portal from places like MS Teams.

This will also automatically remove classic workspaces from your Power BI workspace list, but only if they were created by Teams, and only if they’ve never been accessed. If they have been accessed then they’ll be left there for you to deal with manually.

Now you’re probably thinking, “I’ll just delete any old remaining workspaces which have been accessed but are no longer being used!” Woah, hold on… If you do this you’ll end up deleting the underlying Office 365 modern group (and the associated Team!), which may still be in use.

A better way of doing this is to upgrade the workspace to the new experience and then delete it.

To do this, click on the ellipsis next to the filters button within the workspace you want to delete and click ‘Edit workspace’. Next, go to the advanced section and under ‘Upgrade this workspace (preview)’ select ‘Upgrade now’.

A small window will appear in the middle of your screen.

Check the ‘I’m ready to upgrade this workspace’ option and then click ‘Upgrade’.

Now, once the workspace has been upgraded to the new experience you can click on the ‘Settings’ icon and click ‘Delete Workspace’. This will remove the workspace from your workspace list but leave the underlying Office 365 modern group untouched. Simple!

As always a big thanks to the guys at Guy In A Cube (Twitter: https://twitter.com/GuyInACube) on YouTube for sharing this knowledge!

Check out their video here: https://www.youtube.com/watch?v=T2PAL4D2SvU.

Here’s the link to the Microsoft Power BI team’s announcement on this new feature: https://powerbi.microsoft.com/en-us/blog/announcing-you-can-now-block-classic-workspace-creation/

My experience in obtaining an MCSE

Background

I have recently completed my MCSE in Data Management and Analytics, and I wanted to share my experience of working towards and passing the exams that led to this MCSE. In this post I will cover some of the challenges I faced and provide some tips and advice; hopefully others following the same certification path, or other paths, will find this useful.

I am a business intelligence developer at Purple Frog with several years of experience working with data. I started off as a data analyst, then moved into reporting, and more recently have been working on ETL, data warehousing and cubes. I have been working with SQL throughout my various roles, so the best place for me to start my certification path was with exam 70-761: Querying Data with Transact-SQL. The path I’ve taken up to this point is:

MCSA: SQL 2016 Database Development
– 70-761: Querying Data with Transact-SQL
– 70-762: Developing SQL Databases

MCSE: Data Management & Analytics
– 70-767: Implementing a SQL Data Warehouse (from the Elective Exam Pool)

The learning material

Although the first exam (70-761) fitted in quite well with my SQL background (I probably knew about 75% of the material beforehand), there was still some work required to get to the stage where I felt confident of passing the exam. For me, the best and primary resource for learning the material has been the Exam Ref books, for example “Exam Ref 70-761 Querying Data with Transact-SQL”. These books are structured so that the content follows the sections covered by the exam; each skill measured in the exam has its own chapter.

The one downside to the Exam Ref books is that at times they can feel quite wordy if you’re relying on the book alone, so what I found really useful was to supplement them with videos and demos on the topics where I needed a greater understanding. In addition to this, practice and doing exercises helped me to further understand the different concepts, as I was able to try what I had learnt and see where I was going wrong.

The final resource I found useful was Microsoft Docs (https://docs.microsoft.com/en-us/). It is a really good reference point for key facts; for example, I found the page on CDC really useful for my latest exam (70-767).

The exam

There are the obvious tips, such as sleeping early the night before and getting to the exam centre with time to spare, but I wanted to share some of the exam techniques I found useful while doing these exams.

My top tip is to check that you have covered, and are comfortable with, all the skills measured in the exam. The skills measured for each exam can be found in the “Browse Certifications and Exams” section on the Microsoft Learning website (example shown below for exam 70-761). The skills are also stated in the Exam Ref books and, as mentioned before, the chapters in the books are based on the skills measured in the exam.

This was taken from https://docs.microsoft.com/en-us/learn/certifications/exams/70-761

What’s useful about the skills measured list shown above is that it gives the weighting of questions per skill in the exam, so you can work out whether a weaker area needs extra focus because it makes up a big part of the exam.

Time shouldn’t be an issue in the exam if you’ve prepared well. However, some questions are not worded in the best way and can catch you out, so do take the time to read each question properly, and do keep an eye on the time remaining after every 5-10 questions.

You have the option to flag questions and review them again later (note that some questions cannot be flagged); make use of these flags for questions you are unsure of. This can be particularly useful if a later question gives you a clue or reminds you of the answer for a question you flagged earlier. Alternatively, you should be provided with a pen and a wipeable board for notes, so jot down the question number and topic so that you can come back to it later.

Next steps

I am currently studying towards exam 70-768: Developing SQL Data Models. This will help develop my understanding and knowledge of data modelling and working with cubes, and will also earn me the MCSA: SQL 2016 BI Development certification. With these current certifications being retired in the near future, the next plan is to work towards the Microsoft Certified: Azure Data Engineer Associate certification.

I hope you have found this blog useful and that you can use some of the tips mentioned in your study plans. All the best!

ADF: Where have my connections gone?

If you’ve been doing any development work in ADF this week you might have noticed that “Connections” has moved. But where has it gone?

The old location

When you click on “Connections” now you’ll receive the following message:

Management Hub?

Clicking this button takes you to the new Management Hub, the new fourth icon which goes alongside the existing three.

New Manage (Management Hub)

Within this area you now have access to your Linked Services, Integration Runtimes and Triggers. These all have the same options as before; they’ve just moved!

Moved options and New options

You now also have access to more detailed Git configuration options and the ability to change the parameterization templates for your ARM templates. The latter is only accessible when you have a Git repository set up, with the parameter file within it.

You can currently still access Triggers from the main Author window, but I would recommend getting used to finding them in the Management Hub to keep everything together.

Release details of the Management Hub from Microsoft can be found here: https://docs.microsoft.com/en-us/azure/data-factory/author-management-hub

Using ConcatenateX in Power BI to return multiple values

In this blog post we’ll take a quick look at using the ConcatenateX function to view a concatenated string of dates where the max daily sales occurred for a given month.

I came across this function whilst going through the excellent “Mastering DAX 2nd Edition Video Course” by the guys from SQLBI.com. So credit to Marco and Alberto for sharing this.

So how does it work? If we had a list of dates ranging from 01/01/2020 to 31/12/2020 and we wanted to see on which days we achieved maximum sales for each given month in the year, we could use the ConcatenateX function to return these dates in a single row per month.

As we can see in the screenshot below, the left-hand table shows that in the month of June we achieved maximum sales of 99 on both 18/06/2020 and 25/06/2020. In the table to the right we can see those two dates presented on a single row for the month of June, in the column “What were the max days?”. This column was created using the ConcatenateX function!

So let us first look at what the maximum daily sales were per month. To do this we’ll use the MAXX function to create “Max Daily Sales”. This returns the maximum daily sales achieved for each given month as a single value, so for the month of June this would be 99. The problem with this is that we are not sure which days these max sales were achieved on without drilling down into the data. Was this just one day or was it multiple days? All we can see is a figure of 99.
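The measure itself was shown as a screenshot in the original post; a minimal DAX sketch of how such a measure might be written is below (the column name Sales_2020[Date] and the measure [Total Quantity Sold] are assumed from the description, so treat this as a sketch rather than the original code):

Max Daily Sales =
MAXX (
    VALUES ( Sales_2020[Date] ),    -- one row per date visible in the current month
    [Total Quantity Sold]           -- total sales for that individual date
)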

So let us create a new measure to work out which days this figure of 99 occurred on.

The variable at point (a) returns a table with a single column which lists all of the unique dates in our Sales_2020 table.

The variable at point (b) returns the max daily sales for the given filter context, in this case the month.

The variable at point (c) uses the FILTER function to keep only the days where the max sales were achieved, by comparing the total quantity sold measure to the max daily sales variable. For example, in June we achieved max daily sales of 99 on 18/06/2020 and 25/06/2020, so the variable at point (c) would filter the ListOfDays table variable down to just 18/06/2020 and 25/06/2020.

If we just had one max sales day per month we could simply return MaxDaysOnly. However we may have multiple days per month where max sales were achieved. Hence we use the ConcatenateX function to create a string of dates.

The variable at point (d) creates a string of concatenated dates separated by the delimiter “,”, which can then be displayed against a single row in a table, as sketched below.
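Pulling the four variables together, a hedged sketch of the full measure might look like this (again, Sales_2020[Date] and the date format string are assumptions; the logic follows points (a) to (d) above):

What were the max days? =
VAR ListOfDays =                    -- (a) the unique dates in the current filter context
    VALUES ( Sales_2020[Date] )
VAR MaxDailySales =                 -- (b) the max daily sales for the month
    [Max Daily Sales]
VAR MaxDaysOnly =                   -- (c) keep only the days that hit that maximum
    FILTER ( ListOfDays, [Total Quantity Sold] = MaxDailySales )
RETURN
    -- (d) concatenate those dates into a single delimited string
    CONCATENATEX ( MaxDaysOnly, FORMAT ( Sales_2020[Date], "dd/MM/yyyy" ), ", " )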

Wrapping it all up returns us this table below, which shows us which days max sales were achieved per given month! Pretty cool eh?

Check out https://www.sqlbi.com/articles/mastering-dax-video-course-2nd-edition/ for more information regarding the online DAX course as well as https://twitter.com/marcorus and https://twitter.com/ferrarialberto on Twitter!

Creating calendar tables with DAX using Power BI

By Nick Edwards

In this blog post we’ll take a quick look at creating a self-generating calendar table using DAX.

Dates are important if we want to perform time intelligence reporting on our data, e.g. yearly sales, monthly sales, weekly sales, year-to-date sales or previous year sales.

We’ll be using the calendar function to create our date table, but there are other methods to do this such as CALENDARAUTO or GENERATESERIES.

Here is the syntax we’ll be using to generate our date table.

CALENDAR(<start_date>,<end_date>)

The CALENDAR function returns a single column called “Date” which contains a continuous series of dates from the specified start date to the specified end date. So if we specified the start date to be 01/01/2020 and the end date to be 31/12/2020, the function would generate 366 rows of distinct date data.
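For example, that 2020 calendar could be created as a calculated table like this (the table name Dates2020 is just illustrative):

Dates2020 =
CALENDAR ( DATE ( 2020, 1, 1 ), DATE ( 2020, 12, 31 ) )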

We can then use the ADDCOLUMNS function to expand our calendar table further, adding the specific columns we wish to slice our data by, e.g. year, month, quarter, week number, day…

Here is the syntax we will be using to expand our calendar table.

ADDCOLUMNS(<table>,<name>,<expression>,[<name>,<expression>]…)

Let’s use an example to further explore how these functions work in practice using some sample adventure works sales data. Here we use the “Get Data” icon to directly query our adventure works database and bring over a sample of sales data, named “AW_Sales”.

Now, to create our calendar table we need to click “New Table” in the modeling tab and enter the following function.

Note: I like to use variables in my date table just to keep the DAX looking clean and fuss free, but this isn’t necessary.

For the start date parameter we have used the FIRSTDATE() function and for the end date parameter the LASTDATE() function, so that we can extract the first and last “OrderDate” from our AW_Sales table. We could also have used the MIN() and MAX() functions to deliver the same results. As we can see, this has generated a sequential date list from 01/01/2012 to 31/12/2013 with 731 distinct rows of data in the newly created “Date” column.

Now we want to expand our calendar table, using the newly created “Date” column, with new columns which can slice and dice our data. To do this I use the ADDCOLUMNS function. Here I have added Year, Quarter, Year Month, Month Number, Month Name, Day Of Year, Day Of Month, Day Of Week and Day Name as columns.
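The original expression was shown as a screenshot; a minimal sketch combining the variables, CALENDAR and ADDCOLUMNS might look like the following (the individual column expressions are my assumptions rather than the exact original code):

Calendar =
VAR StartDate = FIRSTDATE ( AW_Sales[OrderDate] )   -- first order date in the sales table
VAR EndDate   = LASTDATE ( AW_Sales[OrderDate] )    -- last order date in the sales table
VAR BaseCalendar = CALENDAR ( StartDate, EndDate )  -- one row per date between the two
RETURN
    ADDCOLUMNS (
        BaseCalendar,
        "Year", YEAR ( [Date] ),
        "Quarter", "Q" & ROUNDUP ( MONTH ( [Date] ) / 3, 0 ),
        "Year Month", FORMAT ( [Date], "yyyy-MM" ),
        "Month Number", MONTH ( [Date] ),
        "Month Name", FORMAT ( [Date], "MMMM" ),
        "Day Of Year", INT ( [Date] - DATE ( YEAR ( [Date] ), 1, 1 ) ) + 1,
        "Day Of Month", DAY ( [Date] ),
        "Day Of Week", WEEKDAY ( [Date], 2 ),
        "Day Name", FORMAT ( [Date], "dddd" )
    )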

Once we’ve done this, we will mark our newly created table as a date table to allow Power BI to recognize date hierarchies and time intelligence functions.

We can now view our newly generated calendar table in the data view. Whenever new data enters our model from “AW_Sales”, the calendar table will expand accordingly, thanks to the LASTDATE() function used above.

Now it’s just a case of creating a one-to-many relationship between our new calendar table and our “AW_Sales” table. We’ll create a one-to-many relationship between “Calendar[Date]” and “AW_Sales[OrderDate]” as shown below.

Congratulations, we have now created a fully fledged calendar table that can slice and dice our “AW_Sales” table by any of the columns we have created in our calendar table, as shown in the example below.

Sampling data in Data Lake U-SQL for Power BI

Being able to hook Power BI directly into Azure Data Lake Storage (ADLS) is a very powerful tool (and it will be even more so when you can link to ADLS files that are in a different Azure account!! – not yet available as at January 2017). However there is a problem: Data Lake is designed to scale to petabytes of data, whereas Power BI has a 10GB limit. Yes this is compressed, so we’d expect around 100GB of raw data; however to load this you need 100GB+ of RAM available on your PC, so it’s hard to actually reach the limit with a single dataset.

There’s obviously a disconnect in scalability here. In some datasets we can just use U-SQL to aggregate the data and pre-summarise by the list of fields that we actually want to analyse, and this is fine for additive transactional data. However if we need a many-to-many link or the granular details of individual rows of data then there’s an issue: how do we get this data into Power BI?

The answer is sampling: we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, depending on how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain the accuracy of the data in your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region? You’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.

Option 2: Take a random % sample
Now we’re talking. This option will take, for example, 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?

— a) Use ROW_NUMBER() and Modulus
One option would be to include a ROW_NUMBER() windowing function in a U-SQL query that allocates each row a unique number.
ROW_NUMBER() OVER (ORDER BY id) AS rn
We then apply a modulus function to the result, and only take those rows that return 0:
WHERE rn % 100 == 0;

This filters to only 1 in every 100 rows.

This method works in T-SQL, and just as well in U-SQL.
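For reference, a minimal T-SQL sketch of the same idea looks like this (the table dbo.Names and its columns are illustrative):

-- Number every row, then keep roughly 1 in every 100 by filtering on the modulus
SELECT id, name
FROM (
    SELECT id,
           name,
           ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM dbo.Names
) AS numbered
WHERE rn % 100 = 0;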

— b) U-SQL SAMPLE

However, there is an easier way. U-SQL contains the ‘SAMPLE’ clause that automates this process. Thanks to Paul (T|B) for spotting this beauty.
SAMPLE Syntax:
SELECT xx FROM xx [SAMPLE [ANY (number of rows) | UNIFORM (percentage of rows)]]

There are two sampling options here, ANY and UNIFORM.

After not being able to find anything on the tinterwebs about them, I ran some tests to see what they did and how well these methods work compared to each other. The following code runs some U-SQL over a simple two-column csv file containing an arbitrary id and a name. The 640MB file contains 400 names, each repeated a number of times to build 40m rows. Names were repeated using a normal frequency distribution pattern to make the data more representative of real world data.

To assess the output we can look at the distribution of the sampled data to see how closely it correlates to the distribution of the original dataset.

The U-SQL code looks like this:

DECLARE @INPUT_FILE string = @"/AWPF_Demo/Data/names.csv" ;   //40601850 ROWS
DECLARE @OUTPUT_FILE string = @"/AWPF_Demo/Data/names_results.csv" ;
 
@rawdata = 
    EXTRACT 
        id int,
        name string
    FROM @INPUT_FILE 
    USING Extractors.Text();
 
//--Allocate each row a row number, and take 1 in every 100
@sequenceddata = 
    SELECT *
        , ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM @rawdata;
 
 @sampleddata1 = 
    SELECT id, name
    FROM @sequenceddata
    WHERE rn % 100 == 0;
 
 //--Use the SAMPLE ANY clause in U-SQL
 @sampleddata2 = 
    SELECT id, name
    FROM @rawdata
    SAMPLE ANY(406018); //--manually calculated as 1% of the input row count
 
 //--Use the SAMPLE UNIFORM clause in U-SQL
 @sampleddata3 = 
    SELECT id, name
    FROM @rawdata
    SAMPLE UNIFORM(0.01); //--1%
 
//--Find the name distribution of the original data, and sampled datasets
@outputbaseline = 
    SELECT name, COUNT(*) AS Qty
    FROM @rawdata
    GROUP BY name;
 
@outputdata1 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata1
    GROUP BY name;
 
@outputdata2 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata2
    GROUP BY name;
 
@outputdata3 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata3
    GROUP BY name;
 
//--Join all datasets together for analysis
@Output = 
    SELECT b.name
        , b.Qty AS QtyOrig
        , o1.Qty AS QtyMod
        , o2.Qty AS QtyANY
        , o3.Qty AS QtyUNIFORM
    FROM @outputbaseline AS b
        LEFT JOIN @outputdata1 AS o1 ON o1.name==b.name
        LEFT JOIN @outputdata2 AS o2 ON o2.name==b.name
        LEFT JOIN @outputdata3 AS o3 ON o3.name==b.name;
 
//--Output the data
OUTPUT @Output
TO @OUTPUT_FILE
ORDER BY QtyOrig DESC
USING Outputters.Text(quoting: true);

So what happens when we take the resulting data and plot the sampled distributions against each other?

  • The Blue line is the number of times each name appears in the original dataset (on the right axis).
  • The Orange line is the distribution from the ROW_NUMBER() and Modulus.
  • The Yellow line is using U-SQL’s SAMPLE UNIFORM.
  • The Grey line is using U-SQL’s SAMPLE ANY.

As you can see, the SAMPLE ANY is a terrible option to maintain data accuracy. In effect it looks like it just takes the top x rows from the file and discards the rest, which I explained earlier is a bad idea.
However the ROW_NUMBER/Mod and the SAMPLE UNIFORM approaches are both staggeringly accurate to the original, with variances of +/-2% for each name. This isn’t any good for exact numerical calculations (total sales £ for example), but for looking at trends over very large datasets this sampling approach is a good option.

So, should you use ROW_NUMBER/Mod or SAMPLE UNIFORM? Obviously SAMPLE UNIFORM is simpler code, but how do they perform compared with each other?

  • The ROW_NUMBER/Mod approach, using the above dataset used a single vertex, with a total compute time of 29s, read 640MB and wrote 5KB.
  • The SAMPLE ANY approach used two vertices, with a combined compute time of 2s, read 34MB and wrote 5KB.
  • The SAMPLE UNIFORM approach used four vertices, with a combined compute time of 26s, read 766MB and wrote 5KB.

So the SAMPLE ANY approach, although poor for data consistency, allows much faster execution by reading only a small section of the data.
The ROW_NUMBER/Mod and SAMPLE UNIFORM approaches are very comparable in terms of performance, so it wouldn’t surprise me if they were doing something similar under the hood. However out of simplicity I’d recommend the SAMPLE UNIFORM method.

Frog-Blog-Out

First thoughts on DataZen

Following the recent announcement of Microsoft acquiring DataZen, I’ve been having a play around to see what it can and can’t do. Here’s a very brief summary so far:

The Good:

  • Very quick and easy design interface
  • Fantastic way of modifying dashboard layout for tablets and mobile devices
  • It seems to ‘just work’ very well
  • You don’t have to worry about the pixel perfect layout of what’s where, it takes care of it for you

Screen shots of the same dashboard running on an iPhone 6 and in a web browser:


The Bad:

  • The designer is Windows 8 only. Really?!
  • Yes it technically connects to SSAS cubes, but the interface is quite frankly no more than smoke and mirrors.  It connects in the same way as QlikView does; it seems that you have to write an MDX query to return an entire fact table at whatever level of granularity you want to be able to filter by. DataZen then pulls back all of the data and then re-aggregates this for your dashboard. This may be ok for a small cube, but with a large measure group that needs to be sliced/diced by a number of different attributes at the same time this quickly becomes a bad idea. It is not able to use the power of SSAS, it treats it as simply another flat data source.
  • Scalability seems to be limited. For example you can provide a flat dataset containing a key and a parent key, and DataZen will turn it into a tree filter for you. Except that I tried this with a 7000 node parent child hierarchy in a tree and it just couldn’t cope. It did eventually load after numerous attempts, but it was so unresponsive that it was unusable. It seemed to work ok with a few hundred nodes.
  • You don’t have to worry about the pixel perfect layout of what’s where, it takes care of it for you. Yes this is a good point, but it also unfortunately means that there is little scope for customisation. This gets frustrating when it doesn’t ‘just work’. For example resizing a chart with a legend, in some sizes on the mobile view the legend was so big it left no space for the chart. It would be nice to be able to turn off the legend for the mobile view or something similar.
  • SSAS calculation logic is not supported. Well, this is an extension of the smoke and mirrors SSAS implementation, but it’s particularly relevant in this point. One of our existing clients is a heavy dashboard user, with data sourced from SSAS multidimensional cubes. They have a number of KPIs defined, for which the target is the actual value of the previous period. Now SSAS takes care of this beautifully. If the user selects a month then the target is the previous month’s value. If they select a week then the target is the previous week’s value, etc. However DataZen provides SSAS with no context over what is selected, and so SSAS is not able to dynamically do its magic. Therefore KPI targets cannot be dynamic, they need to be static and fixed at the point of DataZen data refresh.
  • Dates from MDX queries don’t seem to want to hook into the Time Navigator filters, and there’s no way of forcing them to. Although I’m guessing that this is a problem with how I’m doing it, so probably not fair to include it here.

Summary

I’ve not been playing with the tool for very long, so I may find ways around all of this. However at the moment I’m concerned that Microsoft are placing their dashboard/analytics future on a product that doesn’t properly support SSAS. In my mind the core strength of the MS Business Intelligence offering is underpinned by the power of cubes, backed up with SSIS and the strength of the SQL Server database platform. To buy into a dashboarding platform that doesn’t support and build on this is a cause for concern.

My hope/expectation is that Microsoft take elements from Power View (which does properly support real-time queries against a cube) and elements from ProClarity/PerformancePoint (e.g. the decomposition tree), and embed them into DataZen. In which case they could end up with an awesome product.

Yes I’m a cube guy, so I’m naturally focused on the poor SSAS integration. If you don’t use cubes then you’ll probably love it. But MS have some work to do to bring me around.

Anyway, time will tell what happens next… Let’s keep our fingers crossed.

Frog-Blog-Out

BIML – What is it?

I’ve noticed a growing trend over the last year – the ever growing presence of BIML (Business Intelligence Markup Language). So what is it? What does it do? And do you need to learn it?

What is BIML?

Simply, it’s a way of defining the functionality of an SSIS (Integration Services) package. If you’ve ever opened an SSIS .dtsx file in Notepad you’ll have seen a daunting mess of GUIDs that you really don’t want to play around with. BIML is a simple XML format that allows you to write an SSIS package in Notepad. When you run a BIML script it creates the SSIS packages for you, and these can then be opened and edited in BIDS exactly the same as an SSIS package you’d created manually.

To show the difference, first of all this is a sample BIML script:

[Image: sample BIML script]
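The screenshot isn’t reproduced here, but a minimal hand-written sketch of a BIML script along the same lines might look like this (the connection, package and table names are purely illustrative, not the original script):

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <!-- Illustrative connection; point this at your own database -->
    <Connection Name="DataWarehouse" ConnectionString="Provider=SQLNCLI11;Data Source=.;Initial Catalog=DW;Integrated Security=SSPI;" />
  </Connections>
  <Packages>
    <Package Name="Truncate dim Customer" ConstraintMode="Linear">
      <Tasks>
        <ExecuteSQL Name="Truncate dim.Customer" ConnectionName="DataWarehouse">
          <DirectInput>TRUNCATE TABLE dim.Customer;</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>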

Then, when this is compiled into an SSIS package it looks like this in the front end:

[Image: the resulting SSIS package in the designer]

But this is what you see when you open the .dtsx package in Notepad:

[Image: the .dtsx XML opened in Notepad]

The BIML script is a little easier to digest!

But Why?

But why on earth would you want to do that, when you can just use the BIDS/Visual Studio GUI? The answer is C# and automation. You can mix C# code in with the BIML XML (in a similar way to PHP or old school ASP scripts). This allows you to have a single BIML script, which can apply itself to every item in a list, or every table in a database, and automatically generate all of your SSIS packages from a single template.

Yes, this is very cool stuff.

The following screenshot is the same script as above, but configured to loop through every table in the ‘dim’ schema of a data warehouse, creating a package that truncates the relevant dim table.

The C# script is highlighted in yellow for clarity.

With this, just running the script will create multiple SSIS packages at the click of a button.
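As a hedged illustration of how the C# nuggets work (in a real script the table list would be read from the database metadata rather than hard-coded, and the ‘DataWarehouse’ connection is assumed to be defined as in the earlier sketch):

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <# // Loop over each dim table and emit one package per table
       foreach (var table in new[] { "Customer", "Product", "Date" }) { #>
    <Package Name="Truncate dim <#=table#>" ConstraintMode="Linear">
      <Tasks>
        <ExecuteSQL Name="Truncate dim.<#=table#>" ConnectionName="DataWarehouse">
          <DirectInput>TRUNCATE TABLE dim.<#=table#>;</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
    <# } #>
  </Packages>
</Biml>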

How do you create and run a script?

Firstly you need BIDS Helper. But you should have that anyway.

Create a new Integration Services project, then right-click on the project and click ‘Add New Biml File’.

[Image: the ‘Add New Biml File’ option]

This will add a BIML script file into the Miscellaneous folder of the project.

Once you’ve written a script you can test it (right-click on the script and select ‘Check Biml for Errors’), or you can run the script, generating the SSIS packages, by clicking ‘Generate SSIS Packages’.

So, do you need to learn BIML?

I have no doubt that BIML is the future of SSIS. Once you see the full power of it, you’ll never want to go back to manually coding packages again.

If you’re an SSIS pro then there’s a good chance that your next job will require BIML. Or if a potential employer doesn’t ask for it, you can certainly improve your chances of getting the job by selling it (and your skills) to them.

At Purple Frog, all of our SSIS development is now 90% automated using BIML, leaving us more time to focus on the 10% of work that needs some custom tweaking or more enhanced logic.

What if you don’t like coding?

Well in that case, check out MIST from Varigence. It’s a GUI for BIML, and a lot more besides. If you’re going to be using BIML a lot then it may well be worth the investment.

<Frog-Blog Out>

MSc Dissertation – Performance of Loading SCDs in SSIS

Well after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.

During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions. Specifically loading Type 2 SCDs using SSIS.

For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.

I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:

The charts below show the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.

The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.

Full test results and analysis in the PDF download above.

My previous few posts show how a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes, the performance is equivalent to more traditional approaches.
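For anyone who hasn’t seen those posts, a hedged sketch of the general Type 2 SCD pattern using MERGE is shown below; the dimension, staging table and column names are purely illustrative, not the code used in the dissertation. The MERGE inserts brand new members and expires changed ones, and its OUTPUT clause feeds the outer INSERT, which creates the new version of each changed member:

-- Illustrative Type 2 SCD load: dim.Customer is the dimension, stg.Customer the incoming data
INSERT INTO dim.Customer (CustomerBK, CustomerName, ValidFrom, ValidTo, IsCurrent)
SELECT CustomerBK, CustomerName, GETDATE(), NULL, 1
FROM (
    MERGE dim.Customer AS tgt
    USING stg.Customer AS src
        ON  tgt.CustomerBK = src.CustomerBK
        AND tgt.IsCurrent = 1
    -- Brand new members: insert their first version
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerBK, CustomerName, ValidFrom, ValidTo, IsCurrent)
        VALUES (src.CustomerBK, src.CustomerName, GETDATE(), NULL, 1)
    -- Changed members: expire the current row; the OUTPUT below drives the new version
    WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
        UPDATE SET tgt.IsCurrent = 0, tgt.ValidTo = GETDATE()
    OUTPUT $action AS MergeAction, src.CustomerBK, src.CustomerName
) AS changes
WHERE MergeAction = 'UPDATE';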

Hope this is of use. If you want to know a bit more without reading the full 99 pages and 23k words (who could blame you?!), then my SQLBits talk video is now online here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [when this post was written] but should be up soon. Keep checking the PASS BI Chapter website.

Frog-Blog-Out

SQLBits X Video available


The video of my talk at SQLBits X is now available on the SQLBits website here. The talk was focused on presenting the results of my MSc Business Intelligence dissertation, comparing the performance of different methods of using SSIS to load data warehouse dimensions, specifically type 2 SCDs.

The talk also covers a comparison of the performance between traditional hard disks and solid state storage systems such as Fusion IO.

I then present a method of using the T-SQL Merge statement to automate a significant part of the ETL process.

You can find the code behind the demos on various recent Frog-Blog posts, and there is more to come, so look back soon!

 

PASS BI Virtual Chapter Talk

For those that would rather hear the talk presented live, or want to ask any questions, please join me at a repeat of this talk over Live Meeting for the PASS BI Virtual Chapter on Thursday 28th June, at 5pm UK time, 12pm EDT (US). You can find the details on the PASS BI chapter website here.
