Being able to hook Power BI directly into Azure Data Lake Storage (ADLS) is very powerful (and it will be even more so when you can link to ADLS files that are in a different Azure account, which is not yet available as at January 2017). However, there is a problem: Data Lake is designed to scale to petabytes of data, whereas Power BI has a 10GB limit. Yes, this is compressed, so we’d expect around 100GB of raw data; however, to load this you need 100GB+ of RAM available on your PC, so it’s hard to actually reach the limit with a single dataset.
There’s obviously a disconnect in scalability here. In some datasets we can just use U-SQL to aggregate the data and pre-summarise by the list of fields that we actually want to analyse, and this is fine for additive transactional data. However, if we need a many-to-many link or the granular details of individual rows of data then there’s an issue: how do we get this data into Power BI?
The answer is sampling. We don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, depending on how much you need to reduce your dataset. It is, however, critical to know how to sample data correctly in order to maintain a level of accuracy in your reports.
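As a rough guide to picking the percentage, the arithmetic is just the size you can load divided by the size you have. The sketch below is Python rather than U-SQL, and the sizes in it are made-up numbers for illustration, not measurements.

```python
def sampling_fraction(raw_gb: float, budget_gb: float) -> float:
    """Return the fraction of rows to keep so the sample fits the budget."""
    if raw_gb <= 0 or budget_gb <= 0:
        raise ValueError("sizes must be positive")
    # Never sample more than 100% of the data.
    return min(1.0, budget_gb / raw_gb)


# Hypothetical example: 5,000GB of raw data, and room to load 50GB of it.
fraction = sampling_fraction(raw_gb=5000, budget_gb=50)
print(f"keep {fraction:.2%} of rows, i.e. roughly 1 row in {round(1 / fraction)}")
```

This assumes every row is about the same size, which is rarely exactly true, so treat the result as a starting point.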
Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region? You’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.
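To see the problem in miniature, here is a small Python sketch (not U-SQL) using an invented dataset of three products that arrives pre-sorted by product name. The product names and row counts are assumptions made purely for the demonstration.

```python
from collections import Counter

# Hypothetical source data: 3 products, 1,000 rows each, pre-sorted by product.
rows = [(product, i) for product in ("apple", "banana", "cherry") for i in range(1000)]

# "Top x rows": simply the first 300 rows of the file.
top_rows = rows[:300]

full_counts = Counter(product for product, _ in rows)
top_counts = Counter(product for product, _ in top_rows)

print(dict(full_counts))  # all three products are present in the source
print(dict(top_counts))   # only the first product makes it into the sample
```

Every row in the "top x" sample belongs to the first product, so any chart built on it would suggest the other two don’t exist.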
Option 2: Take a random % sample
Now we’re talking. This option will take, for example 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?
— a) Use ROW_NUMBER() and Modulus
One option would be to include a ROW_NUMBER() windowing function in a U-SQL query that allocates each row a unique number.
ROW_NUMBER() OVER (ORDER BY id) AS rn
We then apply a modulus function to the result, and only keep those rows that return 0:
WHERE rn % 100 == 0;
This filters to only 1 in every 100 rows.
This method works in T-SQL, and just as well in U-SQL.
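The same row-number-and-modulus idea can be sketched in a few lines of Python, which makes it easy to check what it does to a sorted dataset. The names and frequencies below are invented for the example.

```python
from collections import Counter


def sample_every_nth(rows, n):
    """Keep rows whose 1-based row number is an exact multiple of n."""
    return [row for rn, row in enumerate(rows, start=1) if rn % n == 0]


# Hypothetical data, pre-sorted by name: 3 names with different frequencies.
rows = ["alice"] * 5000 + ["bob"] * 3000 + ["carol"] * 2000

sample = sample_every_nth(rows, 100)
counts = Counter(sample)
print(counts)  # Counter({'alice': 50, 'bob': 30, 'carol': 20})
```

Even though the input is sorted, each name keeps its 50/30/20 share, because the sample is spread evenly through the whole file.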
— b) U-SQL SAMPLE
However, there is an easier way. U-SQL contains the ‘SAMPLE’ clause that automates this process. Thanks to Paul for spotting this beauty.
SELECT xx FROM xx [SAMPLE [ANY (number of rows) | UNIFORM (fraction of rows, e.g. 0.01 for 1%)]]
There are two sampling options here, ANY and UNIFORM.
After not being able to find anything on the tinterwebs about them, I ran some tests to see what they did and how well these methods work compared to each other. The following code runs some U-SQL over a simple two column csv file containing an arbitrary id and a name. The 640MB file contains 400 names, each repeated a number of times to build 40m rows. Names were repeated using a normal frequency distribution pattern to make the data more representative of real world data.
To assess the output we can look at the distribution of the sampled data to see how closely it correlates to the distribution of the original dataset.
The U-SQL code looks like this:
DECLARE @INPUT_FILE string = @"/AWPF_Demo/Data/names.csv"; //40601850 ROWS
DECLARE @OUTPUT_FILE string = @"/AWPF_Demo/Data/names_results.csv";

@rawdata =
    EXTRACT id int,
            name string
    FROM @INPUT_FILE
    USING Extractors.Text();

//--Allocate each row a row number, and take 1 in every 100
@sequenceddata =
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM @rawdata;

@sampleddata1 =
    SELECT id, name
    FROM @sequenceddata
    WHERE rn % 100 == 0;

//--Use the SAMPLE ANY clause in U-SQL
@sampleddata2 =
    SELECT id, name
    FROM @rawdata
    SAMPLE ANY(406018); //--manually calculated as 1% of the input row count

//--Use the SAMPLE UNIFORM clause in U-SQL
@sampleddata3 =
    SELECT id, name
    FROM @rawdata
    SAMPLE UNIFORM(0.01); //--1%

//--Find the name distribution of the original data, and sampled datasets
@outputbaseline =
    SELECT name, COUNT(*) AS Qty
    FROM @rawdata
    GROUP BY name;

@outputdata1 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata1
    GROUP BY name;

@outputdata2 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata2
    GROUP BY name;

@outputdata3 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata3
    GROUP BY name;

//--Join all datasets together for analysis
@Output =
    SELECT b.name,
           b.Qty AS QtyOrig,
           o1.Qty AS QtyMod,
           o2.Qty AS QtyANY,
           o3.Qty AS QtyUNIFORM
    FROM @outputbaseline AS b
         LEFT JOIN @outputdata1 AS o1 ON o1.name == b.name
         LEFT JOIN @outputdata2 AS o2 ON o2.name == b.name
         LEFT JOIN @outputdata3 AS o3 ON o3.name == b.name;

//--Output the data
OUTPUT @Output
TO @OUTPUT_FILE
ORDER BY QtyOrig DESC
USING Outputters.Text(quoting: true);
So what happens when we take the resulting data and plot the sampled distributions against each other?
- The Blue line is the number of times each name appears in the original dataset (on the right axis).
- The Orange line is the distribution from the ROW_NUMBER() and Modulus.
- The Yellow line is using U-SQL’s SAMPLE UNIFORM.
- The Grey line is using U-SQL’s SAMPLE ANY.
As you can see, the SAMPLE ANY is a terrible option to maintain data accuracy. In effect it looks like it just takes the top x rows from the file and discards the rest, which I explained earlier is a bad idea.
However the ROW_NUMBER/Mod and the SAMPLE UNIFORM approaches are both staggeringly accurate to the original, with variances +/-2% for each name. This isn’t any good for exact numerical calculations (total sales £ for example), but for looking at trends over very large datasets this sampling approach is a good option.
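If you want to put a number on "how close", one simple measure is the per-name error after scaling the sample counts back up. The Python sketch below assumes that uniform sampling means keeping each row independently with the same probability; how U-SQL implements SAMPLE UNIFORM internally isn’t something I can confirm, and the names and counts are invented.

```python
import random
from collections import Counter


def uniform_sample(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def relative_error(original, sample, fraction):
    """Per-name error of the scaled-up sample count against the true count."""
    full = Counter(original)
    part = Counter(sample)
    return {name: (part[name] / fraction - qty) / qty for name, qty in full.items()}


# Hypothetical dataset: 100,000 rows across three names.
rows = ["alice"] * 50000 + ["bob"] * 30000 + ["carol"] * 20000
sample = uniform_sample(rows, 0.1)
errors = relative_error(rows, sample, 0.1)

for name, err in errors.items():
    print(f"{name}: {err:+.1%}")
```

The rarer a name is, the fewer sampled rows it has and the larger its error tends to be, which is worth remembering before you slice a sampled dataset too finely.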
So, should you use ROW_NUMBER/Mod or SAMPLE UNIFORM? Obviously SAMPLE UNIFORM is simpler code, but how do they perform compared with each other?
- The ROW_NUMBER/Mod approach, using the above dataset used a single vertex, with a total compute time of 29s, read 640MB and wrote 5KB.
- The SAMPLE ANY approach used two vertices, with a combined compute time of 2s, read 34MB and wrote 5KB.
- The SAMPLE UNIFORM approach used four vertices, with a combined compute time of 26s, read 766MB and wrote 5KB.
So the SAMPLE ANY, although poor for data consistency, allows a much faster execution by only reading a small section of the data.
The ROW_NUMBER/Mod and SAMPLE UNIFORM approaches are very comparable in terms of performance, so it wouldn’t surprise me if they were doing something similar under the hood. However, for simplicity I’d recommend the SAMPLE UNIFORM method.
Following the recent announcement of Microsoft acquiring DataZen, I’ve been having a play around to see what it can and can’t do. Here’s a very brief summary so far:
- Very quick and easy design interface
- Fantastic way of modifying dashboard layout for tablets and mobile devices
- It seems to ‘just work’ very well
- You don’t have to worry about the pixel perfect layout of what’s where, it takes care of it for you
Screen shots of the same dashboard running on an iPhone 6 and in a web browser:
- The designer is Windows 8 only. Really?!
- Yes it technically connects to SSAS cubes, but the interface is quite frankly no more than smoke and mirrors. It connects in the same way as QlikView does; it seems that you have to write an MDX query to return an entire fact table at whatever level of granularity you want to be able to filter by. DataZen then pulls back all of the data and then re-aggregates this for your dashboard. This may be ok for a small cube, but with a large measure group that needs to be sliced/diced by a number of different attributes at the same time this quickly becomes a bad idea. It is not able to use the power of SSAS, it treats it as simply another flat data source.
- Scalability seems to be limited. For example you can provide a flat dataset containing a key and a parent key, and DataZen will turn it into a tree filter for you. Except that I tried this with a 7000 node parent child hierarchy in a tree and it just couldn’t cope. It did eventually load after numerous attempts, but it was so unresponsive that it was unusable. It seemed to work ok with a few hundred nodes.
- You don’t have to worry about the pixel perfect layout of what’s where, it takes care of it for you. Yes this is a good point, but it also unfortunately means that there is little scope for customisation. This gets frustrating when it doesn’t ‘just work’. For example resizing a chart with a legend, in some sizes on the mobile view the legend was so big it left no space for the chart. It would be nice to be able to turn off the legend for the mobile view or something similar.
- SSAS calculation logic is not supported. Well, this is an extension of the smoke and mirrors SSAS implementation, but it’s particularly relevant in this point. One of our existing clients is a heavy dashboard user, with data sourced from SSAS multidimensional cubes. They have a number of KPIs defined, for which the target is the actual value of the previous period. Now SSAS takes care of this beautifully. If the user selects a month then the target is the previous month’s value. If they select a week then the target is the previous week’s value, etc. However DataZen provides SSAS with no context over what is selected, and so SSAS is not able to dynamically do its magic. Therefore KPI targets cannot be dynamic, they need to be static and fixed at the point of DataZen data refresh.
- Dates from MDX queries don’t seem to want to hook into the Time Navigator filters, and there’s no way of forcing them to. Although I’m guessing that this is a problem with how I’m doing it, so probably not fair to include it here.
I’ve not been playing with the tool for very long, so I may find ways around all of this. However at the moment I’m concerned that Microsoft are placing their dashboard/analytics future on a product that doesn’t properly support SSAS. In my mind the core strengths of the MS Business Intelligence offering are underpinned by the power of cubes, backed up with SSIS and the strength of the SQL Server database platform. To buy into a dashboarding platform that doesn’t support and build on this is a cause for concern.
My hope/expectation is that Microsoft take elements from PowerView (that does properly support realtime queries against a cube) and elements from ProClarity/PerformancePoint (e.g. the decomposition tree, etc.), and embed them into DataZen. In which case they could end up with an awesome product.
Yes I’m a cube guy, so I’m naturally focused on the poor SSAS integration. If you don’t use cubes then you’ll probably love it. But MS have some work to do to bring me around.
Anyway, time will tell what happens next… Let’s keep our fingers crossed.
I’ve noticed a growing trend over the last year – the ever growing presence of BIML (Business Intelligence Markup Language). So what is it? What does it do? And do you need to learn it?
What is BIML?
Simply, it’s a way of defining the functionality of an SSIS (Integration Services) package. If you’ve ever opened an SSIS .dtsx file in notepad you’ll see a daunting mess of GUIDs that you really don’t want to play around with. BIML is a simple XML format that allows you to write an SSIS package in notepad. When you run a BIML script it creates SSIS packages for you. These can then be opened and edited in BIDS exactly the same as an SSIS package that you’d created manually.
To show the difference, first of all this is a sample BIML script:
Then, when this is compiled into an SSIS package it looks like this in the front end:
But this when you open the .dtsx package in notepad:
The BIML script is a little easier to digest!
But why on earth would you want to do that, when you can just use the BIDS/Visual Studio GUI? The answer is C# and automation. You can mix C# code in with the BIML XML (in a similar way to PHP or old school ASP scripts). This allows you to have a single BIML script, which can apply itself to every item in a list, or every table in a database, and automatically generate all of your SSIS packages from a single template.
Yes, this is very cool stuff.
The following screenshot is the same script as above, but configured to loop through every table in the ‘dim’ schema of a data warehouse, creating a package that truncates the relevant dim table.
The C# script is highlighted in yellow for clarity.
With this, just running the script will create multiple SSIS packages at the click of a button.
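To give a feel for the "one template, many packages" idea without the screenshots, here is a rough sketch in Python. Real BimlScript embeds C# inside the XML, so Python is only standing in for the looping logic here. The table list and the connection name "DW" are made up, and the connection is assumed to be defined elsewhere in the project.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

BIML_NS = "http://schemas.varigence.com/biml.xsd"


def truncate_package(schema, table, connection="DW"):
    """Return a Biml <Package> element that truncates one table."""
    name = quoteattr(f"Truncate_{schema}_{table}")
    conn = quoteattr(connection)  # hypothetical connection name
    sql = escape(f"TRUNCATE TABLE [{schema}].[{table}];")
    return (
        f'    <Package Name={name} ConstraintMode="Linear">\n'
        f'      <Tasks>\n'
        f'        <ExecuteSQL Name="Truncate" ConnectionName={conn}>\n'
        f'          <DirectInput>{sql}</DirectInput>\n'
        f'        </ExecuteSQL>\n'
        f'      </Tasks>\n'
        f'    </Package>\n'
    )


def build_biml(schema, tables):
    """Apply the same package template to every table in the list."""
    packages = "".join(truncate_package(schema, t) for t in tables)
    return f'<Biml xmlns="{BIML_NS}">\n  <Packages>\n{packages}  </Packages>\n</Biml>\n'


# Hypothetical list of dimension tables; a real script would query the database.
dim_tables = ["Customer", "Product", "Date"]
biml = build_biml("dim", dim_tables)

ET.fromstring(biml)  # raises ParseError if the generated XML is not well-formed
print(biml)
```

Add a table to the list and you get another package for free, which is the whole appeal.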
How do you create and run a script?
Firstly you need BIDS Helper. But you should have that anyway.
Create a new Integration Services project, then right click on the project and click ‘Add New Biml File’
This will add a BIML script file into the Miscellaneous folder of the project.
Once you’ve written a script you can test it (right click on the script and select ‘Check Biml for Errors’), or you can run the script, generating the SSIS packages, by clicking ‘Generate SSIS Packages’.
So, do you need to learn BIML?
I have no doubt that BIML is the future of SSIS. Once you see the full power of it then you’ll never want to go back to manually coding packages again.
If you’re an SSIS pro then there’s a good chance that your next job will require BIML. Or if a potential employer doesn’t ask for it, you can certainly improve your chances of getting the job by selling it (and your skills) to them.
At Purple Frog, all of our SSIS development is now 90% automated using BIML, leaving us more time to focus on the 10% of work that needs some custom tweaking or more enhanced logic.
What if you don’t like coding?
Well in that case, check out MIST from Varigence. It’s a GUI for BIML, and a lot more besides. If you’re going to be using BIML a lot then it may well be worth the investment.
Well after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.
During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions. Specifically loading Type 2 SCDs using SSIS.
For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.
I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:
- PDF: Performance comparison of techniques to load Type 2 slowly changing dimensions in a Kimball style data warehouse using SSIS
The charts below show the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.
The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.
Full test results and analysis in the PDF download above.
My previous few posts show how using a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes the performance is equivalent to more traditional approaches.
Hope this is of use. If you want to know a bit more without reading the full 99 pages & 23k words (who could blame you?!), then my SQLBits talk video is now on-line here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [When this post was written] but should be up soon. Keep checking on the BI PASS Chapter website.
SQLBits X Video Now available
The video of my talk at SQLBits X is now available on the SQLBits website here. The talk was focused on presenting the results of my MSc Business Intelligence dissertation, comparing the performance of different methods of using SSIS to load data warehouse dimensions, specifically type 2 SCDs.
The talk also covers a comparison of the performance between traditional hard disks and solid state storage systems such as Fusion IO.
I then present a method of using the T-SQL Merge statement to automate a significant part of the ETL process.
You can find the code behind the demos on various recent Frog-Blog posts, and there is more to come, so look back soon!
- Introduction to T-SQL Merge Basics
- Using T-SQL Merge to Load SCD Dimensions
- Automating T-SQL Merge to Load SCD Dimensions
PASS BI Virtual Chapter Talk
For those that would rather hear the talk presented live, or want to ask any questions, please join me at a repeat of this talk over Live Meeting for the PASS BI Virtual Chapter on Thursday 28th June, at 5pm UK time, 12pm EDT (US). You can find the details on the PASS BI chapter website here
This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.
- Post 1: Introduction to T-SQL merge basics
- Post 2: Using T-SQL merge to load data warehouse dimensions
In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk, this post is for your benefit and is a result of the feedback and requests from you guys.
Why automate merge?
As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs, slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc, which can save a considerable amount of time compared with writing the equivalent functionality in SSIS. In the next installment of this series I’ll be discussing the performance of it compared to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it’s released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.
If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.
The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences on your dimension data.
The easiest way to avoid this complexity and simplify the process is to not write merge statements, but let an automated procedure do it for you. Simples!
The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.
Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.
OK, so how do we do this?
First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.
CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO

CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255) NOT NULL,
    [FirstName]     [varchar](50) NOT NULL,
    [LastName]      [varchar](50) NOT NULL,
    [DoB]           [date] NOT NULL,
    [Sex]           [char](1) NOT NULL,
    [MaritalStatus] [varchar](10) NOT NULL,
    [FirstCreated]  [date] NOT NULL,
    [IsRowCurrent]  [bit] NOT NULL,
    [ValidFrom]     [datetime] NOT NULL,
    [ValidTo]       [datetime] NOT NULL,
    [LastUpdated]   [datetime] NOT NULL,
    CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED ( [CustomerKey] ASC )
)
GO

CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255) NOT NULL,
    [FirstName]     [varchar](50) NOT NULL,
    [LastName]      [varchar](50) NOT NULL,
    [DoB]           [date] NOT NULL,
    [Sex]           [char](1) NOT NULL,
    [MaritalStatus] [varchar](10) NOT NULL,
    [FirstCreated]  [date] NOT NULL
)
So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.
We then need to store the details of each field, i.e. how should each field be interpreted: is it a primary key, business key, type 0 or 1, or an audit field? We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information, however I prefer to use the extended properties of the fields.
EXEC sys.sp_addextendedproperty @level2name=N'CustomerKey', @value=N'PK', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Email', @value=N'BK', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstName', @value=N'1', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastName', @value=N'1', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'DoB', @value=N'1', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Sex', @value=N'1', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'MaritalStatus', @value=N'1', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstCreated', @value=N'1', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidFrom', @value=N'Audit', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidTo', @value=N'Audit', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'IsRowCurrent', @value=N'Audit', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastUpdated', @value=N'Audit', @name=N'SCD',
    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
Or you can obviously just enter the extended properties manually using SSMS.
The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.
The first stage is to examine the structure of merge.
MERGE [DIMENSION TABLE] as Target
USING [STAGING TABLE] as Source
ON [LIST OF BUSINESS KEY FIELDS]
WHEN MATCHED AND
    Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
THEN UPDATE SET
    [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
WHEN NOT MATCHED THEN INSERT
    [LIST OF ALL FIELDS]
VALUES
    Source.[LIST OF ALL FIELDS]
The keywords (MERGE, USING, ON, WHEN MATCHED and so on) are the skeleton of the statement, and the placeholders in square brackets are the details specific to the dimension. It’s these placeholders which we need to retrieve from the metadata of the dimension in order to create the full merge statement.
We can retrieve the extended properties using the sys.extended_properties catalog view. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields etc. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the world a better place.
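To make the idea concrete without reproducing the proc, here is a minimal Python sketch that builds a Type 0/1 merge statement from column metadata. It is not the downloadable proc: the metadata is hard-coded in a dictionary rather than read from the database, and the audit default values are my assumption, borrowed from the values used in the Type 2 post.

```python
def build_merge(target, source, columns, audit_defaults):
    """Build a Type 0/1 MERGE statement from {column: SCD tag} metadata."""
    bk = [c for c, tag in columns.items() if tag == "BK"]
    type1 = [c for c, tag in columns.items() if tag == "1"]
    audit = [c for c, tag in columns.items() if tag == "Audit"]
    # The PK is an identity column, so it is never inserted or updated.
    data_cols = [c for c, tag in columns.items() if tag in ("BK", "0", "1")]
    if not bk:
        raise ValueError("at least one business key (BK) column is required")

    on = " AND ".join(f"Target.[{c}] = Source.[{c}]" for c in bk)
    insert_cols = ", ".join(f"[{c}]" for c in data_cols + audit)
    insert_vals = ", ".join(
        [f"Source.[{c}]" for c in data_cols] + [audit_defaults[c] for c in audit]
    )

    lines = [f"MERGE {target} AS Target", f"USING {source} AS Source", f"ON {on}"]
    if type1:
        changed = " OR ".join(f"Target.[{c}] <> Source.[{c}]" for c in type1)
        updates = ", ".join(f"[{c}] = Source.[{c}]" for c in type1)
        lines += [f"WHEN MATCHED AND ({changed})", f"THEN UPDATE SET {updates}"]
    lines += [
        "WHEN NOT MATCHED BY TARGET",
        f"THEN INSERT ({insert_cols})",
        f"VALUES ({insert_vals});",
    ]
    return "\n".join(lines)


# Metadata mirroring the extended properties set on dim.Customer.
columns = {
    "CustomerKey": "PK", "Email": "BK", "FirstName": "1", "LastName": "1",
    "DoB": "1", "Sex": "1", "MaritalStatus": "1", "FirstCreated": "1",
    "IsRowCurrent": "Audit", "ValidFrom": "Audit", "ValidTo": "Audit",
    "LastUpdated": "Audit",
}
# Assumed defaults for audit fields on insert (hypothetical, for illustration).
audit_defaults = {
    "IsRowCurrent": "1", "ValidFrom": "GETDATE()",
    "ValidTo": "'9999-12-31'", "LastUpdated": "GETDATE()",
}

sql = build_merge("[dim].[Customer]", "[etl].[Customer]", columns, audit_defaults)
print(sql)
```

Note that the simple `<>` comparison only works because every column here is NOT NULL; nullable columns would need NULL-safe comparisons.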
I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:
It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical, some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not; the critical thing is that all of your fields, if they do exist, should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.
There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing out the statement. If you only want to use this to automate the development process then this allows you to do just that, you can then just copy and paste the resulting statement into SSIS or into a stored proc.
The automated generation of T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how you can expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.
Download the SQL scripts here
In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.
In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…
What is a Slowly Changing Dimension (SCD)?
If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place; I’m assuming a moderate level of experience of SCDs here. Check out Wikipedia for some background but, in short, they manage the tracking of attribute history in dimensional data warehouses.
Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.
- Type 0 – Ignore updates
- Type 1 – Only keep latest version
- Type 2 – Track history by creating a new row
Type 2 is commonly stored in a fashion similar to this.
Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.
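As a quick illustration of how ValidFrom and ValidTo are used, here is a small Python sketch of a point-in-time lookup. The two rows are invented stand-ins for the Jane example (the names, dates and keys are hypothetical), and treating ValidTo as exclusive is my assumption.

```python
from datetime import datetime

# Hypothetical Type 2 rows for one customer (values invented for illustration).
rows = [
    {"CustomerKey": 1, "Email": "jane@example.com", "LastName": "Smith",
     "IsRowCurrent": 0,
     "ValidFrom": datetime(2010, 1, 1), "ValidTo": datetime(2012, 6, 1)},
    {"CustomerKey": 2, "Email": "jane@example.com", "LastName": "Jones",
     "IsRowCurrent": 1,
     "ValidFrom": datetime(2012, 6, 1), "ValidTo": datetime(9999, 12, 31)},
]


def version_at(rows, email, when):
    """Return the row for `email` that was valid at `when`, or None."""
    for row in rows:
        # Half-open interval, so the changeover instant belongs to the new row.
        if row["Email"] == email and row["ValidFrom"] <= when < row["ValidTo"]:
            return row
    return None


print(version_at(rows, "jane@example.com", datetime(2011, 3, 1))["LastName"])  # Smith
print(version_at(rows, "jane@example.com", datetime(2015, 3, 1))["LastName"])  # Jones
```

A fact row dated 2011 therefore joins to the maiden-name version of the customer, and anything after the change joins to the married-name version.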
That’s enough of that, let’s get on with doing this using Merge
Using Merge to load SCD
The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.
We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance as we go. First let’s figure out what logic we want to perform:
- If the record doesn’t exist, create it
- If the record does exist
  - Type 0 fields – ignore
  - Type 1 fields – update fields
  - Type 2 fields – terminate existing record, insert a new record
- If the record exists in the dimension, but not in the updated source file – terminate record
The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.
The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. Firstly we terminate the existing row then we have to insert a new row.
The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.
We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.
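If the two-step pattern is hard to picture, this Python sketch mimics it on in-memory rows. It is only an analogy for the T-SQL: `merge` plays the part of the MERGE statement with its OUTPUT clause, and `load_type2` plays the wrapping INSERT. The field names follow the post, while the sample customers are invented.

```python
from datetime import datetime

END_OF_TIME = datetime(9999, 12, 31)
TYPE2_FIELDS = ("FirstName", "LastName", "Title")


def new_row(src, now):
    """Build a current dimension row from a source row."""
    return {**src, "IsRowCurrent": 1, "ValidFrom": now,
            "ValidTo": END_OF_TIME, "LastUpdated": now}


def merge(dimension, staging, now):
    """Step 1: terminate changed rows, insert brand new ones, and return
    (action, source_row) pairs, playing the role of the OUTPUT clause."""
    current = {r["Email"]: r for r in dimension if r["IsRowCurrent"] == 1}
    output = []
    for src in staging:
        tgt = current.get(src["Email"])
        if tgt is None:
            dimension.append(new_row(src, now))
            output.append(("INSERT", src))
        elif any(tgt[f] != src[f] for f in TYPE2_FIELDS):
            tgt.update(IsRowCurrent=0, ValidTo=now, LastUpdated=now)
            output.append(("UPDATE", src))
    return output


def load_type2(dimension, staging, now):
    """Step 2: insert a new current row for each source row that caused an UPDATE."""
    for action, src in merge(dimension, staging, now):
        if action == "UPDATE":
            dimension.append(new_row(src, now))


# Hypothetical data: Jane changes her name, Bob is a brand new customer.
dimension = [new_row({"Email": "jane@example.com", "FirstName": "Jane",
                      "LastName": "Smith", "Title": "Miss"}, datetime(2010, 1, 1))]
staging = [
    {"Email": "jane@example.com", "FirstName": "Jane", "LastName": "Jones", "Title": "Mrs"},
    {"Email": "bob@example.com", "FirstName": "Bob", "LastName": "Brown", "Title": "Mr"},
]
load_type2(dimension, staging, datetime(2012, 6, 1))

for row in dimension:
    print(row["Email"], row["LastName"], row["IsRowCurrent"])
```

The dimension ends up with three rows: the terminated Jane Smith row, plus current rows for Bob and for Jane Jones.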
Let’s look at an example. Download the code here which will create the necessary tables and data to work on.
Main Merge Statement
We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:
- We include IsRowCurrent into the joining clause. We only ever want to update the current records, not the history.
- DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a type 1 change, if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 later on
- The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated)
MERGE Customer AS [Target]
USING StagingCustomer AS [Source]
ON Target.Email = Source.Email
   AND Target.IsRowCurrent = 1
WHEN MATCHED AND
    (
        Target.FirstName <> Source.FirstName
     OR Target.LastName <> Source.LastName
     OR Target.Title <> Source.Title
    )
    THEN UPDATE SET
        IsRowCurrent = 0
       ,LastUpdated = GETDATE()
       ,ValidTo = GETDATE()
WHEN NOT MATCHED BY TARGET
    THEN INSERT (
        FirstName
       ,LastName
       ,Title
       ,DoB
       ,Email
       ,LastUpdated
       ,IsRowCurrent
       ,ValidFrom
       ,ValidTo
    ) VALUES (
        Source.FirstName
       ,Source.LastName
       ,Source.Title
       ,Source.DoB
       ,Source.Email
       ,GETDATE()      --LastUpdated
       ,1              --IsRowCurrent
       ,GETDATE()      --ValidFrom
       ,'9999-12-31'   --ValidTo
    )
WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
    THEN UPDATE SET
        IsRowCurrent = 0
       ,LastUpdated = GETDATE()
       ,ValidTo = GETDATE()
The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.
The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.
The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.
*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent=1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.
We then add an OUTPUT clause to the end of the statement
OUTPUT $action AS Action ,Source.*
The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $Action as an extra field which will identify, for each row, whether it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATEs, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*
We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.
INSERT INTO Customer (
    FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,LastUpdated
    ,IsRowCurrent
    ,ValidFrom
    ,ValidTo
    )
SELECT
    FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,GETDATE() --LastUpdated
    ,1 --IsRowCurrent
    ,GETDATE() --ValidFrom
    ,'9999-12-31' --ValidTo
FROM (
    MERGE Customer AS [Target]
    USING StagingCustomer AS [Source]
        ON Target.Email = Source.Email
        AND Target.IsRowCurrent = 1
    WHEN MATCHED AND
        (
            Target.FirstName <> Source.FirstName
            OR Target.LastName <> Source.LastName
            OR Target.Title <> Source.Title
        )
        THEN UPDATE SET
            IsRowCurrent = 0
            ,LastUpdated = GETDATE()
            ,ValidTo = GETDATE()
    WHEN NOT MATCHED BY TARGET
        THEN INSERT (
            FirstName
            ,LastName
            ,Title
            ,DoB
            ,Email
            ,LastUpdated
            ,IsRowCurrent
            ,ValidFrom
            ,ValidTo
        ) VALUES (
            Source.FirstName
            ,Source.LastName
            ,Source.Title
            ,Source.DoB
            ,Source.Email
            ,GETDATE() --LastUpdated
            ,1 --IsRowCurrent
            ,GETDATE() --ValidFrom
            ,'9999-12-31' --ValidTo
        )
    WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
        THEN UPDATE SET
            IsRowCurrent = 0
            ,LastUpdated = GETDATE()
            ,ValidTo = GETDATE()
    OUTPUT $action AS Action
        ,[Source].*
    ) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
    AND Email IS NOT NULL
;
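Note that the outer WHERE clause filters the MERGE output so that we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address. The Email IS NOT NULL filter also removes the rows terminated by the ‘When Not Matched By Source’ section: these are reported as ‘UPDATE’ as well, but they have no matching Source row, so all of their Source fields are NULL.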
So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.
We have two options here:
- Update all historical records to the new value
- Update only the current record to the new value
These are obviously only valid when there is a mix of Type 1 and Type 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct: all history (of Type 1 fields) is lost. The second option can be valid when it would be beneficial to keep a limited history of Type 1 fields.
This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.
UPDATE C
SET DoB = SC.DoB
    ,LastUpdated = GETDATE()
FROM Customer C
INNER JOIN StagingCustomer SC
    ON C.Email = SC.Email
    --AND C.IsRowCurrent = 1 --Optional
    AND C.DoB <> SC.DoB
This block of code updates the Type 1 attributes (in this case, DoB). The commented-out IsRowCurrent check (line 7) is optional: include it if you only want to update the current record, or leave it out to update all records.
So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.
I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.
This is now my preferred approach to loading Type 2 SCDs. Slightly faster methods may be available but, as we’ll see in later blog posts, this one is very quick to implement and incredibly flexible, as it can be controlled entirely from metadata.
Long live the Merge statement!
Update: 9th Nov 2019
Added a follow-up post to demonstrate how to capture insert/update/delete counts as outputs from the merge statement, which you can read here.
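As a rough sketch of one way such counts can be captured (not necessarily the approach taken in the follow-up post), the $action values can be written to a table variable with OUTPUT ... INTO and then grouped. The @MergeActions variable and its MergeAction column are hypothetical names, and the MERGE is a cut-down, standalone version that checks only Title.

```sql
-- Sketch only: count the rows handled by each MERGE action.
-- @MergeActions and MergeAction are hypothetical names.
DECLARE @MergeActions TABLE (MergeAction NVARCHAR(10) NOT NULL);

-- Cut-down MERGE for illustration: it terminates changed rows without
-- inserting their new versions, so don't run it as-is against real data.
MERGE Customer AS [Target]
USING StagingCustomer AS [Source]
    ON Target.Email = Source.Email
    AND Target.IsRowCurrent = 1
WHEN MATCHED AND Target.Title <> Source.Title
    THEN UPDATE SET
        IsRowCurrent = 0
        ,LastUpdated = GETDATE()
        ,ValidTo = GETDATE()
OUTPUT $action INTO @MergeActions (MergeAction)
;

-- One row per action type ('INSERT', 'UPDATE', 'DELETE') that occurred
SELECT MergeAction
    ,COUNT(*) AS RowsAffected
FROM @MergeActions
GROUP BY MergeAction;
```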
PowerPivot Vs QlikView 101 – SQLBits Video
The video is now available from my PowerPivot and QlikView talk at SQLBits 9 in Liverpool in September 2011. You can download or watch the video here.
In this 1-hour session I create interactive dashboards from scratch in both PowerPivot and QlikView, showing how to set up the data model, overcome common pitfalls and build the dashboards. I create almost identical dashboards in both systems and highlight the pros and cons of each system.
You can find the scripts and code to go with this video in this blog post
SQLBits 10 is coming!!
Today the SQLBits organisers have announced that SQLBits 10 will be held in London from 29th to 31st March 2012 at the Novotel London West. It’s going to be even bigger and better, so keep an eye on the website and get your tickets early.
Thank you to the SQLBits committee, speakers, helpers and attendees, for making SQLBits 9 one of the best yet. What a great example SQLBits is of the power of the SQL Server community coming together to help each other, and enjoy a few beers in the process!
I was delighted to have my “PowerPivot & QlikView 101” session chosen by the committee, which I presented on Saturday afternoon. Thank you to all those who attended, great to see such a good number there. If you haven’t yet submitted your feedback for my session, and the event as a whole, please do take the time to do it. It helps everyone improve the quality of the event for next time.
For those who attended my session, here are the slides, with a summary of the results and code etc.
QlikView Script: Download the QlikView Script here
PowerPivot Workbook: Download the PowerPivot workbook here
Session Video: The session recording isn’t yet available; I’ll post a link here when it’s ready.
Sales Amount Visual Cue for the green shading
Upper: (SUM(TOTAL SalesAmount) / COUNT(TOTAL SalesTerritoryRegion))*1.2
Lower: (SUM(TOTAL SalesAmount) / COUNT(TOTAL SalesTerritoryRegion))*0.8
Sales Amount blue bar Gauge maximum
Sales Amount Year to Date
Sales Amount Previous Year
PowerPivot DAX Expressions
Sales Amount Year to Date
=TOTALYTD(SUM(FactInternetSales[SalesAmount]), DimOrderDate[FullDateAlternateKey], ALL(DimOrderDate))
Sales Amount Previous Year
=CALCULATE(SUM(FactInternetSales[SalesAmount]), SAMEPERIODLASTYEAR(DimOrderDate[FullDateAlternateKey]), ALL(DimOrderDate))
If you have any questions about any of this, please get in touch.
It seems like only yesterday that the SQLBits crew put on their awesome 8th conference in Brighton. The next conference has just been announced: SQLBits will move to Liverpool for SQLBits 9 – “Query Across The Mersey” (…Really?!) between September 29th and October 1st.
Registration’s not yet open, but you can find out more at www.SQLBits.com
I’ve already submitted two sessions to present, “PowerPivot & QlikView 101” and “Advanced data modelling for data warehousing and cubes”. Make sure you vote for them when voting opens! You can read full summaries of all sessions submitted so far here.
SQL User Groups
We’ve also got a busy time coming up with the UK SQL Server user groups. I’ll be running the next Birmingham user group the week after SQLBits, on Monday 3rd October. The agenda and speakers have yet to be set, but it should be a goodun!
I’ll also be speaking at the Southampton user group in 2 weeks’ time, on Wednesday 6th July. I’ll be presenting my 3-part blog post on automating SSAS cube documentation. Matt Whitfield will also be presenting, so it should be a good night. Now stop reading this and go and register!!