In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.
In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…
What is a Slowly Changing Dimension (SCD)?
If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place, I’m assuming a moderate level of experience of SCDs here, check out Wikipedia for some background, but in short, they manage the tracking of attribute history in dimensional data warehouses.
Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.
- Type 0 – Ignore updates
- Type 1 – Only keep latest version
- Type 2 – Track history by creating a new row
Type 2 is commonly stored in a fashion similar to this.
Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.
That’s enough of that, let’s get on with doing this using Merge
Using Merge to load SCD
The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.
We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance as we go. First lets figure out what logic we want to perform
- If the record doesn’t exist, create it
- If the record does exist
- Type 0 fields – ignore
- Type 1 fields – update fields
- Type 2 fields – terminate existing record, insert a new record
- If the record exists in the dimension, but not in the updated source file – terminate record
The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.
The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. Firstly we terminate the existing row then we have to insert a new row.
The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.
We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.
Let’s look at an example. Download the code here which will create the necessary tables and data to work on.
Main Merge Statement
We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:
- We include IsRowCurrent into the joining clause. We only ever want to update the current records, not the history.
- DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a type 1 change, if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 later on
- The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated)
MERGE Customer AS [Target] USING StagingCustomer AS [Source] ON Target.Email = Source.Email AND Target.IsRowCurrent = 1 WHEN MATCHED AND ( Target.FirstName <> Source.FirstName OR Target.LastName <> Source.LastName OR Target.Title <> Source.Title ) THEN UPDATE SET IsRowCurrent = 0 ,LastUpdated = GETDATE() ,ValidTo = GETDATE() WHEN NOT MATCHED BY TARGET THEN INSERT ( FirstName ,LastName ,Title ,DoB ,Email ,LastUpdated ,IsRowCurrent ,ValidFrom ,ValidTo ) VALUES ( Source.FirstName ,Source.LastName ,Source.Title ,Source.DoB ,Source.Email ,GETDATE() --LastUpdated ,1 --IsRowCurrent ,GETDATE() --ValidFrom ,'9999-12-31' --ValidTo ) WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1 THEN UPDATE SET IsRowCurrent = 0 ,LastUpdated = GETDATE() ,ValidTo = GETDATE()
The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.
The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.
The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.
*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent=1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.
We then add an OUTPUT clause to the end of the statement
OUTPUT $action AS Action ,Source.*
The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $Action as an extra field which will identify, for each row, whether
it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATES, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*
We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.
INSERT INTO Customer ( FirstName ,LastName ,Title ,DoB ,Email ,LastUpdated ,IsRowCurrent ,ValidFrom ,ValidTo ) SELECT FirstName ,LastName ,Title ,DoB ,Email ,GETDATE() --LastUpdated ,1 --IsRowCurrent ,GETDATE() --ValidFrom ,'9999-12-31' --ValidTo FROM ( MERGE Customer AS [Target] USING StagingCustomer AS [Source] ON Target.Email = Source.Email AND Target.IsRowCurrent = 1 WHEN MATCHED AND ( Target.FirstName <> Source.FirstName OR Target.LastName <> Source.LastName OR Target.Title <> Source.Title ) THEN UPDATE SET IsRowCurrent = 0 ,LastUpdated = GETDATE() ,ValidTo = GETDATE() WHEN NOT MATCHED BY TARGET THEN INSERT ( FirstName ,LastName ,Title ,DoB ,Email ,LastUpdated ,IsRowCurrent ,ValidFrom ,ValidTo ) VALUES ( Source.FirstName ,Source.LastName ,Source.Title ,Source.DoB ,Source.Email ,GETDATE() --LastUpdated ,1 --IsRowCurrent ,GETDATE() --ValidFrom ,'9999-12-31' --ValidTo ) WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1 THEN UPDATE SET IsRowCurrent = 0 ,LastUpdated = GETDATE() ,ValidTo = GETDATE() OUTPUT $action AS Action ,[Source].* ) AS MergeOutput WHERE MergeOutput.Action = 'UPDATE' AND Email IS NOT NULL ;
Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.
So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.
We have two options here;
- Update all historical records to the new value
- Update only the current record to the new value
These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.
This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.
UPDATE C SET DoB = SC.DoB ,LastUpdated = GETDATE() FROM Customer C INNER JOIN StagingCustomer SC ON C.Email = SC.Email --AND C.IsRowCurrent = 1 --Optional AND C.DoB <> SC.DoB
This block of code updates the Type 1 attributes (in this case, DoB). The line 7 (the IsRowCurrent) check is optional depending on whether you only want to update current or all records.
So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.
I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.
This is now my preferred approach to loading Type 2 SCDs, slightly faster methods may be available, but as we’ll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.
Long live the Merge statement!
The Cloud is pretty much top of the list of the corporate buzzword bingo phrases at the moment. It’s clear what the benefits are from a SAAS perspective, but it’s been less clear how the Business Intelligence world will start to make use of it.
Data Warehousing, Management Information, Cubes, Data Mining etc. all involve a lot of data, usually gigabytes or terrabytes even for medium sized organisations. Internet connectivity (speed and reliability) is obviously going to be a major concern and limiting factor however we can assume that connectivity will continue to improve dramatically over the next few years.
Security is a major concern; when you combine all of your company’s information into a single place (accounts, sales, customers, etc.) it raises a large question of how the security is managed when you let this information out of your firewalled LAN and host it on the internet.
It will of course bring a number of benefits incuding reduced infrastructure/capital costs, improved reliability and redundancy of hardware, increased accessibility for the mobile workforce etc. etc., and these should not be overlooked as they will no doubt provide a large incentive for some companies.
I’ve no doubt that the cloud is coming to BI, but I’d think carefully about what actual benefits it brings and how they trade off against any problems it may also bring.
There’s an interesting webcast panel discussion about BI and Cloud Computing below, including (among others) Donald Farmer from Microsoft discussing some of the issues above.
Sourced From: BizIntelligence.tv on blogs.msdn.com
The cloud is coming…
Thank you to the BCS Shropshire branch for opening your doors to the Purple Frog team. We’re delighted to be given any chance to talk about Business Intelligence, and spread the word to increase awareness of what it is and how it all works. We find it a fascinating world to work in, and hope that there were some converts in the audience!
For those that asked, here’s a copy of the slides for you to download. As the presentation was largely demonstration based, we’ve taken some screenshots of the key points of the demo and included them within the slides. We did try and take a video of the event, but unfortunately the camera didn’t want to play ball so we’ll have to stick with the slides.
We’re always looking for other groups that are interested in learning about BI, and are more than happy to run talks (techy or not) so let us know if you’re interested.
On Monday 20th September, Alex & Hollie from Purple Frog Systems will be giving a talk on business intelligence to the Shropshire branch of the British Computer Society.
The event is free to attend, even for non BCS members, and will be held from 6.15pm at the Telford campus of Wolverhampton Uni.
We’ll be providing an introduction to BI, explaining what it is, how it works and how it can benefit your organisation.
We’ll present a fully working demonstration of how to turn a simple list of customers/members into a highly interactive information system, using the latest in spatial mapping techniques and OLAP cubes.
Register for free at the BCS events website
Anyone involved in the Business Intelligence space has had their head in the sand if they are not aware of the long running, and more often than not misunderstood, debate between the two conceptual models of data warehouse design.
Bill Inmon has recently posted an article on www.b-eye-network.com discussing the matter, and to his credit, has tried to put forward a number of balanced pros and cons of each methodology.
I’ll state now that I’m a big advocate of a hybrid approach, taking elements from both Imnon and Kimball camps and selecting the right approach for each unique project depending on the requirements and purpose of the warehouse. I therefore appreciate both sides of the debate, and am not going to jump to the defence of either side. Having said that, most projects more often than not have a weighting towards Kimball due to the time pressures imposed by clients.
Bill nicely simmarises the key elements of each approach as:
The Kimball approach to database design and development is typified by the star schema design of databases. There are fact tables and dimension tables. In a complex environment, there are snowflake structures, which are merely extended versions of the star schema. In order to resolve differences of granularity between fact tables, conformed dimensions are used. Staging areas are occasionally used to capture raw data before the placement of the data into a Kimball style data mart.
The Inmon approach to data warehousing centers around a relational, non redundant, granular, integrated design of detailed data. From this base of data, data marts are spun off to different departments according to their individual analytical needs. In recent vintage, with DW 2.0, the Inmon approach calls for the recognition of the life cycle of data within the data warehouse, the inclusion and integration of unstructured data within the data warehouse, and the close integration of metadata into the data warehouse infrastructure.
He then proceeds to present a brief comparitive assessment of the pros and cons of each. I don’t entirely agree with the black and white nature of the comparisons as most items are a shade of grey in both camps, but it certainly provides a good starting point for those that are starting out in the BI field and want to know more about whhat this debate is alll about.
Possibly the most interesting item (certainly from a business intelligence consultant’s perspective) is Bill’s renewed call for an open, public debate between Bill and Ralph – It gets my vote!
Thanks to Graham Bradfield at Computer People for pointing me towards this article in his BI newsletter.
View the full article here: Data Warehousing: Kimball vs Inmon
Leap before you can land…
Research scientists at the Southern Illinois University have concluded that primitive frogs learned to jump before they learned to land. The result being a rather inelegant crash landing. I’m sure the frogs were happy that they achieved their goal of getting from A to B, but I do feel sorry for them having to endure the presumably painful landing, not to mention the indignity of having the rest of the pond witness the proceedings!
What struck me about this story, aside from the obvious frog reference (albeit not purple!), was the parallel to Business Intelligence projects that we encounter time and time again. BI systems that are put in place, but whilst the project teams are basking in the glory of accomplishing their goal, often sporting a couple of bruises picked up along the way, the wider business looks on not really knowing what has just happened or understanding how the company or their department are going to benefit from this leap in technology.
Although BI, as both a business function and technical toolset, has long since reached maturity, its adoption within SMEs is still surprisingly in its youth. Most companies who embark on a BI project seem to do so either because they have been sold on a fantastic vision of the future of information management, or because they have a specific business need to solve. Whatever the reasons for initiating a BI project, there is often very little thought given to how this new information system is likely to impact the wider business, or how the system can be utilised to provide far more benefit than the original project scope called for.
There are two aspects to this impact, the pain and the gain.
When the company’s data is cleaned, consolidated and remodelled to make reporting and analysis more simple (one of the primary goals of a BI project), the tendency is to use that system to reproduce the company’s old static reports but in a slightly more glossy fashion, only faster and more accurate. This stops far short of what a good data warehouse or cube is capable of. Enter self-service BI.
Self-service BI essentially allows your employees to have a level of (secured) access to the underlying BI data, combined with the right tools to perform their own analysis. This enables them to not just consume their old static reports, but to pro-actively go digging for information, knowledge and a greater understanding of your company. Gone are the days when requiring a change to a report has to be documented, requested, specified, programmed, tested & released. A well designed data warehouse and cube, along with the right client tools (often just Excel), enables a user to just do it themselves with the bare minimum of training (if any), whilst retaining the confidence that the resulting report is accurate; the warehouse/cube take care of ensuring the underlying business logic is right.
A data warehouse is, by definition, a combined set of data from a variety of source systems. This means that any change to the source systems must be assessed for their impact on the warehouse. After putting a sparkly new BI infrastructure in place, you don’t want it to fall over because someone changed the structure of a source database or extract file! This adds a layer of change control that may not have existed before, and can cause some headaches.
There is usually a moderately large requirement for business users’ time, both to help the BI architect to understand and model the business properly, and also in testing.
Pain but no gain?
So, if the benefits such as self service BI (with dashboards, data mining, KPIs, PowerPivot etc. etc) are not adopted and utilised, the business is left having made an impressive technology leap, but is then stuck with the impact on source systems, a bill, and confusion from staff who don’t see the benefit of getting their old reports via a different method.
Evolution taught the frog that by landing properly, he can quickly jump again to go even further, and avoid the pain in the meantime.
Following the launch of SQL Server 2008 R2 this week, the ever dedicated SQL Bits team are putting on the next installment tomorrow with SQL Bits VI – ‘the Sixth Sets’. The Purple Froggers will of course be there, hopefully in time for a pre-conference bacon butty!
Also worth mentioning is ‘Microsoft Solutions for Business Intelligence’, a seminar that Microsoft are holding at their Reading campus on May 12 2010. Rafal Lukewiecki is speaking, anyone who has seen him before will know what an entertaining and knowlegable presenter he is. Whether you are currently implementing a BI solution or are just ‘BI curious’, a technical developer or a business manager, the day will provide an insight into the direction BI is heading in, and what we can expect in the years to come.
We hope to see you at either/both!
Purple Frog is presenting a session on data warehouse design concepts at the South Wales SQL Server User Group on Thursday 25th February 2010.
If you’re in the area and want to come along you can register for free here. Eversheds are hosting the event in their Cardiff office.
Adam Morton will be demonstrating an ETL control framework in SSIS, and Alex Whittles will be discussing the concepts of data warehousing and the fundamental differences in architecture between OLTP and OLAP systems.
Hope to see you there!
For any users of Analysis Services, if you haven’t already downloaded the Excel (2002/2003) addin you’re missing out.
It’s a free download from Microsoft which significantly expands Excel’s cube querying ability. Well recommended!
It really wouldn’t be fair to kick off the Frog-Blog without a shout out to Chris Hays and his superb ‘Sleazy Hacks’ site. If you want to push SQL Server Reporting Services further than anyone else, then Chris will definately have something of use to you. He designed the RDL report language – the guy knows what he’s talking about.