Well after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.
During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions. Specifically loading Type 2 SCDs using SSIS.
For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.
I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:
- PDF: Performance comparison of techniques to load Type 2 slowly changing dimensions in a Kimball style data warehouse using SSIS
The charts below shows the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.
The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.
Full test results and analysis in the PDF download above.
My previous few posts show how using a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes the performance is equivalent to more traditional approaches.
Hope this is of use. If you want to know a bit more without reading the full 99 pages & 23k words (who could blame you?!), then my SQLBits talk video is now on-line here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [When this post was written] but should be up soon. Keep checking on the BI PASS Chapter website.
In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.
In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…
What is a Slowly Changing Dimension (SCD)?
If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place, I’m assuming a moderate level of experience of SCDs here, check out Wikipedia for some background, but in short, they manage the tracking of attribute history in dimensional data warehouses.
Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.
- Type 0 – Ignore updates
- Type 1 – Only keep latest version
- Type 2 – Track history by creating a new row
Type 2 is commonly stored in a fashion similar to this.
Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.
That’s enough of that, let’s get on with doing this using Merge
Using Merge to load SCD
The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.
We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance as we go. First lets figure out what logic we want to perform
- If the record doesn’t exist, create it
- If the record does exist
- Type 0 fields – ignore
- Type 1 fields – update fields
- Type 2 fields – terminate existing record, insert a new record
- If the record exists in the dimension, but not in the updated source file – terminate record
The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.
The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. Firstly we terminate the existing row then we have to insert a new row.
The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.
We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.
Let’s look at an example. Download the code here which will create the necessary tables and data to work on.
Main Merge Statement
We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:
- We include IsRowCurrent into the joining clause. We only ever want to update the current records, not the history.
- DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a type 1 change, if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 later on
- The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated)
MERGE Customer AS [Target] USING StagingCustomer AS [Source] ON Target.Email = Source.Email AND Target.IsRowCurrent = 1 WHEN MATCHED AND ( Target.FirstName <> Source.FirstName OR Target.LastName <> Source.LastName OR Target.Title <> Source.Title ) THEN UPDATE SET IsRowCurrent = 0 ,LastUpdated = GETDATE() ,ValidTo = GETDATE() WHEN NOT MATCHED BY TARGET THEN INSERT ( FirstName ,LastName ,Title ,DoB ,Email ,LastUpdated ,IsRowCurrent ,ValidFrom ,ValidTo ) VALUES ( Source.FirstName ,Source.LastName ,Source.Title ,Source.DoB ,Source.Email ,GETDATE() --LastUpdated ,1 --IsRowCurrent ,GETDATE() --ValidFrom ,'9999-12-31' --ValidTo ) WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1 THEN UPDATE SET IsRowCurrent = 0 ,LastUpdated = GETDATE() ,ValidTo = GETDATE()
The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.
The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.
The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.
*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent=1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.
We then add an OUTPUT clause to the end of the statement
OUTPUT $action AS Action ,Source.*
The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $Action as an extra field which will identify, for each row, whether
it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATES, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*
We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.
INSERT INTO Customer ( FirstName ,LastName ,Title ,DoB ,Email ,LastUpdated ,IsRowCurrent ,ValidFrom ,ValidTo ) SELECT FirstName ,LastName ,Title ,DoB ,Email ,GETDATE() --LastUpdated ,1 --IsRowCurrent ,GETDATE() --ValidFrom ,'9999-12-31' --ValidTo FROM ( MERGE Customer AS [Target] USING StagingCustomer AS [Source] ON Target.Email = Source.Email AND Target.IsRowCurrent = 1 WHEN MATCHED AND ( Target.FirstName <> Source.FirstName OR Target.LastName <> Source.LastName OR Target.Title <> Source.Title ) THEN UPDATE SET IsRowCurrent = 0 ,LastUpdated = GETDATE() ,ValidTo = GETDATE() WHEN NOT MATCHED BY TARGET THEN INSERT ( FirstName ,LastName ,Title ,DoB ,Email ,LastUpdated ,IsRowCurrent ,ValidFrom ,ValidTo ) VALUES ( Source.FirstName ,Source.LastName ,Source.Title ,Source.DoB ,Source.Email ,GETDATE() --LastUpdated ,1 --IsRowCurrent ,GETDATE() --ValidFrom ,'9999-12-31' --ValidTo ) WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1 THEN UPDATE SET IsRowCurrent = 0 ,LastUpdated = GETDATE() ,ValidTo = GETDATE() OUTPUT $action AS Action ,[Source].* ) AS MergeOutput WHERE MergeOutput.Action = 'UPDATE' AND Email IS NOT NULL ;
Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.
So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.
We have two options here;
- Update all historical records to the new value
- Update only the current record to the new value
These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.
This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.
UPDATE C SET DoB = SC.DoB ,LastUpdated = GETDATE() FROM Customer C INNER JOIN StagingCustomer SC ON C.Email = SC.Email --AND C.IsRowCurrent = 1 --Optional AND C.DoB <> SC.DoB
This block of code updates the Type 1 attributes (in this case, DoB). The line 7 (the IsRowCurrent) check is optional depending on whether you only want to update current or all records.
So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.
I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.
This is now my preferred approach to loading Type 2 SCDs, slightly faster methods may be available, but as we’ll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.
Long live the Merge statement!