In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.
In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…
What is a Slowly Changing Dimension (SCD)?
If you want a full explanation of slowly changing dimensions then you've come to the wrong place; I'm assuming a moderate level of experience with SCDs here, so check out Wikipedia for some background. In short, they manage the tracking of attribute history in dimensional data warehouses.
Most data warehouses contain type 0, 1 and 2 SCDs, so those are the ones we'll cover here:
- Type 0 – Ignore updates
- Type 1 – Only keep latest version
- Type 2 – Track history by creating a new row
Type 2 is commonly stored in a fashion similar to this.
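For example, a customer dimension might contain rows like these (the values are illustrative, based on the sample data used later in this post):

CustomerKey FirstName LastName Title Email            IsRowCurrent ValidFrom        ValidTo
5           Jane      Smith    Miss  Jane@emailme.com 0            2010-01-01 12:00 2012-05-10 09:15
6           Jane      King     Mrs   Jane@emailme.com 1            2012-05-10 09:15 9999-12-31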
Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.
That's enough of that; let's get on with doing this using Merge.
Using Merge to load SCD
The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.
We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance it as we go. First, let's figure out what logic we want to perform:
- If the record doesn’t exist, create it
- If the record does exist
- Type 0 fields – ignore
- Type 1 fields – update fields
- Type 2 fields – terminate existing record, insert a new record
- If the record exists in the dimension, but not in the updated source file – terminate record
The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.
The main difference here, over a basic upsert, is the handling of Type 2s; we need to perform two separate operations on the dimension for every incoming record: first we terminate the existing row, then we insert a new row.
The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.
We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.
Let’s look at an example. Download the code here which will create the necessary tables and data to work on.
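If you just want to follow along without the download, the two tables look roughly like this (a sketch inferred from the column lists used below; the data types are assumptions, and the download script is definitive):

CREATE TABLE dbo.Customer
(
     CustomerKey   INT IDENTITY(1,1) PRIMARY KEY
    ,FirstName     VARCHAR(50)  NOT NULL
    ,LastName      VARCHAR(50)  NOT NULL
    ,Title         VARCHAR(10)  NOT NULL
    ,DoB           DATE         NOT NULL
    ,Email         VARCHAR(100) NOT NULL  -- business key
    ,LastUpdated   DATETIME     NOT NULL
    ,IsRowCurrent  BIT          NOT NULL
    ,ValidFrom     DATETIME     NOT NULL
    ,ValidTo       DATETIME     NOT NULL
);

CREATE TABLE dbo.StagingCustomer
(
     FirstName  VARCHAR(50)  NOT NULL
    ,LastName   VARCHAR(50)  NOT NULL
    ,Title      VARCHAR(10)  NOT NULL
    ,DoB        DATE         NOT NULL
    ,Email      VARCHAR(100) NOT NULL
);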
Main Merge Statement
We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:
- We include IsRowCurrent in the joining clause. We only ever want to update the current records, not the history.
- DoB is removed from the WHEN MATCHED clause. We're going to treat DoB as a Type 1 change; if it's updated then we assume it's a correction rather than a new date of birth which should be tracked. We'll deal with this Type 1 handling later on.
- The UPDATE statement in the WHEN MATCHED clause doesn't change the fields; it only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated).
MERGE Customer AS [Target]
USING StagingCustomer AS [Source]
    ON Target.Email = Source.Email
    AND Target.IsRowCurrent = 1
WHEN MATCHED AND
    (
        Target.FirstName <> Source.FirstName
        OR Target.LastName <> Source.LastName
        OR Target.Title <> Source.Title
    )
    THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated = GETDATE()
        ,ValidTo = GETDATE()
WHEN NOT MATCHED BY TARGET
    THEN INSERT
    (
         FirstName
        ,LastName
        ,Title
        ,DoB
        ,Email
        ,LastUpdated
        ,IsRowCurrent
        ,ValidFrom
        ,ValidTo
    )
    VALUES
    (
         Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,Source.Email
        ,GETDATE()    --LastUpdated
        ,1            --IsRowCurrent
        ,GETDATE()    --ValidFrom
        ,'9999-12-31' --ValidTo
    )
WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
    THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated = GETDATE()
        ,ValidTo = GETDATE()
The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.
The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.
The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.
*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent=1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.
We then add an OUTPUT clause to the end of the statement:
OUTPUT $action AS Action
      ,Source.*
The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $Action as an extra field which will identify, for each row, whether
it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATES, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*
We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.
INSERT INTO Customer
(
     FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,LastUpdated
    ,IsRowCurrent
    ,ValidFrom
    ,ValidTo
)
SELECT
     FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,GETDATE()    --LastUpdated
    ,1            --IsRowCurrent
    ,GETDATE()    --ValidFrom
    ,'9999-12-31' --ValidTo
FROM
(
    MERGE Customer AS [Target]
    USING StagingCustomer AS [Source]
        ON Target.Email = Source.Email
        AND Target.IsRowCurrent = 1
    WHEN MATCHED AND
        (
            Target.FirstName <> Source.FirstName
            OR Target.LastName <> Source.LastName
            OR Target.Title <> Source.Title
        )
        THEN UPDATE SET
             IsRowCurrent = 0
            ,LastUpdated = GETDATE()
            ,ValidTo = GETDATE()
    WHEN NOT MATCHED BY TARGET
        THEN INSERT
        (
             FirstName
            ,LastName
            ,Title
            ,DoB
            ,Email
            ,LastUpdated
            ,IsRowCurrent
            ,ValidFrom
            ,ValidTo
        )
        VALUES
        (
             Source.FirstName
            ,Source.LastName
            ,Source.Title
            ,Source.DoB
            ,Source.Email
            ,GETDATE()    --LastUpdated
            ,1            --IsRowCurrent
            ,GETDATE()    --ValidFrom
            ,'9999-12-31' --ValidTo
        )
    WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
        THEN UPDATE SET
             IsRowCurrent = 0
            ,LastUpdated = GETDATE()
            ,ValidTo = GETDATE()
    OUTPUT $action AS Action
          ,[Source].*
) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
AND Email IS NOT NULL
;
Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.
So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are left out of the process entirely, so they're simply ignored. The only thing left is to manage the Type 1 fields.
We have two options here:
- Update all historical records to the new value
- Update only the current record to the new value
These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.
This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.
UPDATE C
SET  DoB = SC.DoB
    ,LastUpdated = GETDATE()
FROM Customer C
    INNER JOIN StagingCustomer SC
        ON C.Email = SC.Email
        --AND C.IsRowCurrent = 1 --Optional
        AND C.DoB <> SC.DoB
This block of code updates the Type 1 attributes (in this case, DoB). The commented-out IsRowCurrent check is optional, depending on whether you want to update only the current record or all records.
So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.
I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.
This is now my preferred approach to loading Type 2 SCDs. Slightly faster methods may be available but, as we'll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.
Long live the Merge statement!
Update: 9th Nov 2019
Added a follow-up post to demonstrate how to capture insert/update/delete counts as outputs from the merge statement, which you can read here.
Frog-Blog Out
In our SSIS packages we have used the MergeSCD component several times (home site: http://dimensionmergescd.codeplex.com/). It has proved to be quite efficient and it uses the Merge T-SQL command too.
Hi Andrei
Good point, thanks for posting.
I should clarify this – it looks like the Merge SCD component uses the SSIS Merge Join component, not the T-SQL merge command – similar name, quite different functionality…
Both offer very good performance, with not much to choose between them other than the specifics of the implementation. The later blog posts in this series will clarify my preference for T-SQL merge.
Hi Alex.
I think that there is a bug in your code.
1) For example:
In table Customer: ('Jane', 'Smith', 'Miss', '1984-07-19', 'Jane@emailme.com', '2010-01-01 12:00', 1, '2010-01-01 12:00', '9999-12-31')
Step 1: merge staging when the surname changes:
('Jane', 'King', 'Mrs', '1984-07-19', 'Jane@emailme.com')
Step 2: merge staging when the surname changes again:
('Jane', 'Jackson', 'Mrs', '1984-07-19', 'Jane@emailme.com')
And the bug is that after step 2 the ValidTo in the historical rows is the same, but it must be different:
CustomerKey FirstName LastName Title DoB Email LastUpdated IsRowCurrent ValidFrom ValidTo
5 Jane Smith Miss 1984-07-19 Jane@emailme.com 2012-05-10 20:16:29.530 0 2010-01-01 12:00:00.000 2012-05-10 20:16:29.530
6 Jane King Mrs 1984-07-19 Jane@emailme.com 2012-05-10 20:16:29.530 0 2012-05-10 20:15:40.920 2012-05-10 20:16:29.530
7 Jane Jackson Mrs 1984-07-19 Jane@emailme.com 2012-05-10 20:16:29.530 1 2012-05-10 20:16:29.530 9999-12-31 00:00:00.000
2) I don't know why, but when I make the update code like this:
WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
THEN UPDATE SET
IsRowCurrent = 0
,LastUpdated = GETDATE()
,ValidTo = GETDATE()
everything works fine and the ValidTo values are different.
Hi Sergey,
Thanks for the comment. Looks like you’re probably missing the Target.IsRowCurrent=1 check from the USING clause in the merge statement – if you add that in then you shouldn’t need to have it in the WHEN NOT MATCHED clause.
Let me know if this fixes it,
Alex
Thanks for the reply. But in the example that I showed you, I copied your code, where Target.IsRowCurrent = 1 is in the USING clause.
I found this error in the testing phase of my own merge SCD 2 based on your code. In the example above, I just ran your script several times to see whether the bug was in my code only or in the original. Try running it a few times and you'll see the bug.
I’ve just had a closer look and you’re absolutely right – very well spotted! I’ll correct the code in the post when I’m back in the office.
It’s because the WHEN NOT MATCHED BY SOURCE finds all existing rows not matched by the USING clause, which includes rows that have already been terminated. Obvious now you’ve pointed it out!
Thanks for letting me know, much appreciated.
Alex
Hi Alex,
I would like to know about the performance difference between the Merge statement and the SCD component from Ralph Kimball (on Codeplex). I've never tested with a large dimension table.
Regards,
Jump
Hi Alex,
Just seen your thesis. That's exactly what I wanted to know.
Thanks again
Jump
Hi Jump
Yeah the performance stats are all in my thesis in the next post.
For the benefit of others, the Kimball SCD component (now called the Dimension Merge SCD component) uses the Merge Join method (or something similar) in SSIS. I didn’t investigate the performance of the Dimension Merge SCD component directly but the performance stats of the Merge Join tests should be very similar.
Regards
Alex
Nice article Alex,
Though how do you deal with the issue of having foreign keys on the dimension table? Inserting into a table with an FK is not possible from a nested merge. Or do you generally not have any FK constraints on your DW?
I have seen the work-around of using a temp table prior to the insert, but was wondering if you knew of any other way.
JS.
Hi John
Yes that’s a very good point indeed, FKs do cause a problem with merge. In data warehouses however it’s commonplace to not enforce them. I create them with NOCHECK, so the relationships are present, but they’re not enforced. The referential integrity is enforced by the ETL, not by the database. This has the added benefit of speeding up bulk inserts…
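As a sketch of how an unenforced key can be declared (the fact table and constraint names here are illustrative, not from the post):

-- Add the constraint without validating the existing rows
ALTER TABLE dbo.FactSales WITH NOCHECK
    ADD CONSTRAINT FK_FactSales_Customer
    FOREIGN KEY (CustomerKey) REFERENCES dbo.Customer (CustomerKey);

-- Disable enforcement entirely, so the ETL owns referential integrity
ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_Customer;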
Alex
Hi,
Alex, have you posted the modified script mentioned above, where you said you would correct the code posted on May 10, 2012 at 22:01? Would you please let me know whether you have posted it or not?
Hi, yes the code in the main post has been updated, so it's good to go.
Alex
Hi Alex,
Great article. I wonder how you handle errors in the data? The whole statement may fail because of one bad row. In SSIS you can handle such a situation. How should the data warehouse be structured to benefit from this solution?
Hi Hali, great question.
There are two things to consider here. Firstly the data doesn’t go straight from the source into the dimension via merge. It still has to be processed by SSIS and stored in the etl staging table. Therefore any business logic checks can still be dealt with in the SSIS package, and any data insertion errors will fail when they’re inserted into the staging table. These can be dealt with exactly the same as they would be otherwise, by rejecting the whole batch, or by rejecting the single failed row.
Then we get to the merge. By this time, we know that the data is validated, verified and ready for insertion into the dimension, so the only errors can really be OS, SQL or corrupt data already in the table, in which case in my opinion you’d want the whole batch to roll back.
Does that make sense?
Alex
Thanks for answer.
And yes, that explains a lot. Does it mean that three areas are needed in the data warehouse: one with raw data, a second with cleaned data, and a third where the actual DW resides and the dimensions are stored? Should all of them contain a full copy of all transaction data? What is your preferred way of doing this?
Exactly yes. I tend to have a staging database for raw data, then two schemas in the data warehouse, etl and dim. I can then have two copies of each dimension table etl.dimension and dim.dimension. The merge proc then merges all contents from etl into dim. Only the actual dimension needs to contain the full history of data. Both staging and the etl tables only ever need to have the incremental changes being loaded.
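As a sketch, that layout looks something like this (schema names as described above; the table structures match the main post):

CREATE SCHEMA etl;  -- merge staging: near-identical clones of each dimension,
                    -- holding only the incremental changes being loaded
GO
CREATE SCHEMA dim;  -- the real dimensions, holding the full history
GO
-- The merge proc then merges etl.Customer into dim.Customer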
Thanks again.
I found some other interesting articles on your blog, and I have just subscribed to it.
Best Regards,
Karol
Thanks for the explanation, it led me in the right direction.
Hi Alex,
GREAT article! I need some help creating a dim table to track the employment status of an employee. Basically I have to look in two tables:
Employee_details
(employee_details_id, employment_Start_Date, Employment_end_date) and
Employee_status_history table
(Status_hitory_Id,employee_details_id,StatusName,StatusStartDate,StatusEndDate)
There is no entry in the employment_status_history table when a person joins the company; it only gets an entry when one of the statuses below (employment_status) changes for the employee. In essence, not all employees will have an entry in employment_status_history – those who don't have an entry are assumed to be active employees since their start date (i.e. employee_details.start_date).
When there is an entry in Employee_details.Employment_end_date it means the employee is inactive.
In the employment_status_history table, only statuses 8 and 9 (Permanent Retirement and Left Service) are inactive employment, and their end date is NULL. If there is an end_date for Left Service it means the employee has re-joined.
Do you just want to know the current status of the employee? Or track the history and count how many employees you have in each status over time? They are two very different problems.
If you want to track and report on history then it sounds like you need to create a transactional fact table from your employee data. It’s a fact table problem, not a dimension problem. Have a look at a conference talk that I gave about this topic here
http://sqlbits.com/Sessions/Event11/Data_Modeling_for_Analysis_Services_Cubes
The last demo (loan application) can be converted to your situation.
Think very carefully about exactly what it is you’re trying to measure in your fact table, and then model it accordingly.
If you just want to get the current status for the employee dimension then it’s just a SQL query problem.
I’d use a ROW_NUMBER() windowing function to find the last update for each employee. Left join that to the employee_details table, then do a case statement to check the different conditions:
- case when the EmploymentEndDate is not null then they're inactive
- when the StatusName is null then you know they're active
- when the status is in (8, 9) then inactive
- else active
Or something similar, depending on your exact business logic.
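Putting that together, a rough sketch (the table and column names are taken from the question above, and the status codes are assumptions, so adjust to your schema):

WITH LatestStatus AS
(
    SELECT  esh.employee_details_id
           ,esh.StatusName
           ,ROW_NUMBER() OVER (PARTITION BY esh.employee_details_id
                               ORDER BY esh.StatusStartDate DESC) AS rn
    FROM Employee_status_history esh
)
SELECT  ed.employee_details_id
       ,CASE
            WHEN ed.Employment_end_date IS NOT NULL THEN 'Inactive'
            WHEN ls.StatusName IS NULL THEN 'Active'
            WHEN ls.StatusName IN ('8', '9') THEN 'Inactive' -- assumed status codes
            ELSE 'Active'
        END AS EmploymentStatus
FROM Employee_details ed
    LEFT JOIN LatestStatus ls
        ON ls.employee_details_id = ed.employee_details_id
        AND ls.rn = 1;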
Thank you so much for explaining the logic. It is really helpful. My only challenge is figuring out how to insert a row WHEN NOT MATCHED BY SOURCE. Currently, your logic just updates the existing row, which is fine for the expiry date. Our business rules require that we have a row for "deleted" records, with that row's effective date and the standard current row expiry date of 12/31/9999.
Hi Sam
OK, so if I understand this right, you want to close off the current row if it is not matched by source, as the sample code above already does. However, for those updates you also want to insert an additional row for the closed item.
You can insert the new row using the outer ‘insert’ statement, which is fed from the ‘OUTPUT’ clause.
In the example above we're only returning the [Source].* fields to the OUTPUT clause, however you can also pass the [Target].* fields to it, and then in the INSERT statement you can choose either the source or target. In your example you'd want to use the target fields, so that you essentially clone the record that was just closed off.
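As a rough sketch, the OUTPUT clause could be extended like this (the Tgt_ aliases are illustrative; the target's values are exposed through the deleted pseudo-table, which holds the row as it was before being closed off):

OUTPUT $action AS Action
      ,[Source].*
      ,deleted.FirstName AS Tgt_FirstName
      ,deleted.LastName  AS Tgt_LastName
      ,deleted.Title     AS Tgt_Title

The outer INSERT statement then selects the Tgt_ columns (still filtered to Action = 'UPDATE') instead of the source columns.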
Hope that makes sense!
Alex
Hi Alex,
I hope you don’t mind my asking if you could send me a sample logic for doing the “insert the new row using the outer ‘insert’ statement, which is fed from the ‘OUTPUT’ clause.”
Thanks again!
Sam
Thanks Alex…I will give it a shot! Otherwise, your script simply trumps everything I have seen.
Hey Alex, this was a wonderful tutorial. I tried this but it still replaces the historic data with the current changes. Please help me with this; I did your second update too, but no luck.
I appreciate your response.
Thanks,
Kasun
Hi Kasun
I need to see a copy of your procs to diagnose – I’ll email you
Regards
Alex
Hi Friends,
I have one question.
While working with a SQL Merge statement in the Execute SQL Task component in SSIS, I have to compare the source and target data to insert or update the target tables. But here the source and target tables are in different databases.
So how do I manage both databases, and how do I write the SQL Merge query in that scenario?
Can anybody solve my problem?
Thanks in advance
Hi Supraja
All you need to do is include the database name in the source and/or target table names.
E.g. MyDB.dbo.MyTable instead of just MyTable
Regards
Alex
Hi Alex,
Great article!
The only thing I wanted to point out: your code will not work as intended when data changes to/from a NULL value, because of this:
WHEN MATCHED AND
(
    Target.FirstName <> Source.FirstName
    OR Target.LastName <> Source.LastName
    OR Target.Title <> Source.Title
)
This excerpt should be changed to something like:
WHEN MATCHED AND
(
    -- Neither is null, values are not equal
    NOT (Target.Field1 IS NULL OR Source.Field1 IS NULL)
    AND Target.Field1 <> Source.Field1
    -- Source xor target is null
    OR (Target.Field1 IS NULL OR Source.Field1 IS NULL)
    AND NOT (Target.Field1 IS NULL AND Source.Field1 IS NULL)
    -- OR ... Repeat for other columns
)
Please see http://stackoverflow.com/questions/1075142/how-to-compare-values-which-may-both-be-null-is-t-sql for more info; the link has some other ways of implementing this functionality (answer by WileCau).
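One of those alternative patterns can be sketched like this (not from the original post): an EXISTS/EXCEPT test, which treats NULLs as equal and so avoids the explicit IS NULL checks:

WHEN MATCHED AND EXISTS
(
    SELECT Source.FirstName, Source.LastName, Source.Title
    EXCEPT
    SELECT Target.FirstName, Target.LastName, Target.Title
)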
Yes you’re right, thanks for pointing it out.
In our DW implementations we never allow NULLs into the database, so we don’t have to worry about them, but good info for others who allow NULLs.
Alex
How do you approach this and what do you put instead of NULLs? I work with NHS data and I can’t imagine not having NULLs for things like DateOfBirth or DateOfDeath.
All dimensions have two fixed records with keys of 0 (n/a) and -1 (unknown). Date is no different. So if it’s a date dimension then it uses these. If it’s an attribute then 01/01/1900 or 31/12/9999, whichever is most appropriate for the attribute.
And re the effective date or IsRowCurrent, the simple answer is to always have both. So we implement ValidFrom, ValidTo and IsRowCurrent in every dimension.
What are the benefits of this 0/-1 approach (I assume it is '0'/'-1' in the case of chars)? This will definitely make the ETL process more complicated, as it is not clear how to differentiate between 0 (n/a) and -1 (unknown) if only a NULL value is available in the source table. And, for example, in the case of a genuine DateOfBirth in the 19th century it would be necessary to use 01/01/1800 instead.
Any particular reasons to use both IsRowCurrent and ValidFrom/ValidTo?
All dimensions should have integer surrogate keys. Two records are hard coded for each; with keys of 0 and -1. This should include the date dimension.
If a fact table has a foreign key to the 'Date of Death' dimension, then there are three possibilities:
- The patient has died and we know the date. In which case link to the date
- The patient has died and we don't know the date. In which case link to -1, the 'Unknown' member
- The patient is still living. In which case link to 0, the 'n/a' member
NULL doesn't differentiate between the 2nd and 3rd cases. And also, if you're pulling data into a cube, NULL foreign keys cause you all sorts of grief.
If the date is stored as an attribute and not a dimension, then we wouldn’t use -1/0, we would just fix the attribute to a specific date.
The reason that we recommend that you have both IsRowCurrent and ValidFrom/ValidTo is for convenience. Each are better for different purposes. For finding the current row, use the IsRowCurrent flag. For identifying a record on a specific date, you can use the dates. We design data warehouses for flexibility and simplicity by design, adding an IsRowCurrent flag takes negligible space, and gives the users/developers the best of both worlds. I don’t understand why there is still debate over this, when the simple answer is to use both. The question should really be, why NOT use both?
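To illustrate the convenience point, two typical lookups against the Customer dimension from the main post (the email and date values are illustrative):

-- Finding the current row: the flag is simplest
SELECT *
FROM Customer
WHERE Email = 'Jane@emailme.com'
AND IsRowCurrent = 1;

-- Identifying the row valid on a specific date: the dates are simplest
SELECT *
FROM Customer
WHERE Email = 'Jane@emailme.com'
AND ValidFrom <= '2011-06-15'
AND ValidTo > '2011-06-15';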
Alex,
Thank you for your reply. Somehow I could not reply to the most recent post.
I will definitely follow your approach.
BTW, all columns in the main Customer table in the attached code are NULLable (except for the PK). That is why I was confused. I didn't realise that the staging table has all NOT NULL columns.
BTW what is your view on EffectiveDate vs IsRowCurrent debate? http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/28/debunking-kimball-effective-dates.aspx
Hi Alex,
One known disadvantage of using the merge statement for loading data into the warehouse is that I have to have both databases on the same server, and I lose hardware scalability. What do you think about this? Is having the staging area and the data warehouse on the same server good practice?
Thanks,
Jose
Hi Jose
We need to be careful about which staging environment we’re talking about. In this case we would have two staging areas; the first is the results of the extract process, often called the landing area. Yes you’re right that this should be detached from the main DW as it’s logically a different component and has no business being in the same DW database. So this could be put on a different server.
However the staging area we're talking about in this post is not really a true staging area; it's a temporary area purely used to marshal data into the dimensions. Its purpose is to simplify the code and improve the performance of SCD handling. Each table is almost an identical clone of its corresponding dimension, and should be seen as purely an extension of the data flow pipeline. It wouldn't make sense to split this out onto a different server; the point is that we're trying to improve the performance of getting data into the dim, and moving this staging area to a different server would degrade performance.
I have a little case here that I hope to solve using your code with a twist.
Here is the case:
I am collecting data from several sources about companies, but let's use your customer approach.
I can get several rows the same day, or one row another day, for the same customer. I wish to use SCD2 to keep history, but some of the sources are not giving me data for all fields. I could get, let's say, 'N/A', since NULL is not allowed.
My leader wants me to give him:
a) if two or more rows are identical except for the date, the result should be one row with the oldest date
b) if one or more fields have changed, create a new SCD2 row with the change date as start date
c) if one or several fields in the new row have changed from a legal value to 'N/A', it should keep the latest legal value (from the previous row) for those fields
Here is the example data
Start data set
CustomerHistoryId CustomerNum CustomerName Planet ChangeDate
1 101 Anakin Skywalker Tatooine 14.03.2015 15:41
2 102 Yoda Coruscant 14.03.2015 15:41
3 103 Obi-Wan Kenobi Coruscant 24.03.2015 15:41
4 102 Yoda Coruscant 29.03.2015 15:41
5 102 Yoda Empty 03.04.2015 15:41
6 102 Yoda Empty 04.04.2015 15:41
7 103 Obi-Wan Kenobi Degobah 08.04.2015 15:41
8 102 Master Yoda Tatooine 09.04.2015 15:41
9 102 Empty Tatooine 10.04.2015 15:41
10 102 Master Yoda Tatooine 11.04.2015 15:41
End data set before making SCD2
CustomerHistoryId CustomerNum CustomerName Planet ChangeDate
1 101 Anakin Skywalker Tatooine 14.03.2015 15:41
2 102 Yoda Coruscant 14.03.2015 15:41
3 103 Obi-Wan Kenobi Coruscant 24.03.2015 15:41
7 103 Obi-Wan Kenobi Degobah 08.04.2015 15:41
8 102 Master Yoda Tatooine 09.04.2015 15:41
I am looking for SQL code to solve this 🙂
I hope you can spend a few minutes to help with this.
Thanks again
Regards Geir Forsmo, BI consultant, Evry Norway
Hi Geir
The code to generate a Type 2 SCD using merge is a lot more complicated than Type 1. I do have the code, and I use it on a daily basis, but I’ve intentionally not included it in the blog post as I don’t want anyone copying and pasting it without truly understanding the logic and functionality, so I provide the Type 1 code which is the starting point for you to build your own Type 2 statement.
But to get you started, here are a few pointers:
- You should include an additional 'IsRowCurrent' flag. That makes it much easier to limit the merge statement to only work with the latest version of each customer.
- Merge should only ever work on a single row per customer at a time. If a single customer has multiple rows in the source (e.g. multiple HistoryIds per customer in your example), they either need to be filtered down to one each, or processed one at a time, depending on the granularity of the Type 2 SCD that you need (you can use windowing functions such as ROW_NUMBER to achieve both of these – see the sketch below)
- Make use of the Output clause of the merge statement. You can use this to create the new Type 2 records.
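For the deduplication step, a rough sketch using the column names from your example (the staging table name is hypothetical):

-- Keep only the most recent row per customer before running the merge
WITH RankedStaging AS
(
    SELECT  *
           ,ROW_NUMBER() OVER (PARTITION BY CustomerNum
                               ORDER BY ChangeDate DESC) AS rn
    FROM StagingCustomerHistory
)
SELECT CustomerHistoryId, CustomerNum, CustomerName, Planet, ChangeDate
FROM RankedStaging
WHERE rn = 1;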
Hope that helps
Alex
Hi Alex, I have a target table with 1.5 million records and a few updates in the staging table. I used the merge statement below, and it keeps running for 15-20 mins and doesn't produce any results. I am wondering if it's because of the number of records in the target, or a problem with the merge statement. Any pointers would be a great help.
INSERT INTO [dbo].[COPY_DIM_PERSON]
( [VERSION_START_DATE]
,[VERSION_END_DATE]
,[VERSION_CURRENT_FLAG]
,[PERSON_OID]
,[BIRTH_DATE]
,[DECEASED_DATE]
,[MARITAL_STATUS]
,[SPOUSE_BIRTH_DATE]
,[GENDER]
,[PRIMARY_POSTAL_ADDRESS_REGION]
,[IMMIGRANT_FLAG]
,[IMMIGRANT_STATUS]
,[VISIBLE_MINORITY_FLAG]
,[VISIBLE_MINORITY]
,[ABORIGINAL_FLAG]
,[ABORIGINAL_BAND_NUMBER]
,[ABORIGINAL_REGISTRATION_NUMBER]
,[ABORIGINAL_RESERVE_STATUS]
,[FIRST_NAME]
,[OTHER_GIVEN_NAMES]
,[LAST_NAME]
,[SOCIAL_INSURANCE_NUMBER]
,[HEALTH_INSURANCE_NUMBER]
,[HUMAN_SERVICES_ID_NUMBER]
,[FAMILY_COMPOSITION]
,[FUNERAL_EXPENSE_FLAG]
,[SOURCE_SYSTEM_ID]
,[SOURCE_FILE_ID]
,[CREATE_PROCESS_ID]
,[CREATE_DATETIME]
,[UPDATE_PROCESS_ID]
,[UPDATE_DATETIME]
) SELECT
GETDATE() --Version Start Date
,'9999-12-31' --Version End Date
,'Y'
,PERSON_OID
,BIRTH_DATE
,DECEASED_DATE
,MARITAL_STATUS
,' ' --Spouse Birth Date
,GENDER
,' '
,IMMIGRANT_FLAG
,IMMIGRANT_STATUS
,ISNULL(VISIBLE_MINORITY_FLAG,'N')
,VISIBLE_MINORITY
,ABORIGINAL_FLAG
,ABORIGINAL_BAND_NUMBER
,ABORIGINAL_REGISTRATION_NUMBER
,ABORIGINAL_ON_RESERVE_STATUS
,ISNULL(FIRST_NAME,'Unknown')
,OTHER_GIVEN_NAMES
,ISNULL(LAST_NAME,'Unknown')
,SOCIAL_INSURANCE_NUMBER
,HEALTH_INSURANCE_NUMBER
,CONVERT(varbinary(20),HUMAN_SERVICES_ID_NUMBER,1)
,'' --Family Composition
,FUNERAL_EXPENSE_FLAG
,1 --Source System ID
,1 --Source File ID
,1 --Create Process ID
,GETDATE() --Create Date Time
,1 --UPDATE Process ID
,GETDATE() --Update DateTime
FROM (
MERGE [dbo].[COPY_DIM_PERSON] AS [Target]
USING [dbo].[SATG_DIM_PERSON] AS [Source]
ON Target.PERSON_OID = Source.PERSON_OID
AND Target.VERSION_CURRENT_FLAG = 'Y'
WHEN MATCHED AND
(
    Target.MARITAL_STATUS <> Source.MARITAL_STATUS
    OR Target.GENDER <> Source.GENDER
    OR Target.IMMIGRANT_FLAG <> Source.IMMIGRANT_FLAG
    OR Target.IMMIGRANT_STATUS <> Source.IMMIGRANT_STATUS
    --OR Target.VISIBLE_MINORITY_FLAG <> Source.VISIBLE_MINORITY_FLAG
    OR Target.VISIBLE_MINORITY <> Source.VISIBLE_MINORITY
    OR Target.ABORIGINAL_FLAG <> Source.ABORIGINAL_FLAG
    OR Target.ABORIGINAL_BAND_NUMBER <> Source.ABORIGINAL_BAND_NUMBER
    OR Target.ABORIGINAL_REGISTRATION_NUMBER <> Source.ABORIGINAL_REGISTRATION_NUMBER
    OR Target.ABORIGINAL_RESERVE_STATUS <> Source.ABORIGINAL_ON_RESERVE_STATUS
    OR ISNULL(Target.FIRST_NAME,'Unknown') <> Source.FIRST_NAME
    OR ISNULL(Target.OTHER_GIVEN_NAMES,'Unknown') <> Source.OTHER_GIVEN_NAMES
    OR ISNULL(Target.LAST_NAME,'Unknown') <> Source.LAST_NAME
    --OR Target.FAMILY_COMPOSITION <> Source.FAMILY_COMPOSITION
    OR Target.FUNERAL_EXPENSE_FLAG <> Source.FUNERAL_EXPENSE_FLAG
)
THEN UPDATE SET
    VERSION_CURRENT_FLAG = 'N'
    ,UPDATE_DATETIME = GETDATE()
    ,VERSION_END_DATE = GETDATE() - 1
WHEN NOT MATCHED BY TARGET
THEN INSERT (
[VERSION_START_DATE]
,[VERSION_END_DATE]
,[VERSION_CURRENT_FLAG]
,[PERSON_OID]
,[BIRTH_DATE]
,[DECEASED_DATE]
,[MARITAL_STATUS]
,[SPOUSE_BIRTH_DATE]
,[GENDER]
,[PRIMARY_POSTAL_ADDRESS_REGION]
,[IMMIGRANT_FLAG]
,[IMMIGRANT_STATUS]
,[VISIBLE_MINORITY_FLAG]
,[VISIBLE_MINORITY]
,[ABORIGINAL_FLAG]
,[ABORIGINAL_BAND_NUMBER]
,[ABORIGINAL_REGISTRATION_NUMBER]
,[ABORIGINAL_RESERVE_STATUS]
,[FIRST_NAME]
,[OTHER_GIVEN_NAMES]
,[LAST_NAME]
,[SOCIAL_INSURANCE_NUMBER]
,[HEALTH_INSURANCE_NUMBER]
,[HUMAN_SERVICES_ID_NUMBER]
,[FAMILY_COMPOSITION]
,[FUNERAL_EXPENSE_FLAG]
,[SOURCE_SYSTEM_ID]
,[SOURCE_FILE_ID]
,[CREATE_PROCESS_ID]
,[CREATE_DATETIME]
,[UPDATE_PROCESS_ID]
,[UPDATE_DATETIME]
) VALUES (
GETDATE() --Version Start Date
,'9999-12-31' --Version End Date
,'Y'
,Source.PERSON_OID
,Source.BIRTH_DATE
,Source.DECEASED_DATE
,Source.MARITAL_STATUS
,'' --Spouse Birth Date
,Source.GENDER
,'' --Primary Postal Address
,Source.IMMIGRANT_FLAG
,Source.IMMIGRANT_STATUS
,ISNULL(Source.VISIBLE_MINORITY_FLAG,'N')
,Source.VISIBLE_MINORITY
,Source.ABORIGINAL_FLAG
,Source.ABORIGINAL_BAND_NUMBER
,Source.ABORIGINAL_REGISTRATION_NUMBER
,Source.ABORIGINAL_ON_RESERVE_STATUS
,ISNULL(Source.FIRST_NAME,'Unknown')
,Source.OTHER_GIVEN_NAMES
,ISNULL(Source.LAST_NAME,'Unknown')
,Source.SOCIAL_INSURANCE_NUMBER
,Source.HEALTH_INSURANCE_NUMBER
,CONVERT(varbinary(20), Source.HUMAN_SERVICES_ID_NUMBER,1)
,'' --Family Composition
,Source.FUNERAL_EXPENSE_FLAG
,1 --Source System ID
,1 --Source File ID
,1 --Create Process ID
,GETDATE() --Create Date Time
,1 --UPDATE Process ID
,GETDATE() --Update DateTime
)
WHEN NOT MATCHED BY SOURCE AND Target.VERSION_CURRENT_FLAG = 'Y'
THEN UPDATE SET
    VERSION_CURRENT_FLAG = 'N'
    ,UPDATE_DATETIME = GETDATE()
    ,VERSION_END_DATE = GETDATE()
OUTPUT $action AS Action
      ,[Source].*
) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
AND PERSON_OID IS NOT NULL
;
I ran tests on this approach importing up to 5m rows from staging into the dimension, which contained up to 50m records. You can see a summary of the results at
https://purplefrogsystems.com/2012/07/msc-dissertation-performance-of-loading-scds-in-ssis/
And you can also download a full PDF of my analysis from the same link.
In my test of a similar scale to yours, mine completed in around 10 minutes.
Please note that the merge statement executes in a single transaction, so it is very heavy on TempDB, memory and the database log. So the first thing to check is whether your server has the necessary power, and whether it is configured correctly to handle this kind of workload.
I’d then test it on a smaller dataset to make sure the functionality works correctly, and then scale the dataset up, logging the duration every time, building up a timing model that will allow you to predict the duration taken for other data sizes. Then you can see how long your 1.5m test should take, and take it from there.
Thanks
Alex
Just a suggestion:
Have you tried commenting out all [Source] columns in OUTPUT?
Hi Pavel, when I comment out the source columns, it gives me a compilation error in the SELECT statement of the first insert (where we are closing a record and adding a similar record for Type 2). Any suggestions to overcome this? Thanks!!
In this case try leaving [Source].PERSON_OID instead of [Source].*.
It also looks like the comparison sign is missing from all the SOURCE and TARGET columns just before the first UPDATE.
Please let me know if this helps.
Nopes 🙁
After using [Source].PERSON_OID, same error but with the remaining fields.
Yeah I know, the comparison signs got removed when I pasted the code from SSMS to this blog. The code runs fine when I reduce the number of rows in the target table, but with a full load nothing happens for hours.
Creating an index on the joining column of the staging table helps make MERGE faster. My observation, can't guarantee 🙂
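For example, a sketch against the staging table in this thread (the index name is illustrative):

CREATE NONCLUSTERED INDEX IX_SATG_DIM_PERSON_OID
    ON dbo.SATG_DIM_PERSON (PERSON_OID);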
Thanks Alex!
I reduced my dataset to a few rows and the merge statement works like a charm. But with the full data set on my development server it's so bad – it's been an hour and it's still running. Seems like the IT infrastructure guys have got some work to do 🙂
Thanks again for pointing me in the right direction!!
Hi Alex, is there any way we can calculate how many updates and inserts are made by the merge statement while loading an SCD Type 2 dimension?
Thanks
Hi Alex,
What do you think of other methods to identify changes to existing records?
In the WHEN MATCHED part you use
Target.FirstName <> Source.FirstName OR … etc.
One method is to create a checksum field on both tables and compare that.
Another trick I saw was to use the INTERSECT statement to compare the tables.
Thanks for sharing your opinion.
Cheers,
Edgar Walther
Personally I’m not a fan of using checksums.
- There's always a chance of collisions, which would result in incorrect data in the dimension.
- They add an additional unnecessary column to the dim.
- The only benefit is saving lines of code. But if you automate the code then what's the benefit?
So I’m yet to be convinced of why checksums are a good idea.
Regarding INTERSECT, this would return records that exist and haven't changed. EXCEPT would return rows that don't exist, so could be used to identify new rows. But neither provides the same power as merge, which does both at once, as well as finding business keys that exist at the same time as checking whether their attributes have changed.
So yes you could make it work but I’d be very surprised if it offered more functionality, flexibility or performance than merge.
Alex
Thanks for your reply.
I thought that comparing just one field (the checksum) might be (much) faster than comparing all fields. Isn’t that true?
Thanks, Edgar
Don’t forget that you also have to take into account the cost of calculating the checksums. You’d have to test it to see if it makes a difference in your scenario.
What is the code for:
WHEN NOT MATCHED BY SOURCE AND Target.VERSION_CURRENT_FLAG = 'Y'
in the case where you receive only changed records (incremental load)?
With the sample, if a record is not in the new load, the CurrentFlag is updated to 'N' and the ValidTo date to yesterday's date.
But what I would like is for nothing to be changed when an existing record is in the target but not in the source.
Hi Marco
“WHEN NOT MATCHED BY SOURCE” does what you suggest, it identifies records that are in the target but not in the source, so you can mark them as deleted, delete them, update them etc.
If you don’t want to do anything in this scenario then just delete the whole WHEN NOT MATCHED BY SOURCE section of the query.
Thanks
Alex
Do SQL merge statements like this remain the best way to update a SCD, or has SQLServer2017 or 2019 brought anything new to the table?
Hi Gavin
In my (humble!) opinion this remains by far the easiest and best way of managing SCD loads. Nothing in new versions of SQL Server has changed this, even (in fact especially) if you move to a ‘big data’ ELT or ETLT model.
Alex
I haven't used merge statements for a while. I stumbled across your talk from SQL Bits and have been using this in some development work. I'm interested to know how you handle capturing row counts when having to upsert?
Given that you have to filter the OUTPUT to only return $Action = 'UPDATE', you lose the ability to capture the other actions!
Yes you’re right, so there’s a compromise to be made between functionality and logging.
If you want full row count logging you need to do multiple queries, and accept the performance hit.
Meh that’s a real shame :/ thanks for responding…
Alex, this is "my version of MERGE" and the tool I mentioned to you last time:
https://sqlplayer.net/2018/01/scd-type-1-type-2-in-merge-statement/
Great stuff! I used your template to implement this in BIML, together with the Kimball modelling Excel sheet (extended properties). Works like a charm. Thanks for sharing!
This is hands down the best explanation about the loading of warehouse dimension tables I have ever read. Kudos!!!
I’m getting the following error when I try to run this code: The target table ‘TableName’ of the INSERT statement cannot be on either side of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. Found reference constraint ‘FK__ForeignKey__TechC__3C34F16F’.
Do you know what is causing the error and how to work around it?
Hi Jordan
Merge has some restrictions when there are foreign key constraints on the table.
The easy option is to set the foreign keys to not be enforced, using the NOCHECK option.
Alex
Perfect. Thanks Alex!
Hi Alex,
How is the target table available in the subquery and not locked during the initial insert statement? Thanks!
Interesting and good question! The Merge statement handles this without any problems when only running a single Merge at a time against one table.
As rows can only be matched by one condition, they can only be either inserted or updated; the same row can't be updated more than once in the same query, meaning there are no issues.
However if you could have two merge statements running at the same time then you can easily run into a deadlock, which can be avoided with the HOLDLOCK hint.
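For example, a stripped-down sketch of the hint syntax, using the tables from the main post:

-- HOLDLOCK serialises concurrent merges against the same target table
MERGE Customer WITH (HOLDLOCK) AS [Target]
USING StagingCustomer AS [Source]
    ON Target.Email = Source.Email
    AND Target.IsRowCurrent = 1
WHEN MATCHED THEN
    UPDATE SET LastUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (FirstName, LastName, Title, DoB, Email, LastUpdated, IsRowCurrent, ValidFrom, ValidTo)
    VALUES (Source.FirstName, Source.LastName, Source.Title, Source.DoB, Source.Email,
            GETDATE(), 1, GETDATE(), '9999-12-31');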
Alex
Hi Alex,
How do you capture the updates and inserts in the MERGE statement?
Thanks
Hi Dave
You can use the Output clause to capture this, further details at:
https://www.sqlservercentral.com/articles/the-output-clause-for-the-merge-statements
Alex
Thanks Alex. I want to capture the count of new records inserted, the count of changed records inserted, and the count of existing records updated.
Can you please let me know how to capture these using OUTPUT and $action?
Hi Dave
Easier to explain in another post, so I’ve written it up with examples at
https://purplefrogsystems.com/2019/11/capturing-insert-and-update-counts-from-merge/
Hope that makes sense and helps
Alex
Thanks Alex. Can you please let me know where the statement INSERT INTO #MergeActions ([Action]) should be placed in an SCD2 merge statement like the one above?
Hi Dave
You can’t use this with the Type 2, as the OUTPUT clause is used by the Type 2 functionality.
You’d need to split the merge statement up into multiple statements instead of one if you want to get the row counts out.
Alex
Good blog, but I find those massive merge statements really hard to understand. I think I need to reread the post.
What do you think of this approach ? Does it accomplish the same thing ?
https://blog.cloudera.com/update-hive-tables-easy-way-2/
Thanks.
Yes, it is a very similar approach, the only real difference being the use of a union to accomplish the two stages of Type 2 instead of using the output clause.
With SCDs there has to be some coding or dev, it’s just finding an approach and tech that works for you. To minimise this we automate the generation of our merge statements from metadata.
Interesting option using a hash, which does work, but you do have to be very wary of collisions, where different values generate the same hash. This is a major problem when using a hash in this way, so be careful.