It is always good practice to do as much data preparation as close to the source as possible before importing or connecting data to your Power BI reports, but what about circumstances where this isn’t possible?
I had an issue recently where a third-party application had been updated and both the new and legacy versions were being used side-by-side. Logging data from both versions was being written to two separate Azure SQL databases.
The customer needed a Power BI report showing both old and new logging data sets as a single source. Had both databases been on the same SQL Server instance, I could have written a view with a cross-database join, imported that into Power BI and thought no more about it. However, the two sources being separate Azure SQL Databases, with no easy way to join the tables, caused an issue.
This is where the Merge and Append functions in Power Query come in.
The first step to solve my issue was to create a view in each of the Azure databases, identical in structure (you’ll see why later on), and import these into Power BI.
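As a sketch of that first step (the view, table and column names here are hypothetical, not the customer’s actual schema), each database gets a view exposing exactly the same column list:

```sql
-- In the legacy logging database (names are illustrative)
CREATE VIEW dbo.vLogging
AS
SELECT  LogId
       ,LogDate
       ,LogMessage
       ,'Legacy' AS SourceVersion
FROM    dbo.ApplicationLog;

-- In the new logging database: identical column names, types and order
CREATE VIEW dbo.vLogging
AS
SELECT  LogId
       ,LogDate
       ,LogMessage
       ,'New' AS SourceVersion
FROM    dbo.ApplicationLogV2;
```

A column like SourceVersion is also handy later, as it lets the report consumer tell the two systems apart once the queries have been combined.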
Now that these data sources (referred to as ‘Queries’ by Power BI) have both been imported into the data model, we have two options for combining them: ‘Append’ and ‘Merge’.
Although I didn’t use the ‘Merge’ function, I have included some information about it here as it is still relevant. ‘Merge’ is useful when you have columns from one source that you would like to add to another. The simplest way to think about it is that it works in the same way as a JOIN in SQL; in fact, when you enter the ‘Merge’ wizard there is a ‘Join Kind’ option:
This is how the ‘Merge’ function works:
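In SQL terms (table and column names here are hypothetical, purely for illustration), a Merge with a ‘Left Outer’ join kind behaves much like:

```sql
SELECT  o.OrderId
       ,o.OrderDate
       ,c.CustomerName   -- the column brought across from the second query
FROM    dbo.Orders o
LEFT OUTER JOIN dbo.Customers c
    ON  c.CustomerId = o.CustomerId;
```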
To solve my issue (explained above), I used the ‘Append’ function. While still combining two sources, it concatenates one query onto another; its SQL equivalent would be a UNION ALL. ‘Append’ gives you the option to combine as many tables as you wish, regardless of structure. If a column exists in one query but not another, the missing values will be filled with nulls where applicable, which is why it was important to create the two identical views at the very beginning.
This is how the ‘Append’ function works:
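Conceptually, the Append is a UNION ALL. You can’t actually run this across two separate Azure SQL Databases (which is the whole point of this post), but the equivalent SQL would be (database, view and column names hypothetical):

```sql
SELECT LogId, LogDate, LogMessage, SourceVersion FROM LegacyDb.dbo.vLogging
UNION ALL
SELECT LogId, LogDate, LogMessage, SourceVersion FROM NewDb.dbo.vLogging;
```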
When I appended the two sources, I chose the option to create the result ‘as New’, so I could hide the original two queries and be left with a nice new table (as seen below) which I can rename, tidy up and make a bit more user-friendly, ready for the report consumer to use.
As previously mentioned, data preparation should be done as close to the source as possible, but in situations where this is difficult or simply isn’t possible, it’s important to know about these Power BI functions and how you can use them to your advantage.
More information can be found here:
Append queries – Power Query | Microsoft Docs
Merge queries overview – Power Query | Microsoft Docs
Well after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.
During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions, specifically loading Type 2 SCDs using SSIS.
For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.
I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:
- PDF: Performance comparison of techniques to load Type 2 slowly changing dimensions in a Kimball style data warehouse using SSIS
The charts below show the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.
The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.
Full test results and analysis in the PDF download above.
My previous few posts show how using a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes the performance is equivalent to more traditional approaches.
Hope this is of use. If you want to know a bit more without reading the full 99 pages & 23k words (who could blame you?!), then my SQLBits talk video is now on-line here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [When this post was written] but should be up soon. Keep checking on the BI PASS Chapter website.
SQLBits X Video Now available
The video of my talk at SQLBits X is now available on the SQLBits website here. The talk was focused on presenting the results of my MSc Business Intelligence dissertation, comparing the performance of different methods of using SSIS to load data warehouse dimensions, specifically type 2 SCDs.
The talk also covers a comparison of the performance between traditional hard disks and solid state storage systems such as Fusion IO.
I then present a method of using the T-SQL Merge statement to automate a significant part of the ETL process.
You can find the code behind the demos on various recent Frog-Blog posts, and there is more to come, so look back soon!
- Introduction to T-SQL Merge Basics
- Using T-SQL Merge to Load SCD Dimensions
- Automating T-SQL Merge to Load SCD Dimensions
PASS BI Virtual Chapter Talk
For those that would rather hear the talk presented live, or want to ask any questions, please join me at a repeat of this talk over Live Meeting for the PASS BI Virtual Chapter on Thursday 28th June, at 5pm UK time, 12pm EDT (US). You can find the details on the PASS BI chapter website here
This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.
- Post 1: Introduction to T-SQL merge basics
- Post 2: Using T-SQL merge to load data warehouse dimensions
In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk, this post is for your benefit and is a result of the feedback and requests from you guys.
Why automate merge?
As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs – slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc which can save a considerable amount of time writing the equivalent functionality in SSIS. In the next instalment of this series I’ll be discussing its performance compared to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it’s released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.
If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.
The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences on your dimension data.
The easiest way to avoid this complexity and simplify the process is not to write merge statements yourself, but to let an automated procedure do it for you – simples!
The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.
Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.
OK, so how do we do this?
First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.
CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO
CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255)      NOT NULL,
    [FirstName]     [varchar](50)       NOT NULL,
    [LastName]      [varchar](50)       NOT NULL,
    [DoB]           [date]              NOT NULL,
    [Sex]           [char](1)           NOT NULL,
    [MaritalStatus] [varchar](10)       NOT NULL,
    [FirstCreated]  [date]              NOT NULL,
    [IsRowCurrent]  [bit]               NOT NULL,
    [ValidFrom]     [datetime]          NOT NULL,
    [ValidTo]       [datetime]          NOT NULL,
    [LastUpdated]   [datetime]          NOT NULL,
    CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED ([CustomerKey] ASC)
)
GO
CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255) NOT NULL,
    [FirstName]     [varchar](50)  NOT NULL,
    [LastName]      [varchar](50)  NOT NULL,
    [DoB]           [date]         NOT NULL,
    [Sex]           [char](1)      NOT NULL,
    [MaritalStatus] [varchar](10)  NOT NULL,
    [FirstCreated]  [date]         NOT NULL
)
So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.
We then need to store the details of each field, i.e. how each field should be interpreted: is it a primary key, business key, type 0 or 1, or an audit field? We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information; however, I prefer to use the extended properties of the fields.
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'PK',    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'CustomerKey'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'BK',    @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'Email'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'1',     @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'FirstName'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'1',     @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'LastName'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'1',     @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'DoB'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'1',     @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'Sex'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'1',     @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'MaritalStatus'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'1',     @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'FirstCreated'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'Audit', @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'ValidFrom'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'Audit', @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'ValidTo'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'Audit', @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'IsRowCurrent'
EXEC sys.sp_addextendedproperty @name=N'SCD', @value=N'Audit', @level0type=N'SCHEMA', @level0name=N'Dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN', @level2name=N'LastUpdated'
Or, of course, you can just enter the extended properties manually using SSMS.
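A nice side effect of this approach is that changing a column’s SCD behaviour later is just a metadata change. sp_updateextendedproperty modifies an existing extended property, so switching MaritalStatus to type 2 (once your proc supports type 2s) would be a sketch like this:

```sql
EXEC sys.sp_updateextendedproperty
     @name       = N'SCD'
    ,@value      = N'2'        -- was N'1'
    ,@level0type = N'SCHEMA', @level0name = N'dim'
    ,@level1type = N'TABLE',  @level1name = N'Customer'
    ,@level2type = N'COLUMN', @level2name = N'MaritalStatus'
```

No code changes, no redeployment: the next time the automation proc runs, it picks up the new metadata.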
The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.
The first stage is to examine the structure of merge.
MERGE [DIMENSION TABLE] AS Target
USING [STAGING TABLE] AS Source
    ON [LIST OF BUSINESS KEY FIELDS]
WHEN MATCHED AND Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
    THEN UPDATE SET [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
WHEN NOT MATCHED
    THEN INSERT [LIST OF ALL FIELDS]
    VALUES Source.[LIST OF ALL FIELDS]
The bracketed items in capitals are the details specific to the dimension; the rest is the skeleton of the statement. It’s these placeholders which we need to retrieve from the metadata of the dimension in order to create the full merge statement.
We can retrieve the extended properties using the sys.extended_properties catalog view. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields and so on. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the World a better place.
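As a sketch of the metadata query behind one of those cursors (this is my own illustration of the idea, not necessarily the exact query in the proc), listing each column of dim.Customer with its SCD type looks like this:

```sql
SELECT  c.name AS ColumnName
       ,CAST(ep.value AS varchar(10)) AS SCDType
FROM    sys.extended_properties ep
JOIN    sys.columns c
    ON  c.object_id = ep.major_id
   AND  c.column_id = ep.minor_id
WHERE   ep.class    = 1               -- column-level properties
  AND   ep.name     = 'SCD'
  AND   ep.major_id = OBJECT_ID('dim.Customer')
ORDER BY c.column_id;
```

Filtering on a specific SCDType value (e.g. '1' or 'BK') gives you exactly the field list needed for each placeholder in the merge skeleton above.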
I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:
It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical; some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not. The critical thing is that all of your fields, where they exist, should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.
There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing out the statement. If you only want to use this to automate the development process then this allows you to do just that; you can then copy and paste the resulting statement into SSIS or into a stored proc.
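Usage might look something like this (the proc name and parameter list here are illustrative only; check the download for the real signature):

```sql
-- Print the generated merge statement without running it
EXEC dbo.up_BuildMerge @SchemaName = 'dim', @TableName = 'Customer', @Execute = 0

-- Or generate and execute it in one go
EXEC dbo.up_BuildMerge @SchemaName = 'dim', @TableName = 'Customer', @Execute = 1
```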
The result: automated generation of a T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how you can expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.
Download the SQL scripts here
In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.
In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…
What is a Slowly Changing Dimension (SCD)?
If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place; I’m assuming a moderate level of experience with SCDs here, so check out Wikipedia for some background. In short, they manage the tracking of attribute history in dimensional data warehouses.
Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.
- Type 0 – Ignore updates
- Type 1 – Only keep latest version
- Type 2 – Track history by creating a new row
Type 2 is commonly stored in a fashion similar to this.
Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.
That’s enough of that, let’s get on with doing this using Merge.
Using Merge to load SCD
The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.
We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance it as we go. First, let’s figure out what logic we want to perform:
- If the record doesn’t exist, create it
- If the record does exist
- Type 0 fields – ignore
- Type 1 fields – update fields
- Type 2 fields – terminate existing record, insert a new record
- If the record exists in the dimension, but not in the updated source file – terminate record
The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.
The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. Firstly we terminate the existing row, then we insert a new row.
The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.
We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.
Let’s look at an example. Download the code here which will create the necessary tables and data to work on.
Main Merge Statement
We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:
- We include IsRowCurrent in the joining clause. We only ever want to update the current records, not the history.
- DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a type 1 change; if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 change later on.
- The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated)
MERGE Customer AS [Target]
USING StagingCustomer AS [Source]
    ON  Target.Email = Source.Email
    AND Target.IsRowCurrent = 1
WHEN MATCHED AND
    (   Target.FirstName <> Source.FirstName
     OR Target.LastName  <> Source.LastName
     OR Target.Title     <> Source.Title
    )
    THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated  = GETDATE()
        ,ValidTo      = GETDATE()
WHEN NOT MATCHED BY TARGET
    THEN INSERT
    (    FirstName
        ,LastName
        ,Title
        ,DoB
        ,Email
        ,LastUpdated
        ,IsRowCurrent
        ,ValidFrom
        ,ValidTo
    )
    VALUES
    (    Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,Source.Email
        ,GETDATE()     --LastUpdated
        ,1             --IsRowCurrent
        ,GETDATE()     --ValidFrom
        ,'9999-12-31'  --ValidTo
    )
WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
    THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated  = GETDATE()
        ,ValidTo      = GETDATE()
The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.
The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.
The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.
*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent=1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.
We then add an OUTPUT clause to the end of the statement
OUTPUT $action AS Action ,Source.*
The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $action as an extra field which will identify, for each row, whether it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATEs, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*
We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.
INSERT INTO Customer
(    FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,LastUpdated
    ,IsRowCurrent
    ,ValidFrom
    ,ValidTo
)
SELECT
     FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,GETDATE()     --LastUpdated
    ,1             --IsRowCurrent
    ,GETDATE()     --ValidFrom
    ,'9999-12-31'  --ValidTo
FROM
(
    MERGE Customer AS [Target]
    USING StagingCustomer AS [Source]
        ON  Target.Email = Source.Email
        AND Target.IsRowCurrent = 1
    WHEN MATCHED AND
        (   Target.FirstName <> Source.FirstName
         OR Target.LastName  <> Source.LastName
         OR Target.Title     <> Source.Title
        )
        THEN UPDATE SET
             IsRowCurrent = 0
            ,LastUpdated  = GETDATE()
            ,ValidTo      = GETDATE()
    WHEN NOT MATCHED BY TARGET
        THEN INSERT
        (    FirstName
            ,LastName
            ,Title
            ,DoB
            ,Email
            ,LastUpdated
            ,IsRowCurrent
            ,ValidFrom
            ,ValidTo
        )
        VALUES
        (    Source.FirstName
            ,Source.LastName
            ,Source.Title
            ,Source.DoB
            ,Source.Email
            ,GETDATE()     --LastUpdated
            ,1             --IsRowCurrent
            ,GETDATE()     --ValidFrom
            ,'9999-12-31'  --ValidTo
        )
    WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
        THEN UPDATE SET
             IsRowCurrent = 0
            ,LastUpdated  = GETDATE()
            ,ValidTo      = GETDATE()
    OUTPUT $action AS Action, [Source].*
) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
  AND Email IS NOT NULL;
Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.
So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.
We have two options here:
- Update all historical records to the new value
- Update only the current record to the new value
These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.
This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.
UPDATE C
SET  DoB         = SC.DoB
    ,LastUpdated = GETDATE()
FROM Customer C
INNER JOIN StagingCustomer SC
    ON  C.Email = SC.Email
    --AND C.IsRowCurrent = 1 --Optional
    AND C.DoB <> SC.DoB
This block of code updates the Type 1 attributes (in this case, DoB). The commented-out IsRowCurrent check is optional, depending on whether you want to update only the current record or all records.
So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.
I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.
This is now my preferred approach to loading Type 2 SCDs, slightly faster methods may be available, but as we’ll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.
Long live the Merge statement!
Update: 9th Nov 2019
Added a follow-up post to demonstrate how to capture insert/update/delete counts as outputs from the merge statement, which you can read here.
A number of Frog-Blog posts over the next couple of months are going to make heavy use of the awesome SQL Server MERGE statement, introduced in SQL Server 2008. I thought it best to write an introduction post to provide the basics and groundwork for future posts.
So what does the Merge statement do?
Merge upserts data into a table. If, like me, you find the word ‘upsert’ more than a little ridiculous, we can scrap it in favour of a slightly more sensible description: it both inserts and updates data at the same time. Much better.
How does it do it?
In a similar way to an UPDATE statement, we need to provide a source table and a destination table. We then give it a method of matching rows between them, usually the primary key or business/natural key (this can use multiple fields).
We can then specify a number of actions depending on whether a match is found or not.
If a match is found we may want to update the existing record.
If a match isn’t found then it’s likely that we’ll want to insert a new record.
What’s the syntax?
First we set the tables and the joining key(s)
MERGE [Destination Table] AS Target
USING [Source Table] AS Source
    ON Target.KeyField = Source.KeyField
Then we define what we want to happen if we find a match (optional)

WHEN MATCHED
If you want, you can specify other conditions, such as checking whether anything has changed
AND (   Target.Field1 <> Source.Field1
     OR Target.Field2 <> Source.Field2
    )
We can then perform an action. i.e. an update, delete, etc.
THEN UPDATE SET
     Field1 = Source.Field1
    ,Field2 = Source.Field2
Then we define what we want to happen if we don’t find a matching record (optional)
WHEN NOT MATCHED
In this case, we want to insert the new record
THEN INSERT
    ( KeyField, Field1, Field2 )
VALUES
    ( Source.KeyField, Source.Field1, Source.Field2 );
As this is the last part of the statement, we have to terminate the merge with a semicolon.
Time for an example
We’ve got a customer table, and we receive an updated set of customer details which we want to import into our table. Assuming we’ve imported the new details into a staging table, we can use merge to do the work for us.
Use this script to create the necessary tables & data.
Our existing Customer table should look like this
And our staging table (new customer data) should look like
Firstly, let’s define the logic that we’re looking for.
- Our new customer data doesn’t include the CustomerID, so we can’t match on the primary key. Instead we’ll use the email address as the key.
- If the customer already exists then we want to check whether the fields are different; if they are, update the customer record.
- If the customer doesn’t exist then insert them as a new customer record.
- The Customer table contains audit fields which must be maintained. DTInserted should contain the datetime when the record was first created. DTUpdated should be updated whenever any of the data in a row is changed.
This is all performed using the following statement
MERGE Customer AS [Target]
USING StagingCustomer AS [Source]
    ON Target.Email = Source.Email
WHEN MATCHED AND
    (   Target.FirstName <> Source.FirstName
     OR Target.LastName  <> Source.LastName
     OR Target.Title     <> Source.Title
     OR Target.DoB       <> Source.DoB
    )
    THEN UPDATE SET
         FirstName = Source.FirstName
        ,LastName  = Source.LastName
        ,Title     = Source.Title
        ,DoB       = Source.DoB
        ,DTUpdated = GETDATE()
WHEN NOT MATCHED
    THEN INSERT
    (    Email
        ,FirstName
        ,LastName
        ,Title
        ,DoB
        ,DTInserted
        ,DTUpdated
    )
    VALUES
    (    Source.Email
        ,Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,GETDATE()
        ,GETDATE()
    );
It’s a long statement and looks complex at first glance, but as soon as you break it down it’s remarkably simple and easy to write.
If you run this, you should see 4 rows affected – 2 new rows inserted (Jack & Mary) and 2 rows updated (Phil & Jane). The new data has been inserted/updated into the Customer table and the audit fields have been kept up to date.
Note however that (if you look at the data in the sample scripts) Sarah & Andy exist in the Customer table but are missing from the new customer list in the update file. With the code above they just stay as they are in the Customer table; we assume they haven’t changed. What if we wanted to identify these customers and flag them as inactive, or even delete them?
If we look at the ‘WHEN NOT MATCHED’ line, what this is actually doing is defaulting to ‘WHEN NOT MATCHED BY TARGET’. i.e. if you don’t find the record in the target table then do something.
We can add an extra section for ‘WHEN NOT MATCHED BY SOURCE’ which will allow us to specify an action when a row in our Customer table doesn’t have a matching record in the source staging data.
We can add:
WHEN NOT MATCHED BY SOURCE THEN DELETE
Which will simply delete the row in the Customer table, or we could update the record and terminate it using something like this.
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET IsActive = 0
Don’t forget that the entire merge statement needs to be terminated with a semicolon, and we should also update the audit field. We also only want to update the record once, so we check that the customer isn’t already terminated. We therefore end up with:
MERGE Customer AS [Target]
USING StagingCustomer AS [Source]
    ON Target.Email = Source.Email
WHEN MATCHED AND
    (   Target.FirstName <> Source.FirstName
     OR Target.LastName  <> Source.LastName
     OR Target.Title     <> Source.Title
     OR Target.DoB       <> Source.DoB
    )
    THEN UPDATE SET
         FirstName = Source.FirstName
        ,LastName  = Source.LastName
        ,Title     = Source.Title
        ,DoB       = Source.DoB
        ,DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET
    THEN INSERT
    (    Email
        ,FirstName
        ,LastName
        ,Title
        ,DoB
        ,IsActive
        ,DTInserted
        ,DTUpdated
    )
    VALUES
    (    Source.Email
        ,Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,1
        ,GETDATE()
        ,GETDATE()
    )
WHEN NOT MATCHED BY SOURCE AND Target.IsActive = 1
    THEN UPDATE SET
         IsActive  = 0
        ,DTUpdated = GETDATE()
;
Which leaves our Customer table looking like this
There are many variations and uses for this, we’ll explore its use in data warehouse dimension loading over the next couple of posts.
For now, Frog-Blog out.
[UPDATE] The next blog post in this series is now available: using T-SQL Merge to load Type 2 SCDs into a data warehouse dimension.