0845 643 64 63

Merge

MSc Dissertation – Performance of Loading SCDs in SSIS

Well after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.

During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions. Specifically loading Type 2 SCDs using SSIS.

For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.

I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:

The charts below shows the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.

The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.

Full test results and analysis in the PDF download above.

My previous few posts show how using a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes the performance is equivalent to more traditional approaches.

Hope this is of use. If you want to know a bit more without reading the full 99 pages & 23k words (who could blame you?!), then my SQLBits talk video is now on-line here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [When this post was written] but should be up soon. Keep checking on the BI PASS Chapter website.

Frog-Blog-Out

SQLBits X Video available

SQLBits X Video Now available

The video of my talk at SQLBits X is now available on the SQLBits website here. The talk was focused on presenting the results of my MSc Business Intelligence dissertation, comparing the performance of different methods of using SSIS to load data warehouse dimensions, specifically type 2 SCDs.

The talk also covers a comparison of the performance between traditional hard disks and solid state storage systems such as Fusion IO.

I then present a method of using the T-SQL Merge statement to automate a significant part of the ETL process.

You can find the code behind the demos on various recent Frog-Blog posts, and there is more to come, so look back soon!

 

PASS BI Virtual Chapter Talk

For those that would rather hear the talk presented live, or want to ask any questions, please join me at a repeat of this talk over Live Meeting for the PASS BI Virtual Chapter on Thursday 28th June, at 5pm UK time, 12pm EDT (US). You can find the details on the PASS BI chapter website here

Automating T-SQL Merge to load Dimensions (SCD)

This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.

In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk, this post is for your benefit and is a result of the feedback and requests from you guys.

Why automate merge?

As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs – slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc which can save a considerable mount of time writing the equivalent functionality in SSIS. In the next installment of this series I’ll be discussing the performance of it compared to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it’s released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.

If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.

The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences on your dimension data.

The easiest way to avoid this complexity and simplify the process is to not write merge statements, but let an automated procedure to it for you – Simples!.

The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.

Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.

OK, so how do we do this?

First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.

CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO

CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255)      NOT NULL,
    [FirstName]     [varchar](50)       NOT NULL,
    [LastName]      [varchar](50)       NOT NULL,
    [DoB]           [date]              NOT NULL,
    [Sex]           [char](1)           NOT NULL,
    [MaritalStatus] [varchar](10)       NOT NULL,
    [FirstCreated]  [date]              NOT NULL,
    [IsRowCurrent]  [bit]               NOT NULL,
    [ValidFrom]     [datetime]          NOT NULL,
    [ValidTo]       [datetime]          NOT NULL,
    [LastUpdated]   [datetime]          NOT NULL
 CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED 
(
	[CustomerKey] ASC
))
GO

CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255)  NOT NULL,
    [FirstName]     [varchar](50)   NOT NULL,
    [LastName]      [varchar](50)   NOT NULL,
    [DoB]           [date]          NOT NULL,
    [Sex]           [char](1)       NOT NULL,
    [MaritalStatus] [varchar](10)   NOT NULL,
    [FirstCreated]  [date]          NOT NULL
)

So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.

We then need to store the details of each field. i.e. how should each field be interpreted – is it a primary key, business, key, type 0 or 1, or an audit field. We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information, however I prefer to use the extended properties of the fields.

EXEC sys.sp_addextendedproperty @level2name=N'CustomerKey',  @value=N'PK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Email',        @value=N'BK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstName',    @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastName',     @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'DoB',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Sex',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'MaritalStatus',@value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstCreated', @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidFrom',    @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidTo',      @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'IsRowCurrent', @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastUpdated',  @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'

Or you can obviously just enter the extended property manually using SSMS

The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.

Automating Merge

The first stage is to examine the structure of merge.

   MERGE   [DIMENSION TABLE]  as Target
   USING   [STAGING TABLE]    as Source
      ON   [LIST OF BUSINESS KEY FIELDS]
   WHEN MATCHED AND
         Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
      THEN UPDATE SET
         [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
   WHEN NOT MATCHED THEN INSERT
         [LIST OF ALL FIELDS]
      VALUES
         Source.[LIST OF ALL FIELDS]

The text in black is the skeleton of the statement, with the text in red being the details specific to the dimension. It’s these red items which we need to retrieve from the metadata of the dimension in order to create the full merge statement.

We can retrieve the extended properties using the sys.extended_properties DMV. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields etc. etc. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the World a better place.

I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:

It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical, some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not; the critical thing is that all of your fields, if they do exist, should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.

There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing out the statement. If you only want to use this to automate the development process then this allows you to do just that, you can then just copy and paste the resulting statement into SSIS or into a stored proc.

Result

The automated generation of T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how you can expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.

Download the SQL scripts here
 
Frog-Blog Out

Using T-SQL Merge to load Data Warehouse dimensions

In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.

In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…

What is a Slowly Changing Dimension (SCD)?

If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place, I’m assuming a moderate level of experience of SCDs here, check out Wikipedia for some background, but in short, they manage the tracking of attribute history in dimensional data warehouses.

Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.

  • Type 0 – Ignore updates
  • Type 1 – Only keep latest version
  • Type 2 – Track history by creating a new row

Type 2 is commonly stored in a fashion similar to this.

Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.

That’s enough of that, let’s get on with doing this using Merge

Using Merge to load SCD

The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.

We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance as we go. First lets figure out what logic we want to perform

  • If the record doesn’t exist, create it
  • If the record does exist
    • Type 0 fields – ignore
    • Type 1 fields – update fields
    • Type 2 fields – terminate existing record, insert a new record
  • If the record exists in the dimension, but not in the updated source file – terminate record

The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.

The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. Firstly we terminate the existing row then we have to insert a new row.

The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.

We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.

Let’s look at an example. Download the code here which will create the necessary tables and data to work on.

Main Merge Statement

We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:

  • We include IsRowCurrent into the joining clause. We only ever want to update the current records, not the history.
  • DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a type 1 change, if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 later on
  • The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated)
 MERGE Customer        AS [Target]
 USING StagingCustomer AS [Source]
    ON Target.Email        = Source.Email
   AND Target.IsRowCurrent = 1
 WHEN MATCHED AND
     (
          Target.FirstName <> Source.FirstName
       OR Target.LastName  <> Source.LastName
       OR Target.Title     <> Source.Title
     )
     THEN UPDATE SET
        IsRowCurrent     = 0
       ,LastUpdated      = GETDATE()
       ,ValidTo          = GETDATE()
 WHEN NOT MATCHED BY TARGET
     THEN INSERT (
         FirstName
        ,LastName
        ,Title
        ,DoB
        ,Email
        ,LastUpdated
        ,IsRowCurrent
        ,ValidFrom
        ,ValidTo
       ) VALUES (
         Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,Source.Email
        ,GETDATE()      --LastUpdated
        ,1              --IsRowCurrent
        ,GETDATE()      --ValidFrom
        ,'9999-12-31'   --ValidTo
       )
 WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
     THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated  = GETDATE()
        ,ValidTo      = GETDATE()

The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.

The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.

The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.

*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent=1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.

We then add an OUTPUT clause to the end of the statement

 OUTPUT $action AS Action
       ,Source.*

The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $Action as an extra field which will identify, for each row, whether

it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATES, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*

We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.

INSERT INTO Customer
   ( FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,LastUpdated
    ,IsRowCurrent
    ,ValidFrom
    ,ValidTo
   )
SELECT
     FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,GETDATE()    --LastUpdated
    ,1            --IsRowCurrent
    ,GETDATE()    --ValidFrom
    ,'9999-12-31' --ValidTo
FROM (
  MERGE Customer        AS [Target]
  USING StagingCustomer AS [Source]
     ON Target.Email        = Source.Email
    AND Target.IsRowCurrent = 1
  WHEN MATCHED AND
      (
           Target.FirstName <> Source.FirstName
        OR Target.LastName  <> Source.LastName
        OR Target.Title     <> Source.Title
      )
      THEN UPDATE SET
         IsRowCurrent     = 0
        ,LastUpdated      = GETDATE()
        ,ValidTo          = GETDATE()
  WHEN NOT MATCHED BY TARGET
      THEN INSERT (
          FirstName
         ,LastName
         ,Title
         ,DoB
         ,Email
         ,LastUpdated
         ,IsRowCurrent
         ,ValidFrom
         ,ValidTo
        ) VALUES (
          Source.FirstName
         ,Source.LastName
         ,Source.Title
         ,Source.DoB
         ,Source.Email
         ,GETDATE()      --LastUpdated
         ,1              --IsRowCurrent
         ,GETDATE()      --ValidFrom
         ,'9999-12-31'   --ValidTo
        )
  WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
      THEN UPDATE SET
          IsRowCurrent = 0
         ,LastUpdated  = GETDATE()
         ,ValidTo      = GETDATE()
  OUTPUT $action AS Action
        ,[Source].*
) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
  AND Email IS NOT NULL
;

Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.

So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.

We have two options here;

  • Update all historical records to the new value
  • Update only the current record to the new value

These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.

This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.

  UPDATE C
      SET DoB         = SC.DoB
         ,LastUpdated = GETDATE()
  FROM Customer C
     INNER JOIN StagingCustomer SC
          ON C.Email        =  SC.Email
       --AND C.IsRowCurrent =  1      --Optional
         AND C.DoB          <> SC.DoB

This block of code updates the Type 1 attributes (in this case, DoB). The line 7 (the IsRowCurrent) check is optional depending on whether you only want to update current or all records.

So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.

I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.

This is now my preferred approach to loading Type 2 SCDs, slightly faster methods may be available, but as we’ll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.

Long live the Merge statement!

Frog-Blog Out

Introduction to T-SQL Merge Basics

A number of Frog-Blog posts over the next couple of months are going to make heavy use of the awesome SQL Server MERGE statement, introduced in SQL Server 2008. I thought it best to write an introduction post to provide the basics and groundwork for future posts.

So what does the Merge statement do?

Merge upserts data into a table. If, like me, you find the word ‘upsert’ more than a little ridiculous, we can scrap it in favour of a slightly more sensible description; It both inserts and updates data at the same time. Much better.

How does it do it?

In a similar way in which we’d use an UPDATE statement, we need to provide a source table and a destination table. We then give it a method of matching rows between them, usually the primary key or business/natural key (this can use multiple fields).

We can then specify a number of actions depending on whether a match is found or not.

If a match is found we may want to update the existing record.

If a match isn’t found then it’s likely that we’ll want to insert a new record.

What’s the syntax?

First we set the tables and the joining key(s)

MERGE [Destination Table] AS Target
   USING [Source Table] AS Source
     ON Target.KeyField = Source.KeyField

Then we define what we want to happen if we find a match (optional)

  WHEN MATCHED

If you want you can specify other conditions, such as checking whether anything has changed

     AND  (
         Target.Field1 <> Source.Field1
         OR Target.Field2 <> Source.Field2
         )

We can then perform an action. i.e. an update, delete, etc.

  THEN UPDATE SET
        Field1 = Source.Field1
      , Field2 = Source.Field2

Then we define what we want to happen if we don’t find a matching record (optional)

   WHEN NOT MATCHED

In this case, we want to insert the new record

      THEN INSERT (
             KeyField
           , Field1
           , Field2
        ) VALUES (
             Source.KeyField
           , Source.Field1
           , Source.Field2
      );

As this is the last part of the statement we have to terminate the merge with a semi colon.

Time for an example

We’ve got a customer table, and we receive an updated set of customer details which we want to import into our table. Assuming we’ve imported the new details into a staging table, we can use merge to do the work for us.

Use this script to create the necessary tables & data.

Our existing Customer table should look like this

And our staging table (new customer data) should look like

Firstly lets define the logic that we’re looking for.

  • Our new customer data doesn’t include the CustomerID, so we can’t match on the primary key. Instead we’ll use the email address as the key.
  • If the customer already exists then we want to check whether the fields are different, if they are then update the customer record.
  • If the customer doesn’t exist then insert them as a new customer record.
  • The Customer table contains audit fields which must be maintained. DTInserted should contain the datetime when the record was first created. DTUpdated should be updated whenever any of the data in a row is changed.

This is all performed using the following statement

MERGE Customer        AS [Target]
USING StagingCustomer AS [Source]
   ON Target.Email = Source.Email
WHEN MATCHED AND
    (
         Target.FirstName <> Source.FirstName
      OR Target.LastName  <> Source.LastName
      OR Target.Title     <> Source.Title
      OR Target.DoB       <> Source.DoB
    )
    THEN UPDATE SET
       FirstName = Source.FirstName
      ,LastName  = Source.LastName
      ,Title     = Source.Title
      ,DoB       = Source.DoB
      ,DTUpdated = GETDATE()
WHEN NOT MATCHED
    THEN INSERT (
        Email
       ,FirstName
       ,LastName
       ,Title
       ,DoB
       ,DTInserted
       ,DTUpdated
      ) VALUES (
        Source.Email
       ,Source.FirstName
       ,Source.LastName
       ,Source.Title
       ,Source.DoB
       ,GETDATE()
       ,GETDATE()
      );

It’s a long statement and looks complex at first glance, but as soon as you break it down it’s remarkably simple and easy to write.

If you run this, you should see 4 rows affected – 2 new rows inserted (Jack & Mary) and 2 rows updated (Phil & Jane). The new data has been inserted/updated into the Customer table and the audit fields have been kept up to date.

Note however that (if you look at the data in the sample scripts) Sarah & Andy exist in the Customer table but are missing from the new customer list in the update file. With the code above they just stay as they are in the Customer table, we assume they haven’t changed. What if we wanted to identify these customers and flag them as inactive or even delete them?

If we look at the ‘WHEN NOT MATCHED’ line, what this is actually doing is defaulting to ‘WHEN NOT MATCHED BY TARGET’. i.e. if you don’t find the record in the target table then do something.
We can add an extra section for ‘WHEN NOT MATCHED BY SOURCE’ which will allow us to specify an action when a row in our Customer table doesn’t have a matching record in the source staging data.

We can add:

WHEN NOT MATCHED BY SOURCE
    THEN DELETE

Which will simply delete the row in the Customer table, or we could update the record and terminate it using something like this.

WHEN NOT MATCHED BY SOURCE
    THEN UPDATE SET
        IsActive = 0

Don’t forget that the entire merge statement needs to be terminated with a semi colon, and we should also update the audit field. We also only want to update the record once, so check that the customer isn’t already terminated. We therefore end up with:

MERGE Customer        AS [Target]
USING StagingCustomer AS [Source]
   ON Target.Email = Source.Email
WHEN MATCHED AND
    (
         Target.FirstName <> Source.FirstName
      OR Target.LastName  <> Source.LastName
      OR Target.Title     <> Source.Title
      OR Target.DoB       <> Source.DoB
    )
    THEN UPDATE SET
       FirstName = Source.FirstName
      ,LastName  = Source.LastName
      ,Title     = Source.Title
      ,DoB       = Source.DoB
      ,DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET
    THEN INSERT (
        Email
       ,FirstName
       ,LastName
       ,Title
       ,DoB
       ,IsActive
       ,DTInserted
       ,DTUpdated
      ) VALUES (
        Source.Email
       ,Source.FirstName
       ,Source.LastName
       ,Source.Title
       ,Source.DoB
       ,1
       ,GETDATE()
       ,GETDATE()
      )
WHEN NOT MATCHED BY SOURCE
    AND Target.IsActive=1
    THEN UPDATE SET
        IsActive = 0
       ,DTUpdated = GETDATE()
      ;

Which leaves our Customer table looking like this

There are many variations and uses for this, we’ll explore its use in data warehouse dimension loading over the next couple of posts.

For now, Frog-Blog out.

[UPDATE] The next blog post in this series is now available: using T-SQL Merge to load Type 2 SCDs into a data warehouse dimension.

The Frog Blog

I'm Alex Whittles.

I specialise in designing and implementing SQL Server business intelligence solutions, and this is my blog! Just a collection of thoughts, techniques and ramblings on SQL Server, Cubes, Data Warehouses, MDX, DAX and whatever else comes to mind.

Data Platform MVP

Frog Blog Out
twitter
rssicon