This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.

In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk, this post is for your benefit and is a result of the feedback and requests from you guys.

Why automate merge?

As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs – slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc which can save a considerable mount of time writing the equivalent functionality in SSIS. In the next installment of this series I’ll be discussing the performance of it compared to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it’s released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.

If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.

The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences on your dimension data.

The easiest way to avoid this complexity and simplify the process is to not write merge statements, but let an automated procedure to it for you – Simples!.

The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.

Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.

OK, so how do we do this?

First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.

CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO

CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255)      NOT NULL,
    [FirstName]     [varchar](50)       NOT NULL,
    [LastName]      [varchar](50)       NOT NULL,
    [DoB]           [date]              NOT NULL,
    [Sex]           [char](1)           NOT NULL,
    [MaritalStatus] [varchar](10)       NOT NULL,
    [FirstCreated]  [date]              NOT NULL,
    [IsRowCurrent]  [bit]               NOT NULL,
    [ValidFrom]     [datetime]          NOT NULL,
    [ValidTo]       [datetime]          NOT NULL,
    [LastUpdated]   [datetime]          NOT NULL
 CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED 
(
	[CustomerKey] ASC
))
GO

CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255)  NOT NULL,
    [FirstName]     [varchar](50)   NOT NULL,
    [LastName]      [varchar](50)   NOT NULL,
    [DoB]           [date]          NOT NULL,
    [Sex]           [char](1)       NOT NULL,
    [MaritalStatus] [varchar](10)   NOT NULL,
    [FirstCreated]  [date]          NOT NULL
)

So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.

We then need to store the details of each field. i.e. how should each field be interpreted – is it a primary key, business, key, type 0 or 1, or an audit field. We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information, however I prefer to use the extended properties of the fields.

EXEC sys.sp_addextendedproperty @level2name=N'CustomerKey',  @value=N'PK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Email',        @value=N'BK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstName',    @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastName',     @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'DoB',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Sex',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'MaritalStatus',@value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstCreated', @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidFrom',    @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidTo',      @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'IsRowCurrent', @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastUpdated',  @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'

Or you can obviously just enter the extended property manually using SSMS

The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.

Automating Merge

The first stage is to examine the structure of merge.

   MERGE   [DIMENSION TABLE]  as Target
   USING   [STAGING TABLE]    as Source
      ON   [LIST OF BUSINESS KEY FIELDS]
   WHEN MATCHED AND
         Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
      THEN UPDATE SET
         [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
   WHEN NOT MATCHED THEN INSERT
         [LIST OF ALL FIELDS]
      VALUES
         Source.[LIST OF ALL FIELDS]

The text in black is the skeleton of the statement, with the text in red being the details specific to the dimension. It’s these red items which we need to retrieve from the metadata of the dimension in order to create the full merge statement.

We can retrieve the extended properties using the sys.extended_properties DMV. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields etc. etc. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the World a better place.

I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:

It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical, some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not; the critical thing is that all of your fields, if they do exist, should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.

There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing out the statement. If you only want to use this to automate the development process then this allows you to do just that, you can then just copy and paste the resulting statement into SSIS or into a stored proc.

Result

The automated generation of T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how you can expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.

Download the SQL scripts here
 
Frog-Blog Out

Tags: , , , ,