This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.

In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk, this post is for your benefit and is a result of the feedback and requests from you guys.

Why automate merge?

As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs – slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc which can save a considerable mount of time writing the equivalent functionality in SSIS. In the next installment of this series I’ll be discussing the performance of it compared to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it's released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.

If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.

The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences on your dimension data.

The easiest way to avoid this complexity and simplify the process is to not write merge statements, but let an automated procedure to it for you – Simples!.

The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.

Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.

OK, so how do we do this?

First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.

CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO

CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255)      NOT NULL,
    [FirstName]     [varchar](50)       NOT NULL,
    [LastName]      [varchar](50)       NOT NULL,
    [DoB]           [date]              NOT NULL,
    [Sex]           [char](1)           NOT NULL,
    [MaritalStatus] [varchar](10)       NOT NULL,
    [FirstCreated]  [date]              NOT NULL,
    [IsRowCurrent]  [bit]               NOT NULL,
    [ValidFrom]     [datetime]          NOT NULL,
    [ValidTo]       [datetime]          NOT NULL,
    [LastUpdated]   [datetime]          NOT NULL
 CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED 
(
	[CustomerKey] ASC
))
GO

CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255)  NOT NULL,
    [FirstName]     [varchar](50)   NOT NULL,
    [LastName]      [varchar](50)   NOT NULL,
    [DoB]           [date]          NOT NULL,
    [Sex]           [char](1)       NOT NULL,
    [MaritalStatus] [varchar](10)   NOT NULL,
    [FirstCreated]  [date]          NOT NULL
)

So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.

We then need to store the details of each field. i.e. how should each field be interpreted – is it a primary key, business, key, type 0 or 1, or an audit field. We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information, however I prefer to use the extended properties of the fields.

EXEC sys.sp_addextendedproperty @level2name=N'CustomerKey',  @value=N'PK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Email',        @value=N'BK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstName',    @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastName',     @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'DoB',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Sex',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'MaritalStatus',@value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstCreated', @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidFrom',    @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidTo',      @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'IsRowCurrent', @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastUpdated',  @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'

Or you can obviously just enter the extended property manually using SSMS

The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.

Automating Merge

The first stage is to examine the structure of merge.

   MERGE   [DIMENSION TABLE]  as Target
   USING   [STAGING TABLE]    as Source
      ON   [LIST OF BUSINESS KEY FIELDS]
   WHEN MATCHED AND
         Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
      THEN UPDATE SET
         [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
   WHEN NOT MATCHED THEN INSERT
         [LIST OF ALL FIELDS]
      VALUES
         Source.[LIST OF ALL FIELDS]

The text in black is the skeleton of the statement, with the text in red being the details specific to the dimension. It’s these red items which we need to retrieve from the metadata of the dimension in order to create the full merge statement.

We can retrieve the extended properties using the sys.extended_properties DMV. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields etc. etc. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the World a better place.

I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:

It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical, some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not; the critical thing is that all of your fields, if they do exist, should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.

There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing out the statement. If you only want to use this to automate the development process then this allows you to do just that, you can then just copy and paste the resulting statement into SSIS or into a stored proc.

Result

The automated generation of T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how you can expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.

Download the SQL scripts here
 
Frog-Blog Out

© Alex Whittles, Purple Frog Systems Ltd

10 Responses to “Automating T-SQL Merge to load Dimensions (SCD)”

  • Radhey:

    Hi Alex,
    Thanks a lot for the code and ideas. Explanation is fantastic. But I could not figure out what is the use of the extended properties in this. Will they be a part of the SP ?

    I guess you used them instead of populating the businesskey, SCD type 0 columns and SCD type 1 in a separate table. But I cannot understand it completely. Please try to explain …

    Regards,
    Radhey

    • Alex:

      Hi Radhey,
      Thanks for your comment. Exactly as you said, I use extended properties to save having to have a separate configuration table. They need setting up once on deployment, and then the SP just queries them to understand what it should be doing. A table would work equally well, however I’m reluctant to add a table which would clutter up the data warehouse unnecessarily. Extended properties are a nice behind the scenes way of adding metadata, however that may just be my preference – I also use them to store comments and documentation. Feel free to store the configuration anywhere that suits your requirements. Just change the cursor queries accordingly and it should all work the same.
      Alex

      • Radhey:

        Dear Alex,
        Thanks for the detailed explanation. I really appreciate that. I have tried SCD type 2 with the same method – dynamic merge. I succeeded with your inputs.

        Thanks a lot for the code.

        Regards,
        Radhey

  • Mirza:

    Hi,

    I am in the process of updating my Cube/ETL to use this code, it is very clean and professional.

    One problem that I have is from the looks of it is that you have to have the same name staging table and dimension (dim.Customer and etl.Customer).

    In my ETL package, we receive a CSV file that contains a data dump (200+ columns with about 4m records); I dump this into a staging table.

    My staging table is dbo.DWUSD_LIVE and one of my dimensions is dim.Company

    Can the script look at dim.Company columns and go into dbo.DWUSD_LIVE (where i have 200+ columns) and only select the ones it needs (SELECT DISTINCT)?

    Thanks.

  • Mirza:

    I was able to change the source/staging table to be different by…

    adding:
    @ETLTable varchar(50),

    changing:
    SET @SQL = @SQL + @crlf + ‘ USING [\' + @ETLSchema + \'].[\' + @ETLTable + \'] AS Source’

    Is there a way you can incorporate GROUP BY to be generated in your TSQL output, so that the source system only distinct rows are returned?

    • Alex:

      Hi Mirza

      The problem with trying to automatically incorporate a GROUP BY statement is knowing which columns to group by, and how to aggregate the remaining columns (max, avg, sum) etc. You’d have to add further metadata configuration to control this.

      I’d question whether this is taking automated code generation too far – if it’s for a one off table, then just write the merge statement manually. If you do indeed have many tables which follow the same pattern, then the effort to do through code automation may be worthwhile, but it’s not a quick, simple tweak I’m afraid!

      Alex

  • knyazs:

    If you want to create and maintain MERGE statement for SCD quickly, use http://scdmergewizard.codeplex.com/

    cheers

  • Bill Anton:

    Just caught your SQLBits X presentation…very interesting use of the extendedproperty to dynamically generate the merge statement – thanks for sharing.

Leave a Reply

The Frog Blog

I'm Alex Whittles.

I specialise in designing and implementing SQL Server business intelligence solutions, and this is my blog! Just a collection of thoughts, techniques and ramblings on SQL Server, Cubes, Data Warehouses, MDX, DAX and whatever else comes to mind.

Organising SQLRelay

Submit a session for SQLBits