This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.
- Post 1: Introduction to T-SQL merge basics
- Post 2: Using T-SQL merge to load data warehouse dimensions
In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk, this post is for your benefit and is a result of the feedback and requests from you guys.
Why automate merge?
As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs – slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc which can save a considerable amount of time writing the equivalent functionality in SSIS. In the next installment of this series I’ll be discussing its performance compared to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it’s released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.
If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.
The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences for your dimension data.
The easiest way to avoid this complexity and simplify the process is to not write merge statements yourself, but to let an automated procedure do it for you – Simples!
The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.
Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.
OK, so how do we do this?
First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.
CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO
CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255) NOT NULL,
    [FirstName]     [varchar](50) NOT NULL,
    [LastName]      [varchar](50) NOT NULL,
    [DoB]           [date] NOT NULL,
    [Sex]           [char](1) NOT NULL,
    [MaritalStatus] [varchar](10) NOT NULL,
    [FirstCreated]  [date] NOT NULL,
    [IsRowCurrent]  [bit] NOT NULL,
    [ValidFrom]     [datetime] NOT NULL,
    [ValidTo]       [datetime] NOT NULL,
    [LastUpdated]   [datetime] NOT NULL,
    CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED ([CustomerKey] ASC)
)
GO
CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255) NOT NULL,
    [FirstName]     [varchar](50) NOT NULL,
    [LastName]      [varchar](50) NOT NULL,
    [DoB]           [date] NOT NULL,
    [Sex]           [char](1) NOT NULL,
    [MaritalStatus] [varchar](10) NOT NULL,
    [FirstCreated]  [date] NOT NULL
)
So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.
We then need to store the details of each field, i.e. how each field should be interpreted – is it a primary key, business key, type 0 or 1, or an audit field? We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information; however, I prefer to use the extended properties of the fields.
EXEC sys.sp_addextendedproperty @level2name=N'CustomerKey',   @value=N'PK',    @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Email',         @value=N'BK',    @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstName',     @value=N'1',     @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastName',      @value=N'1',     @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'DoB',           @value=N'1',     @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Sex',           @value=N'1',     @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'MaritalStatus', @value=N'1',     @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstCreated',  @value=N'1',     @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidFrom',     @value=N'Audit', @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidTo',       @value=N'Audit', @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'IsRowCurrent',  @value=N'Audit', @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastUpdated',   @value=N'Audit', @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'
Or you can, of course, just enter the extended properties manually using SSMS.
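This is also where the metadata-driven flexibility comes in: to change the SCD behaviour of a column later, you just update its extended property rather than rewriting any code. For example, to switch FirstCreated from type 1 to type 0 (assuming the proc treats a value of '0' as type 0, i.e. ignore changes):

EXEC sys.sp_updateextendedproperty @level2name=N'FirstCreated', @value=N'0', @name=N'SCD', @level0type=N'SCHEMA', @level0name=N'dim', @level1type=N'TABLE', @level1name=N'Customer', @level2type=N'COLUMN'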
The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.
Automating Merge
The first stage is to examine the structure of merge.
MERGE [DIMENSION TABLE] AS Target
USING [STAGING TABLE] AS Source
   ON [LIST OF BUSINESS KEY FIELDS]
WHEN MATCHED AND Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
    THEN UPDATE SET [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
WHEN NOT MATCHED
    THEN INSERT [LIST OF ALL FIELDS]
         VALUES Source.[LIST OF ALL FIELDS]
The fixed text is the skeleton of the statement, with the bracketed placeholders being the details specific to the dimension. It’s these placeholders which we need to retrieve from the metadata of the dimension in order to create the full merge statement.
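To make this concrete, here is roughly what the generated statement looks like for the Customer dimension above. This is a hand-written sketch of the expected output – the exact formatting, and the values used for the audit fields (I’ve assumed a '9999-12-31' high date here), depend on the downloaded proc:

MERGE dim.Customer AS Target
USING etl.Customer AS Source
   ON Target.Email = Source.Email
WHEN MATCHED AND (
       Target.FirstName     <> Source.FirstName
    OR Target.LastName      <> Source.LastName
    OR Target.DoB           <> Source.DoB
    OR Target.Sex           <> Source.Sex
    OR Target.MaritalStatus <> Source.MaritalStatus
    OR Target.FirstCreated  <> Source.FirstCreated )
    THEN UPDATE SET
        FirstName     = Source.FirstName,
        LastName      = Source.LastName,
        DoB           = Source.DoB,
        Sex           = Source.Sex,
        MaritalStatus = Source.MaritalStatus,
        FirstCreated  = Source.FirstCreated,
        LastUpdated   = GETDATE()
WHEN NOT MATCHED
    THEN INSERT (Email, FirstName, LastName, DoB, Sex, MaritalStatus, FirstCreated,
                 IsRowCurrent, ValidFrom, ValidTo, LastUpdated)
         VALUES (Source.Email, Source.FirstName, Source.LastName, Source.DoB, Source.Sex,
                 Source.MaritalStatus, Source.FirstCreated,
                 1, GETDATE(), '9999-12-31', GETDATE());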
We can retrieve the extended properties using the sys.extended_properties catalog view. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields etc. etc. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the World a better place.
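For example, a query along these lines pulls back all of the type 1 fields for the dimension (note the N'' literal – the value column is a sql_variant, so the comparison needs a matching Unicode string):

SELECT c.name AS ColumnName
FROM sys.extended_properties AS ep
INNER JOIN sys.columns AS c
    ON  c.[object_id] = ep.major_id
    AND c.column_id   = ep.minor_id
WHERE ep.class    = 1                             -- object/column properties
  AND ep.major_id = OBJECT_ID('dim.Customer')
  AND ep.name     = 'SCD'
  AND ep.value    = N'1'                          -- swap for N'PK', N'BK', N'Audit', etc.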
I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:
It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical – some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not; the critical thing is that where these fields do exist, they should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.
There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing it out. If you only want to use this to automate the development process then this allows you to do just that: you can copy and paste the resulting statement into SSIS or into a stored proc.
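A typical call then looks something like this. Note that the proc and parameter names here, other than @Execute, are illustrative – check the downloaded script for the exact signature:

-- Print the generated merge statement for dim.Customer without executing it.
-- The proc name and the @Schema/@Table parameters are hypothetical.
EXEC dbo.GenerateDimensionMerge
    @Schema  = 'dim',
    @Table   = 'Customer',
    @Execute = 0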
Result
The automated generation of a T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how to expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.
Download the SQL scripts here
Frog-Blog Out
Hi Alex,
Thanks a lot for the code and ideas. The explanation is fantastic. But I could not figure out what the use of the extended properties is in this. Will they be part of the SP?
I guess you used them instead of recording the business key, SCD type 0 and SCD type 1 columns in a separate table. But I cannot understand it completely. Please try to explain…
Regards,
Radhey
Hi Radhey,
Thanks for your comment. Exactly as you said, I use extended properties to save having to have a separate configuration table. They need setting up once on deployment, and then the SP just queries them to understand what it should be doing. A table would work equally well; however, I’m reluctant to add a table which would clutter up the data warehouse unnecessarily. Extended properties are a nice behind-the-scenes way of adding metadata, though that may just be my preference – I also use them to store comments and documentation. Feel free to store the configuration anywhere that suits your requirements. Just change the cursor queries accordingly and it should all work the same.
Alex
Dear Alex,
Thanks for the detailed explanation. I really appreciate that. I have tried SCD type 2 with the same method – dynamic merge. I succeeded with your inputs.
Thanks a lot for the code.
Regards,
Radhey
Care to share that code?
Hi,
I am in the process of updating my Cube/ETL to use this code; it is very clean and professional.
One problem that I have is that, from the looks of it, the staging table and dimension have to share the same name (dim.Customer and etl.Customer).
In my ETL package, we receive a CSV file that contains a data dump (200+ columns with about 4m records); I dump this into a staging table.
My staging table is dbo.DWUSD_LIVE and one of my dimensions is dim.Company
Can the script look at the dim.Company columns, go into dbo.DWUSD_LIVE (where I have 200+ columns), and only select the ones it needs (SELECT DISTINCT)?
Thanks.
I was able to change the source/staging table to be different by…
adding:
@ETLTable varchar(50),
changing:
SET @SQL = @SQL + @crlf + ' USING [' + @ETLSchema + '].[' + @ETLTable + '] AS Source'
Is there a way you can incorporate a GROUP BY in the generated T-SQL output, so that only distinct rows are returned from the source system?
Hi Mirza
The problem with trying to automatically incorporate a GROUP BY statement is knowing which columns to group by, and how to aggregate the remaining columns (max, avg, sum) etc. You’d have to add further metadata configuration to control this.
I’d question whether this is taking automated code generation too far – if it’s for a one-off table, then just write the merge statement manually. If you do indeed have many tables which follow the same pattern, then the effort to do this through code automation may be worthwhile, but it’s not a quick, simple tweak I’m afraid!
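For the manual route, one option is to de-duplicate inside the USING clause rather than with a GROUP BY – a sketch against the dbo.DWUSD_LIVE table mentioned above (the company columns are made up for illustration):

MERGE dim.Company AS Target
USING ( SELECT DISTINCT CompanyCode, CompanyName   -- select only the columns the dim needs
        FROM dbo.DWUSD_LIVE ) AS Source
   ON Target.CompanyCode = Source.CompanyCode
WHEN MATCHED AND Target.CompanyName <> Source.CompanyName
    THEN UPDATE SET CompanyName = Source.CompanyName
WHEN NOT MATCHED
    THEN INSERT (CompanyCode, CompanyName)
         VALUES (Source.CompanyCode, Source.CompanyName);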
Alex
If you want to create and maintain MERGE statement for SCD quickly, use http://scdmergewizard.codeplex.com/
cheers
Just caught your SQLBits X presentation… very interesting use of extended properties to dynamically generate the merge statement – thanks for sharing.
Thanks for the feedback Bill, glad it’s of some use!
What happens when your staging tables are named differently from your dimension tables? I have staging tables that are in a stage schema and are named stgF00005, etc. They are not named the same as the dim tables. What do you do then?
Hi Brian
You wouldn’t want to merge from staging directly into a dim, there’s always a layer of translation required first.
You usually run a query against staging, plus further transformation within SSIS, ADF, or another ETL tool. This results in a dataset that should look like your dim. Save this into a table (this could also be called staging, but it is very different from the previous staging!). Then the merge can take care of blending this into the dim, but at this stage the [2nd] staging table should have the same structure as your dim.
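As a sketch of that translation layer, using your stgF00005 raw table and the Customer tables from the post (the source column names are invented for illustration):

INSERT INTO etl.Customer (Email, FirstName, LastName, DoB, Sex, MaritalStatus, FirstCreated)
SELECT DISTINCT
    src.EmailAddress,    -- rename/transform raw columns into the shape of the dim
    src.FirstName,
    src.LastName,
    src.DateOfBirth,
    src.Sex,
    src.MaritalStatus,
    src.CreatedDate
FROM stage.stgF00005 AS src;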
Alex
Hi Alex,
Great article, thank you for sharing it. I am thinking of using this approach to load dimensions in Azure Synapse (SQL DW) – what do you think of that? It seems that Synapse currently does not support extended properties. You mentioned that we could use a configuration table instead. Would you mind giving me some more details on what the structure should be and how to use it in the SP? Another question: what would the dynamic approach be if our source and dimension columns have different names (like modified_date in the source and src_modified_date in the dimension)? Thank you in advance.
Hi Alex.
There are two main caveats to consider when trying to use Merge against Synapse.
First is the extended properties, as you’ve mentioned. We’ve created a reference table that contains the extended properties for each of our merge destination tables. It contains the following for every column of every table (plus some additional logging details):
SchemaName, TableName, ColumnName, ColumnType, SourceTable, SourceColumn, SCD, Description
You can then query this table to build your merge statement, instead of querying the extended properties of the columns.
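A minimal sketch of such a reference table, using the column list above (the table name and data types are assumptions):

CREATE TABLE etl.MergeConfig (
    SchemaName    varchar(128) NOT NULL,
    TableName     varchar(128) NOT NULL,
    ColumnName    varchar(128) NOT NULL,
    ColumnType    varchar(128) NOT NULL,
    SourceTable   varchar(128) NULL,       -- allows a differently-named source table
    SourceColumn  varchar(128) NULL,       -- allows a differently-named source column
    SCD           varchar(10)  NOT NULL,   -- e.g. 'PK', 'BK', '0', '1', 'Audit'
    [Description] varchar(500) NULL
);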
Second is Merge itself in Synapse. As of Jan 2021, Merge is still in preview in Synapse, meaning its full functionality is not currently in place. The key part of this to consider is the table distribution types that you’ve configured within your DW. With the caveats of size, functionality and some other factors, the majority of your dimension tables should end up as replicated tables, to aid speed of querying and functionality at the other end of the process. Unfortunately MERGE WHEN NOT MATCHED BY TARGET is only supported for hash-distributed target tables. This means currently you’d need to use Insert and Update with manual matching logic, rather than a full Merge as you’d like. It’s not currently clear if this is going to change once Merge is out of preview, but whether it is worth waiting to see has to be left to you and your timelines.
If you go to the following URL you can see more about the Merge limitations in Synapse, and also some of the specific syntax that needs to be used. https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15#remarks
Reiss @ PF