Automating T-SQL Merge to load Dimensions (SCD)

This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.

In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk, this post is for your benefit and is a result of the feedback and requests from you guys.

Why automate merge?

As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs – slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc which can save a considerable amount of time writing the equivalent functionality in SSIS. In the next installment of this series I’ll be discussing how its performance compares to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it’s released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.

If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.

The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences on your dimension data.

The easiest way to avoid this complexity and simplify the process is to not write merge statements, but let an automated procedure do it for you – Simples!

The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.

Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.

OK, so how do we do this?

First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.

CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO

CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255)      NOT NULL,
    [FirstName]     [varchar](50)       NOT NULL,
    [LastName]      [varchar](50)       NOT NULL,
    [DoB]           [date]              NOT NULL,
    [Sex]           [char](1)           NOT NULL,
    [MaritalStatus] [varchar](10)       NOT NULL,
    [FirstCreated]  [date]              NOT NULL,
    [IsRowCurrent]  [bit]               NOT NULL,
    [ValidFrom]     [datetime]          NOT NULL,
    [ValidTo]       [datetime]          NOT NULL,
    [LastUpdated]   [datetime]          NOT NULL,
 CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED 
(
	[CustomerKey] ASC
))
GO

CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255)  NOT NULL,
    [FirstName]     [varchar](50)   NOT NULL,
    [LastName]      [varchar](50)   NOT NULL,
    [DoB]           [date]          NOT NULL,
    [Sex]           [char](1)       NOT NULL,
    [MaritalStatus] [varchar](10)   NOT NULL,
    [FirstCreated]  [date]          NOT NULL
)

So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.

We then need to store the details of each field, i.e. how each field should be interpreted – is it a primary key, business key, type 0 or 1, or an audit field? We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information, however I prefer to use the extended properties of the fields.

EXEC sys.sp_addextendedproperty @level2name=N'CustomerKey',  @value=N'PK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Email',        @value=N'BK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstName',    @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastName',     @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'DoB',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Sex',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'MaritalStatus',@value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstCreated', @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidFrom',    @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidTo',      @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'IsRowCurrent', @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastUpdated',  @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'

Or you can obviously just enter the extended properties manually using SSMS.

The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.

Automating Merge

The first stage is to examine the structure of merge.

   MERGE   [DIMENSION TABLE]  as Target
   USING   [STAGING TABLE]    as Source
      ON   [LIST OF BUSINESS KEY FIELDS]
   WHEN MATCHED AND
         Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
      THEN UPDATE SET
         [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
   WHEN NOT MATCHED THEN INSERT
         [LIST OF ALL FIELDS]
      VALUES
         Source.[LIST OF ALL FIELDS]

The keywords form the skeleton of the statement, and the bracketed placeholders are the details specific to the dimension. It’s these placeholders which we need to retrieve from the metadata of the dimension in order to create the full merge statement.

We can retrieve the extended properties using the sys.extended_properties catalog view. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields etc. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the World a better place.
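
For example, here’s a simplified sketch (not the downloadable proc itself) of the kind of metadata query it is built around – listing all of the Type 1 fields of dim.Customer by reading the SCD extended property we set earlier:

-- Simplified sketch: list the Type 1 columns of dim.Customer from the 'SCD'
-- extended property (the automation proc wraps queries like this in cursors
-- to build up the dynamic merge statement)
SELECT c.name AS ColumnName
FROM sys.extended_properties ep
    INNER JOIN sys.columns c
        ON  c.[object_id] = ep.major_id
        AND c.column_id   = ep.minor_id
WHERE ep.class    = 1   -- object/column level properties
  AND ep.name     = 'SCD'
  AND ep.major_id = OBJECT_ID('dim.Customer')
  AND CAST(ep.value AS nvarchar(10)) = N'1'   -- '1' = Type 1; 'BK', 'PK' and 'Audit' work the same way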

I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:

It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical – some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not; the critical thing is that all of your fields, if they do exist, should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.
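
As a rough illustration of that idea (my own sketch, not an extract from the downloadable proc), the generation code can simply test for the well-known field names before adding the related clauses:

-- Illustrative only: adapt the generated join clause depending on whether the
-- dimension actually has an IsRowCurrent field (field names assumed from the
-- naming convention described above)
DECLARE @JoinClause nvarchar(max) = N'Target.Email = Source.Email'

IF COL_LENGTH('dim.Customer', 'IsRowCurrent') IS NOT NULL
    SET @JoinClause += N' AND Target.IsRowCurrent = 1'

PRINT @JoinClause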

There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing out the statement. If you only want to use this to automate the development process then this allows you to do just that: you can then just copy and paste the resulting statement into SSIS or into a stored proc.
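
In outline the switch works something like this (an illustrative sketch of the pattern rather than the proc’s exact code):

-- @MergeSQL holds the generated statement; @Execute decides what happens to it
DECLARE @MergeSQL nvarchar(max) = N'/* generated merge statement goes here */'
DECLARE @Execute  bit           = 0

IF @Execute = 1
    EXEC sp_executesql @MergeSQL    -- run the merge directly as part of the ETL
ELSE
    PRINT @MergeSQL                 -- just print it, ready to copy into SSIS or a stored proc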

Result

The automated generation of a T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how you can expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.

Download the SQL scripts here
 
Frog-Blog Out

Using T-SQL Merge to load Data Warehouse dimensions

In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.

In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…

What is a Slowly Changing Dimension (SCD)?

If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place – I’m assuming a moderate level of experience of SCDs here, so check out Wikipedia for some background. In short, they manage the tracking of attribute history in dimensional data warehouses.

Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.

  • Type 0 – Ignore updates
  • Type 1 – Only keep latest version
  • Type 2 – Track history by creating a new row

Type 2 is commonly stored in a fashion similar to this.

Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.

That’s enough of that, let’s get on with doing this using Merge

Using Merge to load SCD

The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.

We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance as we go. First let’s figure out what logic we want to perform:

  • If the record doesn’t exist, create it
  • If the record does exist
    • Type 0 fields – ignore
    • Type 1 fields – update fields
    • Type 2 fields – terminate existing record, insert a new record
  • If the record exists in the dimension, but not in the updated source file – terminate record

The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.

The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. Firstly we terminate the existing row then we have to insert a new row.

The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.

We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.

Let’s look at an example. Download the code here which will create the necessary tables and data to work on.

Main Merge Statement

We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:

  • We include IsRowCurrent in the joining clause. We only ever want to update the current records, not the history.
  • DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a Type 1 change: if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 field later on.
  • The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated)
 MERGE Customer        AS [Target]
 USING StagingCustomer AS [Source]
    ON Target.Email        = Source.Email
   AND Target.IsRowCurrent = 1
 WHEN MATCHED AND
     (
          Target.FirstName <> Source.FirstName
       OR Target.LastName  <> Source.LastName
       OR Target.Title     <> Source.Title
     )
     THEN UPDATE SET
        IsRowCurrent     = 0
       ,LastUpdated      = GETDATE()
       ,ValidTo          = GETDATE()
 WHEN NOT MATCHED BY TARGET
     THEN INSERT (
         FirstName
        ,LastName
        ,Title
        ,DoB
        ,Email
        ,LastUpdated
        ,IsRowCurrent
        ,ValidFrom
        ,ValidTo
       ) VALUES (
         Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,Source.Email
        ,GETDATE()      --LastUpdated
        ,1              --IsRowCurrent
        ,GETDATE()      --ValidFrom
        ,'9999-12-31'   --ValidTo
       )
 WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
     THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated  = GETDATE()
        ,ValidTo      = GETDATE()

The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.

The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.

The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.

*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent=1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.

We then add an OUTPUT clause to the end of the statement

 OUTPUT $action AS Action
       ,Source.*

The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $Action as an extra field which will identify, for each row, whether it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATEs, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*

We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.

INSERT INTO Customer
   ( FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,LastUpdated
    ,IsRowCurrent
    ,ValidFrom
    ,ValidTo
   )
SELECT
     FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,GETDATE()    --LastUpdated
    ,1            --IsRowCurrent
    ,GETDATE()    --ValidFrom
    ,'9999-12-31' --ValidTo
FROM (
  MERGE Customer        AS [Target]
  USING StagingCustomer AS [Source]
     ON Target.Email        = Source.Email
    AND Target.IsRowCurrent = 1
  WHEN MATCHED AND
      (
           Target.FirstName <> Source.FirstName
        OR Target.LastName  <> Source.LastName
        OR Target.Title     <> Source.Title
      )
      THEN UPDATE SET
         IsRowCurrent     = 0
        ,LastUpdated      = GETDATE()
        ,ValidTo          = GETDATE()
  WHEN NOT MATCHED BY TARGET
      THEN INSERT (
          FirstName
         ,LastName
         ,Title
         ,DoB
         ,Email
         ,LastUpdated
         ,IsRowCurrent
         ,ValidFrom
         ,ValidTo
        ) VALUES (
          Source.FirstName
         ,Source.LastName
         ,Source.Title
         ,Source.DoB
         ,Source.Email
         ,GETDATE()      --LastUpdated
         ,1              --IsRowCurrent
         ,GETDATE()      --ValidFrom
         ,'9999-12-31'   --ValidTo
        )
  WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
      THEN UPDATE SET
          IsRowCurrent = 0
         ,LastUpdated  = GETDATE()
         ,ValidTo      = GETDATE()
  OUTPUT $action AS Action
        ,[Source].*
) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
  AND Email IS NOT NULL
;

Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.

So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.

We have two options here:

  • Update all historical records to the new value
  • Update only the current record to the new value

These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.

This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.

  UPDATE C
      SET DoB         = SC.DoB
         ,LastUpdated = GETDATE()
  FROM Customer C
     INNER JOIN StagingCustomer SC
          ON C.Email        =  SC.Email
       --AND C.IsRowCurrent =  1      --Optional
         AND C.DoB          <> SC.DoB

This block of code updates the Type 1 attributes (in this case, DoB). The commented-out IsRowCurrent check is optional, depending on whether you want to update only the current record or all records.

So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.

I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.

This is now my preferred approach to loading Type 2 SCDs. Slightly faster methods may be available, but as we’ll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.

Long live the Merge statement!

Update: 9th Nov 2019

Added a follow-up post to demonstrate how to capture insert/update/delete counts as outputs from the merge statement, which you can read here.
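
As a rough sketch of the idea (using throwaway tables invented purely for illustration – the follow-up post covers the full approach), the OUTPUT clause can write $action into a table variable which is then aggregated:

-- Throwaway example tables, purely for illustration
CREATE TABLE #Target (Id int PRIMARY KEY, Val int)
CREATE TABLE #Source (Id int PRIMARY KEY, Val int)
INSERT INTO #Target VALUES (1, 10), (2, 20)
INSERT INTO #Source VALUES (2, 25), (3, 30)

DECLARE @MergeLog TABLE (MergeAction nvarchar(10))

MERGE #Target AS T
USING #Source AS S
   ON T.Id = S.Id
WHEN MATCHED THEN
    UPDATE SET Val = S.Val
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Val) VALUES (S.Id, S.Val)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
OUTPUT $action INTO @MergeLog (MergeAction);

-- One row per action type, with the number of rows affected by each
SELECT MergeAction, COUNT(*) AS RowsAffected
FROM @MergeLog
GROUP BY MergeAction

DROP TABLE #Target
DROP TABLE #Source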

Frog-Blog Out

Introduction to T-SQL Merge Basics

A number of Frog-Blog posts over the next couple of months are going to make heavy use of the awesome SQL Server MERGE statement, introduced in SQL Server 2008. I thought it best to write an introduction post to provide the basics and groundwork for future posts.

So what does the Merge statement do?

Merge upserts data into a table. If, like me, you find the word ‘upsert’ more than a little ridiculous, we can scrap it in favour of a slightly more sensible description: it both inserts and updates data at the same time. Much better.

How does it do it?

In a similar way to an UPDATE statement, we need to provide a source table and a destination table. We then give it a method of matching rows between them, usually the primary key or business/natural key (this can use multiple fields).

We can then specify a number of actions depending on whether a match is found or not.

If a match is found we may want to update the existing record.

If a match isn’t found then it’s likely that we’ll want to insert a new record.

What’s the syntax?

First we set the tables and the joining key(s)

MERGE [Destination Table] AS Target
   USING [Source Table] AS Source
     ON Target.KeyField = Source.KeyField

Then we define what we want to happen if we find a match (optional)

  WHEN MATCHED

If you want you can specify other conditions, such as checking whether anything has changed

     AND  (
         Target.Field1 <> Source.Field1
         OR Target.Field2 <> Source.Field2
         )

We can then perform an action, i.e. an update, delete, etc.

  THEN UPDATE SET
        Field1 = Source.Field1
      , Field2 = Source.Field2

Then we define what we want to happen if we don’t find a matching record (optional)

   WHEN NOT MATCHED

In this case, we want to insert the new record

      THEN INSERT (
             KeyField
           , Field1
           , Field2
        ) VALUES (
             Source.KeyField
           , Source.Field1
           , Source.Field2
      );

As this is the last part of the statement we have to terminate the merge with a semicolon.

Time for an example

We’ve got a customer table, and we receive an updated set of customer details which we want to import into our table. Assuming we’ve imported the new details into a staging table, we can use merge to do the work for us.

Use this script to create the necessary tables & data.

Our existing Customer table should look like this

And our staging table (new customer data) should look like

Firstly let’s define the logic that we’re looking for.

  • Our new customer data doesn’t include the CustomerID, so we can’t match on the primary key. Instead we’ll use the email address as the key.
  • If the customer already exists then we want to check whether the fields are different, if they are then update the customer record.
  • If the customer doesn’t exist then insert them as a new customer record.
  • The Customer table contains audit fields which must be maintained. DTInserted should contain the datetime when the record was first created. DTUpdated should be updated whenever any of the data in a row is changed.

This is all performed using the following statement

MERGE Customer        AS [Target]
USING StagingCustomer AS [Source]
   ON Target.Email = Source.Email
WHEN MATCHED AND
    (
         Target.FirstName <> Source.FirstName
      OR Target.LastName  <> Source.LastName
      OR Target.Title     <> Source.Title
      OR Target.DoB       <> Source.DoB
    )
    THEN UPDATE SET
       FirstName = Source.FirstName
      ,LastName  = Source.LastName
      ,Title     = Source.Title
      ,DoB       = Source.DoB
      ,DTUpdated = GETDATE()
WHEN NOT MATCHED
    THEN INSERT (
        Email
       ,FirstName
       ,LastName
       ,Title
       ,DoB
       ,DTInserted
       ,DTUpdated
      ) VALUES (
        Source.Email
       ,Source.FirstName
       ,Source.LastName
       ,Source.Title
       ,Source.DoB
       ,GETDATE()
       ,GETDATE()
      );

It’s a long statement and looks complex at first glance, but as soon as you break it down it’s remarkably simple and easy to write.

If you run this, you should see 4 rows affected – 2 new rows inserted (Jack & Mary) and 2 rows updated (Phil & Jane). The new data has been inserted/updated into the Customer table and the audit fields have been kept up to date.

Note however that (if you look at the data in the sample scripts) Sarah & Andy exist in the Customer table but are missing from the new customer list in the update file. With the code above they just stay as they are in the Customer table; we assume they haven’t changed. What if we wanted to identify these customers and flag them as inactive or even delete them?

If we look at the ‘WHEN NOT MATCHED’ line, what this is actually doing is defaulting to ‘WHEN NOT MATCHED BY TARGET’. i.e. if you don’t find the record in the target table then do something.
We can add an extra section for ‘WHEN NOT MATCHED BY SOURCE’ which will allow us to specify an action when a row in our Customer table doesn’t have a matching record in the source staging data.

We can add:

WHEN NOT MATCHED BY SOURCE
    THEN DELETE

Which will simply delete the row in the Customer table, or we could update the record and terminate it using something like this.

WHEN NOT MATCHED BY SOURCE
    THEN UPDATE SET
        IsActive = 0

Don’t forget that the entire merge statement needs to be terminated with a semicolon, and we should also update the audit field. We also only want to update the record once, so check that the customer isn’t already terminated. We therefore end up with:

MERGE Customer        AS [Target]
USING StagingCustomer AS [Source]
   ON Target.Email = Source.Email
WHEN MATCHED AND
    (
         Target.FirstName <> Source.FirstName
      OR Target.LastName  <> Source.LastName
      OR Target.Title     <> Source.Title
      OR Target.DoB       <> Source.DoB
    )
    THEN UPDATE SET
       FirstName = Source.FirstName
      ,LastName  = Source.LastName
      ,Title     = Source.Title
      ,DoB       = Source.DoB
      ,DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET
    THEN INSERT (
        Email
       ,FirstName
       ,LastName
       ,Title
       ,DoB
       ,IsActive
       ,DTInserted
       ,DTUpdated
      ) VALUES (
        Source.Email
       ,Source.FirstName
       ,Source.LastName
       ,Source.Title
       ,Source.DoB
       ,1
       ,GETDATE()
       ,GETDATE()
      )
WHEN NOT MATCHED BY SOURCE
    AND Target.IsActive=1
    THEN UPDATE SET
        IsActive = 0
       ,DTUpdated = GETDATE()
      ;

Which leaves our Customer table looking like this

There are many variations and uses for this; we’ll explore its use in data warehouse dimension loading over the next couple of posts.

For now, Frog-Blog out.

[UPDATE] The next blog post in this series is now available: using T-SQL Merge to load Type 2 SCDs into a data warehouse dimension.

Fast record count in SQL Server using sp_spaceused

Just a quick tidbit today, but a very useful one.
Ever need to do SELECT Count(*) on a very large table? It can take ages.

Instead, use sp_spaceused

EXEC sp_spaceused 'MyTable'

It returns you a useful result set like

name      rows        reserved     data        index_size  unused
MyTable   19690290    13637488 KB  3001688 KB  9986416 KB  649384 KB

Not only does it give you extra useful information about the table, but it also comes back virtually instantly.

Columns:
  name – name of the object requested
  rows – the number of rows in the table (or service broker queue)
  reserved – the total reserved space for the object
  data – amount of space used by the data
  index_size – amount of space used by indexes in the object
  unused – total space reserved for the object but not yet used

You can also use sp_spaceused without any parameters, and it will return you the stats for the whole database.
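
For example, called with no parameters it reports on the current database:

-- No parameter: returns the size summary for the whole database
EXEC sp_spaceused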

Frog-Blog Out

SQLBits 9 Session – PowerPivot and QlikView 101

Thank you to the SQLBits committee, speakers, helpers and attendees, for making SQLBits 9 one of the best yet. What a great example SQLBits is of the power of the SQL Server community coming together to help each other, and enjoy a few beers in the process!

I was delighted to have my “PowerPivot & QlikView 101” session chosen by the committee, which I presented on Saturday afternoon. Thank you to all those who attended, great to see such a good number there. If you haven’t yet submitted your feedback for my session, and the event as a whole, please do take the time to do it. It helps everyone improve the quality of the event for next time.

For those who attended my session, here are the slides, with a summary of the results and code etc.

Slides: Download the PowerPivot & QlikView 101 slide deck here

QlikView Script: Download the QlikView Script here

PowerPivot Workbook: Download the PowerPivot workbook here

Session Video: The session recording isn’t yet available, I’ll post a link here when it’s ready.

QlikView Expressions

Sales Amount Visual Cue for the green shading
   Upper: (SUM(TOTAL SalesAmount) / COUNT(TOTAL SalesTerritoryRegion))*1.2
   Lower: (SUM(TOTAL SalesAmount) / COUNT(TOTAL SalesTerritoryRegion))*0.8

Sales Amount blue bar Gauge maximum
   max(ALL Aggr(SUM(SalesAmount),OrderMonth))

Sales Amount Year to Date
   rangesum(above(sum(SalesAmount),0,12))

Sales Amount Previous Year
   SUM({<OrderYear={'$(=Max(OrderYear)-1)'}>} SalesAmount)

PowerPivot DAX Expressions

Sales Amount Year to Date
   =TOTALYTD(SUM(FactInternetSales[SalesAmount]), DimOrderDate[FullDateAlternateKey], ALL(DimOrderDate))

Sales Amount Previous Year
   =CALCULATE(SUM(FactInternetSales[SalesAmount]), SAMEPERIODLASTYEAR(DimOrderDate[FullDateAlternateKey]), ALL(DimOrderDate))

 

PowerPivot Results

PowerPivot Dashboard

 

QlikView Results

QlikView Dashboard

If you have any questions about any of this, please get in touch.

Frog-Blog-Out

New UK SQL Server User Groups

What an exciting few weeks we’ve got to look forward to!

SQL Bits starts on Thursday this week which promises to be another fantastic event. I’m speaking on the Saturday, presenting a session on QlikView and PowerPivot. When I’m not speaking I’ve got a list of duties as long as my arm, from marshalling to room monitoring, and of course representing the Community Corner.

If you’re interested in SQL events local to you, then come and say hi to us on the Community Corner stand – a number of user group leaders in the UK will be on hand to provide any info you want regarding the SQL community, and point you towards events happening in your area.

After SQLBits, we’ve got the SQL Relay to look forward to. Every SQL Server user group in the UK is putting on an event between 3rd and 5th October 2011, with a grand finale event in London at Microsoft’s Cardinal Place on the 6th. Every event has great speakers, from MVPs to Microsoft employees, and in London we have renowned author and blogger Itzik Ben-Gan. We also have a host of fantastic prizes to give away at every event.

What makes all of this even better is that there are a number of new user groups launching during this week, including Maidenhead, Cambridge and Essex. There’s a group near you so head over to www.SQLServerFAQ.com and get registered!

Look forward to seeing you at one of the events!

SQL User Groups and SQLBits

SQLBits 9

Submit a session for SQLBits

It only seems like yesterday since the SQLBits crew put on their awesome 8th conference in Brighton. The next conference has just been announced: SQLBits will move to Liverpool for SQLBits 9 – “Query Across The Mersey” (…Really?!) between September 29th and October 1st.

Registration’s not yet open, but you can find out more at www.SQLBits.com

I’ve already submitted two sessions to present, “PowerPivot & QlikView 101” and “Advanced data modelling for data warehousing and cubes”. Make sure you vote for them when voting opens! You can read full summaries of all sessions submitted so far here.

SQL User Groups

We’ve also got a busy time coming up with the UK SQL Server user groups. I’ll be running the next Birmingham user group the week after SQLBits, on Monday 3rd October. The agenda and speakers have yet to be set, but it should be a goodun!

I’ll also be speaking at the Southampton user group in 2 weeks’ time, on Wednesday 6th July. I’ll be presenting my 3 part blog post on automating SSAS cube documentation. Matt Whitfield will also be presenting, so it should be a good night. Now stop reading this and go and register!!

SQL Server Art using Spatial Data

This post follows on from an earlier post on drawing with SQL Server, and explains how to create much more complex drawings using a couple of neat tricks and SQL Server spatial data.

Firstly, apologies to those at my session at SQL Bits to whom I promised this blog post, I did say I’d try and get it posted in a week, and it’s been a month – but it’s here now!

So, what are we trying to do? In my earlier post I demonstrated how to recreate a block drawing by tracing around the points on the edges and converting the results to SQL spatial data coordinates. This is ok if the image is a simple logo, but what if it’s more complex, like a photo or sketch? It would take an age to trace, so we need a more automated approach.

At this point I’ll make my second apology, to Simon Sabin, who must by now think that Alastair Aitchison and I are stalking him. This post (and my associated lightning talk at SQLBits) derived from finding that Alastair had drawn the SQLBits logo over a month before I did mine. Feeling a little dejected I needed a new project. Simon set us a challenge of improving on it. One thing led to another and both Alastair and I started drawing portraits of Simon, and here we are.

So, let’s start with the picture. I chose Simon’s twitter profile pic.

The first step is to convert the bitmap image into a vector image. A bitmap image is a collection of dots whereas a vector image is a collection of lines, better suited to drawing in SQL Server. There’s a great website that takes care of this for you, vectormagic.com and you get two free conversions to try it out. Upload the image, and then download the converted file in EPS format. EPS is ideal for our purposes as it’s a simple text file containing one instruction per line. We can then convert each line into a SQL spatial line.

I found the easiest way of converting the EPS lines into SQL spatial queries is with an Excel spreadsheet (download here). Paste the full contents of the EPS file into column A of the first sheet, the expressions in columns B to N strip out the coordinates and build them into SQL “geometry::STGeomFromText” commands.

You may notice that we’re converting the ‘curveto’ commands in the EPS file into ‘linestring’ commands in SQL. This does result in an approximation of the curves, but this is barely noticeable in drawings like this. Worth noting that SQL Denali is planned to support curves, so we will be able to make it more accurate.
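
For reference, once curve support arrives the same approach can describe an arc directly rather than approximating it with short straight segments – for example, this works from SQL Server 2012 onwards:

-- SQL Server 2012 ("Denali") onwards: circular arc segments are supported
SELECT geometry::STGeomFromText('CIRCULARSTRING (0 0, 5 5, 10 0)', 0) AS Arc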

Filter column O to remove the blanks (using the filter in cell O1), then copy the whole column to the clipboard. Then paste it into SQL Server Management Studio.

Then we have to do a little tweaking to make it work.

First we need to define a table variable at the very top of the query window. The first few lines of the query should look like this.

DECLARE @drawing TABLE (geom geometry);

INSERT INTO @drawing (geom) VALUES

(geometry::STGeomFromText('LINESTRING (0.00 960.00, 640.00 960.00)',0))
,(geometry::STGeomFromText('LINESTRING (640.00 960.00, 640.00 800.00)',0))

We can only insert 1000 rows at a time into a table variable, so we need to break the insert statement up. Every 900 rows or so add a new INSERT statement such as

,(geometry::STGeomFromText('LINESTRING (199.39 416.98, 199.07 422.49)',0))
,(geometry::STGeomFromText('LINESTRING (252.58 421.98, 252.87 424.99)',0))

INSERT INTO @drawing (geom) VALUES

(geometry::STGeomFromText('LINESTRING (256.22 430.24, 257.03 426.03)',0))
,(geometry::STGeomFromText('LINESTRING (256.19 417.52, 258.07 416.85)',0))

Then at the end, we just need to select all rows from the table variable

SELECT * FROM @drawing

If all goes well, you should end up with something like this


You can download the finished SQL script here

SQL Server art at work!

SQLBits 8 in Photos

Well what a few days. Everyone I spoke to agreed that SQLBits 8 has been the best SQLBits conference yet. The organisers did a fantastic job, the whole event seemed to run incredibly smoothly and was very well received. I’m sure there’s a certain element of the swan effect, with frantic paddling behind the scenes, but they pulled it off very well indeed. A massive thanks for all of your effort.

For me, this was the first SQLBits conference that I’ve presented at, and I have to say it made it even better for me. I’ve got so much out of SQLBits over the years it’s really nice to be able to give something back and contribute to the event. I thoroughly enjoyed presenting my sessions, and will definitely get a session or two lined up for the next conference.

I took a few photos over the 3 days, and thought I’d share them to give you a feel of the event if you couldn’t make it.

The venue: Brighton

And we couldn’t have asked for better weather. Just fabulous.

The hotel, The Brighton Grand.

The general consensus seems to have been that the best SQLBits conferences so far have been Celtic Manor and Brighton. To me, this is because both were held in hotels, which brings the social and work elements of the conference together.

The first day: SQLBits Insight

This is the first time the ‘Insight’ day has been held, and is aimed at more of a non-technical audience; managers, decision makers, architects etc.

I thought it was a great success, and was far better attended than I expected, with every seat full and standing room only. Great to see.

The panel discussion.

Some really interesting topics covered, with a very experienced panel. Mark Souza (SQL Server General Manager) and Guy Lucchi (CTO for the NHS project at CSC) were especially interesting. If anyone tries telling you that SQL Server doesn’t scale, tell them to get in touch with Guy – the sheer scale of the NHS project defies belief, and is all SQL Server…

The Woz!

One of the highlights for me was Steve Wozniak, co-founder of Apple. He gave a fascinating talk about his history with electronics, and how it has led him to be involved with the Fusion IO team. If you’ve not heard of Fusion IO, go and look them up, they have some awesome kit for insanely fast storage systems. Very impressive.

Not just fast storage space…

Love this photo, tweeted by Aaron Bertrand. Is this the world’s most expensive iPhone charger?!

The Games.

On Thursday night, the Fusion IO guys put on the best SQLBits party yet. Not only can they give you really impressive IO stats, they can also give you a really impressive hangover… One of the highlights was the bucking bronco.

This is Nigel from Barclays Loans getting thrown off with an element of style! He won an X-Box for his troubles though!

Update!
Nigel has just been in touch to clarify a point… He did of course give his XBox to Kath (from innocent drinks) straight away as an act of gallantry. As he wonderfully put it: “All I got out of it is nasty bruising, a faulty thumb, and a good story”

And who says chivalry is dead?! Most admirable Nigel, you put the rest of us to shame as there’s no way I’m giving up my iPod!
 

Woz signed iPod!

I couldn’t resist having a go myself, and won an iPod signed by Steve Wozniak! Result.

The mandatory late night greasy spoon. Well, you need something to soak up the free bar!

One of my favourite parts of the conference is the people. It’s great to meet up with old colleagues, friends and clients, as well as putting faces to the twitterati and of course meeting some great new people.
The SQLCAT team.

Great to have Thomas Kejser and co from the SQL Customer Advisory Team. These guys know their stuff!

I have to confess, I did manage to sneak off for a few minutes and enjoy an ice cream on the beach… It would have been rude not to, I’m sure you’ll agree…
Friday night fun and games began with a game of giant Connect 4. I’m proud to say I managed a 4 game unbeaten run before having to quit on a high and head off to the speakers and sponsors meal.
The speakers and sponsors meal on Friday night
My conference room!

I gave two talks on the Saturday, one ‘lightning’ talk and one full hour session, both in this room. The lightning talks were a great success, lots of speakers both experienced and new, presenting a brief 5-15 minute session on something interesting. I was pleasantly surprised by the great turnout to these sessions, I hope they do more at the next SQLBits.

It’s not all about learning SQL stuff…

Jamie Thompson did a great job of recording some great impromptu videos which he’s uploaded to YouTube. One of his more amusing and random videos is of Allan Mitchell explaining the finer details of the espresso macchiato.

He also recorded a couple of me summarising my two sessions: here and here.

In summary, great venue, great conference, great sessions, great people.

See you all at SQLBits 9!

 

Drawing a logo or diagram using SQL spatial data

I have to admit that I’m really excited about presenting a session at SQLBits 8 in Brighton next week. I’ve been an avid supporter of SQLBits since the first conference that I attended (SQLBits 2), and am thoroughly looking forward to finally getting a chance to be a part of the event and presenting my own session. If you’re going, I hope to see you there!

 

My session is about using SSRS, SQL spatial data and DMVs to visualise SSAS OLAP cube structures and generate real-time automated cube documentation (blog post here if you want to know more…).

 

This shows an unusual use for spatial data, drawing diagrams instead of the usual demonstrations which are pretty much always displaying sales by region on a map etc. Whilst writing my demos, it got me thinking – why not use spatial data to draw even more complex pictures, diagrams or logos…

 

So, I set to work trying to write a query to draw the SQLBits logo…

 

The first step is to define the coordinates of the image edges. For this I found a great website designed to help you create HTML image maps (www.image-maps.com). You can upload the image then just click on every corner. It turns this into an HTML image map, which without too much work can be converted into a SQL spatial query.

 

I’ve made it so that each object (or letter) is one query so all 8 queries (s, q, l, b, i, t, s, and the database in the middle) are union’d together to create the entire image.

 

Simple letters (s, l, t & s) are a single polygon, so we can use

SELECT geometry::STPolyFromText('POLYGON ((x y, x y, x y))',0) AS Drawing

Where each xy pairing is a point around the image, and the last point must be the same as the first, to create a closed polygon.

 

Complex letters such as q however need a multi polygon. These allow us to create one polygon for the outline, and then another to remove the hole in the middle. i.e.

SELECT geometry::STMPolyFromText('MULTIPOLYGON (((x y, x y, x y)),((x y, x y, x y)))',0) AS Drawing

With each coordinate group following the same rules as above.

 

We end up with this

SELECT geometry::STPolyFromText('POLYGON ((104 -222, 173 -222, 174 -174, 171 -160, 163 -147, 150 -137, 136 -128, 123 -123, 110 -117, 82 -116, 61 -122, 41 -134, 17 -150, 6 -173, 1 -194, 0 -232, 9 -259, 21 -276, 32 -289, 52 -302, 69 -312, 88 -320, 105 -335, 110 -375, 102 -390, 84 -395, 75 -385, 76 -330, 5 -333, 7 -390, 11 -411, 25 -428, 42 -442, 67 -451, 105 -453, 126 -446, 144 -439, 162 -424, 173 -404, 180 -382, 182 -337, 178 -311, 167 -296, 153 -279, 138 -268, 89 -234, 75 -222, 71 -208, 73 -188, 88 -178, 100 -190, 105 -220, 104 -222))',0) AS Drawing

UNION ALL

SELECT geometry::STMPolyFromText('MULTIPOLYGON (((324 -127, 404 -127, 405 -488, 322 -490, 322 -421, 311 -432, 291 -446, 277 -452, 259 -453, 248 -446, 239 -440, 228 -429, 221 -419, 215 -402, 215 -386, 213 -188, 216 -174, 219 -159, 226 -148, 235 -140, 245 -132, 261 -127, 278 -127, 294 -134, 306 -143, 322 -158, 324 -127)),((296 -191, 300 -186, 308 -182, 319 -188, 324 -196, 322 -384, 317 -391, 311 -395, 305 -395, 300 -395, 293 -388, 296 -191)))',0) AS Drawing

UNION ALL

SELECT geometry::STPolyFromText('POLYGON ((447 -62, 532 -65, 532 -450, 447 -450, 447 -62))',0) AS Drawing

UNION ALL

SELECT geometry::STMPolyFromText('MULTIPOLYGON (((991 -170, 1053 -146, 1055 -209, 1065 -201, 1072 -190, 1089 -183, 1108 -181, 1122 -191, 1134 -199, 1139 -217, 1140 -386, 1133 -399, 1129 -408, 1116 -418, 1104 -422, 1090 -419, 1078 -413, 1073 -405, 1066 -397, 1055 -386, 1054 -405, 991 -381, 991 -170)),((1053 -233, 1057 -226, 1067 -224, 1078 -235, 1078 -366, 1074 -373, 1063 -375, 1054 -367, 1053 -233)))',0) AS Drawing

UNION ALL

SELECT geometry::STMPolyFromText('MULTIPOLYGON (((1159 -199, 1226 -198, 1227 -431, 1160 -428, 1159 -199)),((1161 -121, 1227 -111, 1228 -162, 1162 -169, 1161 -121)))',0) AS Drawing

UNION ALL

SELECT geometry::STPolyFromText('POLYGON ((1260 -132, 1322 -133, 1324 -183, 1348 -184, 1350 -227, 1323 -227, 1323 -378, 1354 -377, 1354 -421, 1297 -433, 1283 -432, 1274 -426, 1267 -420, 1260 -407, 1261 -224, 1243 -225, 1241 -179, 1260 -181, 1260 -132))',0) AS Drawing

UNION ALL

SELECT geometry::STPolyFromText('POLYGON ((1445 -259, 1447 -233, 1445 -228, 1438 -224, 1427 -225, 1424 -236, 1426 -252, 1435 -266, 1451 -275, 1465 -286, 1479 -294, 1491 -307, 1499 -319, 1498 -341, 1493 -354, 1485 -369, 1476 -382, 1459 -393, 1440 -401, 1421 -404, 1404 -404, 1393 -398, 1379 -386, 1376 -370, 1373 -364, 1373 -334, 1423 -330, 1424 -359, 1432 -366, 1440 -364, 1448 -358, 1449 -340, 1447 -328, 1440 -319, 1426 -314, 1416 -307, 1406 -300, 1393 -294, 1385 -283, 1379 -270, 1376 -258, 1371 -245, 1371 -232, 1375 -219, 1382 -204, 1390 -189, 1405 -182, 1428 -182, 1442 -192, 1458 -201, 1473 -214, 1489 -231, 1494 -260, 1445 -259))',0) AS Drawing

UNION ALL

SELECT geometry::STMPolyFromText('MULTIPOLYGON (((579 -40, 589 -29, 602 -22, 621 -15, 639 -13, 656 -9, 676 -7, 698 -4, 722 -2, 749 -1, 853 -0, 886 -4, 915 -7, 937 -12, 967 -16, 984 -25, 1000 -32, 1006 -59, 999 -61, 986 -65, 976 -75, 970 -88, 968 -102, 971 -121, 956 -127, 945 -135, 931 -149, 921 -166, 921 -183, 928 -199, 939 -209, 945 -216, 937 -224, 927 -234, 918 -246, 915 -260, 915 -278, 923 -293, 928 -308, 944 -317, 936 -328, 927 -341, 924 -354, 923 -374, 933 -389, 943 -400, 957 -404, 968 -407, 967 -420, 967 -437, 976 -449, 988 -459, 1008 -467, 1000 -476, 991 -483, 971 -492, 957 -494, 943 -500, 926 -503, 906 -507, 888 -507, 709 -508, 692 -506, 674 -505, 656 -501, 642 -498, 624 -496, 606 -491, 591 -485, 577 -473, 579 -40)), '

+ '((579 -136, 591 -144, 606 -150, 623 -154, 641 -159, 664 -163, 684 -165, 702 -169, 732 -170, 758 -171, 845 -173, 873 -170, 925 -162, 922 -172, 901 -177, 862 -183, 818 -186, 759 -185, 714 -183, 681 -182, 647 -174, 613 -168, 588 -161, 580 -151, 579 -136)),'

+ '((578 -246, 593 -257, 613 -265, 636 -271, 664 -276, 694 -277, 724 -281, 789 -283, 833 -283, 873 -281, 916 -273, 919 -285, 884 -293, 840 -295, 809 -299, 768 -299, 731 -298, 703 -295, 672 -293, 647 -289, 624 -281, 605 -276, 593 -271, 580 -262, 579 -262, 578 -246)),'

+ '((578 -360, 593 -369, 615 -377, 635 -382, 664 -388, 689 -390, 716 -394, 751 -395, 857 -394, 881 -391, 905 -389, 932 -383, 939 -392, 917 -399, 880 -405, 839 -409, 786 -411, 739 -411, 701 -409, 667 -405, 635 -399, 611 -392, 591 -383, 580 -377, 578 -360)))',0) AS Drawing

 

Which, when we run it in SQL 2008 Management Studio, returns the results as

When you run a query which includes a spatial data type as a column, SSMS gives us a new tab, ‘Spatial results’. Clicking on this gives us a visual representation of the spatial results.

 

Note that I’ve had some trouble viewing multi polygons in SQL 2008 Management Studio, and can only get them to work in R2. Basic polygons seem to be fine in SQL 2008 though.

 

We can put this directly in a SQL Server Reporting Services map component (set to planar coordinates) and see it in a report.

 

Frog-Blog out

 
