Business Intelligence

Well, after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.

During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions. Specifically loading Type 2 SCDs using SSIS.

For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.

I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:

The charts below show the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.

The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.

Full test results and analysis in the PDF download above.

My previous few posts show how using a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes the performance is equivalent to more traditional approaches.

Hope this is of use. If you want to know a bit more without reading the full 99 pages & 23k words (who could blame you?!), then my SQLBits talk video is now on-line here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [When this post was written] but should be up soon. Keep checking on the BI PASS Chapter website.

Frog-Blog-Out

© Alex Whittles, Purple Frog Systems Ltd

SQLBits X Video Now available

The video of my talk at SQLBits X is now available on the SQLBits website here. The talk was focused on presenting the results of my MSc Business Intelligence dissertation, comparing the performance of different methods of using SSIS to load data warehouse dimensions, specifically type 2 SCDs.

The talk also covers a comparison of the performance between traditional hard disks and solid state storage systems such as Fusion IO.

I then present a method of using the T-SQL Merge statement to automate a significant part of the ETL process.

You can find the code behind the demos on various recent Frog-Blog posts, and there is more to come, so look back soon!

 

PASS BI Virtual Chapter Talk

For those who would rather hear the talk presented live, or want to ask any questions, please join me for a repeat of this talk over Live Meeting for the PASS BI Virtual Chapter on Thursday 28th June, at 5pm UK time, 12pm EDT (US). You can find the details on the PASS BI chapter website here.

© Alex Whittles, Purple Frog Systems Ltd

This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.

In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve the reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk; this post is for your benefit and is a result of the feedback and requests from you guys.

Why automate merge?

As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs – slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc which can save a considerable amount of time compared to writing the equivalent functionality in SSIS. In the next instalment of this series I’ll be discussing its performance compared to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it’s released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.

If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.

The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences for your dimension data.

The easiest way to avoid this complexity and simplify the process is to not write merge statements at all, but let an automated procedure do it for you – simples!

The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.

Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.

OK, so how do we do this?

First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.

CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO

CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255)      NOT NULL,
    [FirstName]     [varchar](50)       NOT NULL,
    [LastName]      [varchar](50)       NOT NULL,
    [DoB]           [date]              NOT NULL,
    [Sex]           [char](1)           NOT NULL,
    [MaritalStatus] [varchar](10)       NOT NULL,
    [FirstCreated]  [date]              NOT NULL,
    [IsRowCurrent]  [bit]               NOT NULL,
    [ValidFrom]     [datetime]          NOT NULL,
    [ValidTo]       [datetime]          NOT NULL,
    [LastUpdated]   [datetime]          NOT NULL,
 CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED 
(
	[CustomerKey] ASC
))
GO

CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255)  NOT NULL,
    [FirstName]     [varchar](50)   NOT NULL,
    [LastName]      [varchar](50)   NOT NULL,
    [DoB]           [date]          NOT NULL,
    [Sex]           [char](1)       NOT NULL,
    [MaritalStatus] [varchar](10)   NOT NULL,
    [FirstCreated]  [date]          NOT NULL
)

So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.

We then need to store the details of each field, i.e. how each field should be interpreted – is it a primary key, business key, type 0 or 1, or an audit field? We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information, however I prefer to use the extended properties of the fields.

EXEC sys.sp_addextendedproperty @level2name=N'CustomerKey',  @value=N'PK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Email',        @value=N'BK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstName',    @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastName',     @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'DoB',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Sex',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'MaritalStatus',@value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstCreated', @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidFrom',    @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidTo',      @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'IsRowCurrent', @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastUpdated',  @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'

Or you can obviously just enter the extended properties manually using SSMS.

The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.
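
For example, a couple of illustrative staging rows (the data here is made up purely for illustration; in practice the SSIS data flow populates this table):

-- Illustrative rows only; normally an SSIS data flow loads etl.Customer.
INSERT INTO [etl].[Customer] (Email, FirstName, LastName, DoB, Sex, MaritalStatus, FirstCreated)
VALUES ('jane@example.com', 'Jane', 'Jones', '1985-03-12', 'F', 'Married', '2010-01-01')
      ,('bob@example.com',  'Bob',  'Smith', '1979-07-30', 'M', 'Single',  '2012-02-14');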

Automating Merge

The first stage is to examine the structure of merge.

   MERGE   [DIMENSION TABLE]  as Target
   USING   [STAGING TABLE]    as Source
      ON   [LIST OF BUSINESS KEY FIELDS]
   WHEN MATCHED AND
         Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
      THEN UPDATE SET
         [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
   WHEN NOT MATCHED THEN INSERT
         [LIST OF ALL FIELDS]
      VALUES
         Source.[LIST OF ALL FIELDS]

The plain text is the skeleton of the statement, with the bracketed placeholders being the details specific to the dimension. It’s these placeholders which we need to retrieve from the metadata of the dimension in order to create the full merge statement.

We can retrieve the extended properties using the sys.extended_properties catalog view. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields etc. etc. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the World a better place.
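
As a minimal sketch of the idea (assuming the ‘SCD’ extended properties created above), the following query lists each column of dim.Customer along with its SCD role; the automation proc builds on queries like this, looping through the results to assemble the dynamic merge statement.

-- List the columns of dim.Customer tagged with the 'SCD' extended property
SELECT  c.name                         AS ColumnName
       ,CAST(ep.value AS nvarchar(10)) AS SCDRole    -- 'PK', 'BK', '1', 'Audit', etc.
FROM    sys.extended_properties ep
        INNER JOIN sys.columns c
            ON  c.[object_id] = ep.major_id
            AND c.column_id   = ep.minor_id
WHERE   ep.class    = 1                              -- object/column level properties
  AND   ep.name     = 'SCD'
  AND   ep.major_id = OBJECT_ID('dim.Customer')
ORDER BY c.column_id;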

I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:

It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical; some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not. The critical thing is that all of your fields, if they do exist, should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.

There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing out the statement. If you only want to use this to automate the development process then this allows you to do just that: you can then copy and paste the resulting statement into SSIS or into a stored proc.
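
As an illustration only (the proc name and parameter names below are assumptions; the downloadable script defines the real signature), a call might look something like this:

-- Hypothetical call; check the downloaded script for the actual proc name and parameters.
EXEC [etl].[usp_GenerateDimensionMerge]
     @SchemaName = 'dim'
    ,@TableName  = 'Customer'
    ,@Execute    = 0;   -- 0 = just PRINT the generated merge statement, 1 = execute it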

Result

The automated generation of a T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how you can expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.

Download the SQL scripts here
 
Frog-Blog Out

© Alex Whittles, Purple Frog Systems Ltd

In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.

In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…

What is a Slowly Changing Dimension (SCD)?

If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place; I’m assuming a moderate level of experience of SCDs here, so check out Wikipedia for some background. In short, they manage the tracking of attribute history in dimensional data warehouses.

Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.

  • Type 0 – Ignore updates
  • Type 1 – Only keep latest version
  • Type 2 – Track history by creating a new row

Type 2 is commonly stored in a fashion similar to this:
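
An illustrative example (made-up data):

   Email             FirstName  LastName  Title  IsRowCurrent  ValidFrom   ValidTo
   jane@example.com  Jane       Smith     Miss   0             2010-01-01  2011-06-15
   jane@example.com  Jane       Jones     Mrs    1             2011-06-15  9999-12-31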

Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.

That’s enough of that; let’s get on with doing this using Merge.

Using Merge to load SCD

The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.

We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance as we go. First let’s figure out what logic we want to perform:

  • If the record doesn’t exist, create it
  • If the record does exist
    • Type 0 fields – ignore
    • Type 1 fields – update fields
    • Type 2 fields – terminate existing record, insert a new record
  • If the record exists in the dimension, but not in the updated source file – terminate record

The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.

The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. Firstly we terminate the existing row then we have to insert a new row.

The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.

We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.

Let’s look at an example. Download the code here which will create the necessary tables and data to work on.

Main Merge Statement

We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:

  • We include IsRowCurrent in the joining clause. We only ever want to update the current records, not the history.
  • DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a type 1 change; if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 change later on.
  • The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated)
 MERGE Customer        AS [Target]
 USING StagingCustomer AS [Source]
    ON Target.Email        = Source.Email
   AND Target.IsRowCurrent = 1
 WHEN MATCHED AND
     (
          Target.FirstName <> Source.FirstName
       OR Target.LastName  <> Source.LastName
       OR Target.Title     <> Source.Title
     )
     THEN UPDATE SET
        IsRowCurrent     = 0
       ,LastUpdated      = GETDATE()
       ,ValidTo          = GETDATE()
 WHEN NOT MATCHED BY TARGET
     THEN INSERT (
         FirstName
        ,LastName
        ,Title
        ,DoB
        ,Email
        ,LastUpdated
        ,IsRowCurrent
        ,ValidFrom
        ,ValidTo
       ) VALUES (
         Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,Source.Email
        ,GETDATE()      --LastUpdated
        ,1              --IsRowCurrent
        ,GETDATE()      --ValidFrom
        ,'9999-12-31'   --ValidTo
       )
 WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
     THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated  = GETDATE()
        ,ValidTo      = GETDATE()

The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.

The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.

The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.

*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent = 1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.

We then add an OUTPUT clause to the end of the statement:

 OUTPUT $action AS Action
       ,Source.*

The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $Action as an extra field which will identify, for each row, whether it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATEs, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*

We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.

INSERT INTO Customer
   ( FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,LastUpdated
    ,IsRowCurrent
    ,ValidFrom
    ,ValidTo
   )
SELECT
     FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,GETDATE()    --LastUpdated
    ,1            --IsRowCurrent
    ,GETDATE()    --ValidFrom
    ,'9999-12-31' --ValidTo
FROM (
  MERGE Customer        AS [Target]
  USING StagingCustomer AS [Source]
     ON Target.Email        = Source.Email
    AND Target.IsRowCurrent = 1
  WHEN MATCHED AND
      (
           Target.FirstName <> Source.FirstName
        OR Target.LastName  <> Source.LastName
        OR Target.Title     <> Source.Title
      )
      THEN UPDATE SET
         IsRowCurrent     = 0
        ,LastUpdated      = GETDATE()
        ,ValidTo          = GETDATE()
  WHEN NOT MATCHED BY TARGET
      THEN INSERT (
          FirstName
         ,LastName
         ,Title
         ,DoB
         ,Email
         ,LastUpdated
         ,IsRowCurrent
         ,ValidFrom
         ,ValidTo
        ) VALUES (
          Source.FirstName
         ,Source.LastName
         ,Source.Title
         ,Source.DoB
         ,Source.Email
         ,GETDATE()      --LastUpdated
         ,1              --IsRowCurrent
         ,GETDATE()      --ValidFrom
         ,'9999-12-31'   --ValidTo
        )
  WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
      THEN UPDATE SET
          IsRowCurrent = 0
         ,LastUpdated  = GETDATE()
         ,ValidTo      = GETDATE()
  OUTPUT $action AS Action
        ,[Source].*
) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
  AND Email IS NOT NULL
;

Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.

So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.

We have two options here:

  • Update all historical records to the new value
  • Update only the current record to the new value

These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.

This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.

  UPDATE C
      SET DoB         = SC.DoB
         ,LastUpdated = GETDATE()
  FROM Customer C
     INNER JOIN StagingCustomer SC
          ON C.Email        =  SC.Email
       --AND C.IsRowCurrent =  1      --Optional
         AND C.DoB          <> SC.DoB

This block of code updates the Type 1 attributes (in this case, DoB). The commented-out IsRowCurrent line is optional, depending on whether you want to update only the current record or all records.

So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.

I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQLBits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.

This is now my preferred approach to loading Type 2 SCDs. Slightly faster methods may be available but, as we’ll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.

Long live the Merge statement!

Frog-Blog Out

© Alex Whittles, Purple Frog Systems Ltd

PowerPivot Vs QlikView 101 – SQLBits Video

The video is now available from my PowerPivot and QlikView talk at SQLBits 9 in Liverpool in September 2011. You can download or watch the video here.


In this 1 hour session I create interactive dashboards from scratch in both PowerPivot and QlikView, showing how to set up the data model, overcome common pitfalls and build the dashboards. I create almost identical dashboards in both systems and highlight the pros and cons of each system.




You can find the scripts and code to go with this video in this blog post


SQLBits 10 is coming!!

Today the SQLBits organisers have announced that SQLBits 10 will be held in London from 29th to 31st March 2012 at the Novotel London West. It’s going to be even bigger and better, so keep an eye on the website and get your tickets early.

© Alex Whittles, Purple Frog Systems Ltd

Thank you to the SQLBits committee, speakers, helpers and attendees, for making SQLBits 9 one of the best yet. What a great example SQLBits is of the power of the SQL Server community coming together to help each other, and enjoy a few beers in the process!

I was delighted to have my “PowerPivot & QlikView 101” session chosen by the committee, which I presented on Saturday afternoon. Thank you to all those who attended; it was great to see such a good number there. If you haven’t yet submitted your feedback for my session, and the event as a whole, please do take the time to do it. It helps everyone improve the quality of the event for next time.

For those who attended my session, here are the slides, with a summary of the results and code etc.

Slides: Download the PowerPivot & QlikView 101 slide deck here

QlikView Script: Download the QlikView Script here

PowerPivot Workbook: Download the PowerPivot workbook here

Session Video: The session recording isn’t yet available, I’ll post a link here when it’s ready.

QlikView Expressions

Sales Amount Visual Cue for the green shading
   Upper: (SUM(TOTAL SalesAmount) / COUNT(TOTAL SalesTerritoryRegion))*1.2
   Lower: (SUM(TOTAL SalesAmount) / COUNT(TOTAL SalesTerritoryRegion))*0.8

Sales Amount blue bar gauge maximum
   max(ALL Aggr(SUM(SalesAmount),OrderMonth))

Sales Amount Year to Date
   rangesum(above(sum(SalesAmount),0,12))

Sales Amount Previous Year
   SUM({<OrderYear={'$(=Max(OrderYear)-1)'}>} SalesAmount)

PowerPivot DAX Expressions

Sales Amount Year to Date
   =TOTALYTD(SUM(FactInternetSales[SalesAmount]), DimOrderDate[FullDateAlternateKey], ALL(DimOrderDate))

Sales Amount Previous Year
   =CALCULATE(SUM(FactInternetSales[SalesAmount]), SAMEPERIODLASTYEAR(DimOrderDate[FullDateAlternateKey]), ALL(DimOrderDate))

 

PowerPivot Results

PowerPivot Dashboard

 

QlikView Results

QlikView Dashboard

If you have any questions about any of this, please get in touch.

Frog-Blog-Out

© Alex Whittles, Purple Frog Systems Ltd

SQLBits 9


It only seems like yesterday that the SQLBits crew put on their awesome 8th conference in Brighton. The next conference has just been announced: SQLBits will move to Liverpool for SQLBits 9 – “Query Across The Mersey” (…Really?!) between September 29th and October 1st.

Registration’s not yet open, but you can find out more at www.SQLBits.com

I’ve already submitted two sessions to present, “PowerPivot & QlikView 101” and “Advanced data modelling for data warehousing and cubes”. Make sure you vote for them when voting opens! You can read full summaries of all sessions submitted so far here.

SQL User Groups

We’ve also got a busy time coming up with the UK SQL Server user groups. I’ll be running the next Birmingham user group the week after SQLBits, on Monday 3rd October. The agenda and speakers have yet to be set, but it should be a goodun!

I’ll also be speaking at the Southampton user group in 2 weeks’ time, on Wednesday 6th July. I’ll be presenting my 3-part blog post series on automating SSAS cube documentation. Matt Whitfield will also be presenting, so it should be a good night. Now stop reading this and go and register!!

© Alex Whittles, Purple Frog Systems Ltd

Automating OLAP cube documentation – SQLBits presentation

For anyone who missed my presentation at SQLBits 8 in April, the video is now available here.


In this 1 hour session I present a method of automating the creation of documentation for SSAS OLAP cubes by using DMVs (dynamic management views) and spatial data, querying the metadata of the cube in real time.

The results include the BUS matrix, star schemas, attribute lists, hierarchies etc. and are all presented in SSRS.


The blog posts to go with this are here:


You can view the slide deck here

© Alex Whittles, Purple Frog Systems Ltd

I’m a happy chap. Why? Because I read a blog post yesterday by T.K. Anand (SSAS Principal Group Program Manager) about the vision and roadmap of Analysis Services.

There were slightly concerning questions last November (following the PASS conference) surrounding the future of Analysis Services, or more specifically the UDM, the dimensional model that we all know and love in SSAS 2005 & 2008. The arrival of PowerPivot into the Microsoft BI arsenal has without a doubt moved the goalposts and added significant power, flexibility and usability to the BI stack. My concern, along with others (most notably Chris Webb, who sparked somewhat of a stampede on the issue), was for the future of the UDM and the multitude of existing dimensional systems out in the field. Is the dimensional approach being phased out? Will it be supported in future editions? Will it be enhanced? Will the future BISM support the complexity and power we currently have with the UDM?

There’s no doubt that the overall approach to business intelligence is evolving. And this isn’t just in the cube space, it obviously has a direct effect on all other aspects of the BI strategy; the data warehouse, reporting layer, ETL etc.


From a BI consultant’s point of view, I don’t want to be recommending tools to my clients which have a restricted life span and don’t provide them with a future-proof upgrade path.

From a technology perspective, I’m a hardened supporter of the dimensional model. I recently designed a complex cube system for a banking client which had over 150 dimensions and facts, with thousands of lines of MDX to create a very sophisticated calculation framework for their liquidity modelling and loan profiling. I wouldn’t dream of doing that in a tabular approach like PowerPivot (in their current form).

From a personal point of view, where do I focus my attention in terms of training, research, blogging, user groups, conference sessions, etc.?


I should point out that I’m very excited by, and fully committed to the tabular/PowerPivot route (along with VertiPaq, Crescent, DAX, etc.) for systems that it is suited to. In fact I’m using it right now to prototype a global BI solution for a very large client. There are however some solutions that do not fit well with the tabular approach and are best suited to a dimensional approach. I’m in favour of a hybrid framework which allows the right tool to be used for the right system. And it looks like that’s what we’re going to get.

The guys at Microsoft have now evolved and clarified the roadmap, and have confirmed that the BISM (business intelligence semantic model, i.e. The core of Analysis Services in SQL Server Denali) will contain two parallel approaches that can both be used for whichever situations they are best suited to. More importantly, they are both here to stay, will both be developed further, and there will be a cross-availability of functionality and tools between them.

Multi Dimensional Model

Essentially the same as the existing UDM, the multi-dimensional data model will support MDX and ROLAP/MOLAP data access. Existing OLAP cubes in SQL 2008 will easily upgrade to this.

Tabular Model

Think of this as hosted PowerPivot. A tabular approach with a column based data store, DAX as the expression language and either VertiPaq or Direct Query for data access.


The two will co-exist side by side within a singular BISM, albeit initially with a degree of separation. In the upcoming CTP2 release (July 2011?) there will not be any cross-availability of functionality, i.e. VertiPaq, Crescent and DAX will not be available to the dimensional model. However TK makes it clear that this is a short term restriction in the CTP, and that Microsoft are committed to getting this cross availability in place in the finished product.

If you’re involved in BI in any way, I really do encourage you to go and read TK’s post in detail. The Business Intelligence world is changing. I now have total confidence that it’s for the better.

Frog-blog-out

© Alex Whittles, Purple Frog Systems Ltd

One of the most useful aspects of a Business Intelligence system is the ability to add calculations to create new measures. This centralises the logic of the calculation into a single place, ensuring consistency and standardisation across the user base.

By way of example, a simple calculation for profit (Income – Expenditure) wouldn’t be provided by the source database and historically would be implemented in each and every report. In a data warehouse and/or cube we can create the calculation in a single place for everyone to use.

This post highlights some of the methods of doing this, each with their respective pros and cons.

Calculated Members in SSAS Cube


SSAS provides a ‘Calculations’ tab in the cube designer which allows you to create new measures using MDX. You can use any combination of existing measures and dimension attributes, along with the plethora of MDX functions available to create highly complex calculations.
Pros:

  • Very complex calculations can be created using all available MDX functions
  • No changes are required to the structure of the data warehouse
  • Changes to the calculation will apply to every record, historic and new
  • The results are not stored in the warehouse or cube, so no extra space is required
  • New calculations can be added without having to deploy or reprocess the cube
  • Calculations can be scoped to any level of aggregation and granularity. Different calculations can even be used for different scopes
  • Calculations can easily combine measures from different measure groups

Cons:

  • The calculation will not make use of SSAS cube aggregations, reducing performance
  • SSAS drill through actions will not work
  • The calculation results are not available in the data warehouse, only the cube

SQL Calculations in the Data Source View


There’s a layer in-between the data warehouse and the cube called the data source view (DSV). This presents the relevant tables in the warehouse to the cube, and can be used to enhance the underlying data with calculations. This can either be done in the DSV layer within the cube project, or (my preference) in SQL Server views which encapsulate the logic.
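
As a minimal sketch (the table and column names here are illustrative, not from any real schema), a view encapsulating the profit calculation from the earlier example might look like this:

-- Illustrative only: wrap the fact table in a view so the cube's DSV
-- reads the calculated Profit column rather than the raw table.
CREATE VIEW [dbo].[vwFactFinance]
AS
SELECT  f.DateKey
       ,f.AccountKey
       ,f.Income
       ,f.Expenditure
       ,f.Income - f.Expenditure AS Profit   -- the calculation is defined once, for everyone
FROM    [dbo].[FactFinance] AS f;
GO
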
Pros:

  • No changes are required to the table structure of the data warehouse
  • Calculations use SQL not MDX, reducing the complexity
  • Changes to the calculation will apply to every record, historic and new
  • The calculation will make full use of SSAS cube aggregations
  • SSAS drill through actions will work
  • The results are not stored in the warehouse, so the size of the database does not increase

Cons:

  • The cube must be redeployed and reprocessed before the new measure is available
  • The results of the calculation must be valid at the granularity of the fact table
  • The calculation results are not available in the data warehouse, only the cube

Calculate in the ETL process


Whilst bringing in data from the source data systems, it sometimes makes sense to perform calculations on the data at that point, and store the results in the warehouse.
Pros:

  • The results of the calculation will be available when querying the warehouse as well as the cube
  • In the ETL pipeline you can import other data sources (using lookups etc.) to utilise other data in the calculation
  • If the calculation uses time based data, or data valid at a specific time (e.g. a share price), then by performing the calculation in the ETL the correct time based data is used, without having to store the full history of the underlying source data (see the sketch after this list)
  • The calculation will make full use of SSAS cube aggregations
  • SSAS drill through actions will work
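
A minimal sketch of that time-based point (every table and column name here is hypothetical):

-- Hypothetical example: value each holding using the share price valid on the
-- trade date, looked up during the ETL load, so the warehouse doesn't need to
-- keep the full price history.
INSERT INTO dbo.FactHolding (DateKey, StockKey, Quantity, MarketValue)
SELECT  s.DateKey
       ,s.StockKey
       ,s.Quantity
       ,s.Quantity * p.SharePrice AS MarketValue
FROM    etl.Holding AS s
        INNER JOIN dbo.SharePrice AS p
            ON  p.StockKey  = s.StockKey
            AND p.PriceDate = s.TradeDate;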

Cons:

  • You have to be able to alter the structure of the data warehouse, which isn’t always an option.
  • The results are stored in the warehouse, increasing the size of the database
  • The results of the calculation must be valid at the granularity of the fact table
  • If the calculation logic changes, all existing records must be updated


In Conclusion

If the calculation is valid for a single record, and it would be of benefit to have access to the results in the warehouse, then perform the calculation in the ETL pipeline and store the results in the warehouse.

If the calculation is valid for a single record, and it would not be of benefit to have the results in the warehouse, then calculate it in the data source view.

If the calculation is too complex for SQL, requiring MDX functions, then create an MDX calculated measure in the cube.

© Alex Whittles, Purple Frog Systems Ltd