In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.

In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…

What is a Slowly Changing Dimension (SCD)?

If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place, I’m assuming a moderate level of experience of SCDs here, check out Wikipedia for some background, but in short, they manage the tracking of attribute history in dimensional data warehouses.

Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.

  • Type 0 – Ignore updates
  • Type 1 – Only keep latest version
  • Type 2 – Track history by creating a new row

Type 2 is commonly stored in a fashion similar to this.

Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.

That’s enough of that, let’s get on with doing this using Merge

Using Merge to load SCD

The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.

We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance as we go. First lets figure out what logic we want to perform

  • If the record doesn’t exist, create it
  • If the record does exist
    • Type 0 fields – ignore
    • Type 1 fields – update fields
    • Type 2 fields – terminate existing record, insert a new record
  • If the record exists in the dimension, but not in the updated source file – terminate record

The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.

The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. Firstly we terminate the existing row then we have to insert a new row.

The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.

We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.

Let’s look at an example. Download the code here which will create the necessary tables and data to work on.

Main Merge Statement

We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:

  • We include IsRowCurrent into the joining clause. We only ever want to update the current records, not the history.
  • DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a type 1 change, if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 later on
  • The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated)
 MERGE Customer        AS [Target]
 USING StagingCustomer AS [Source]
    ON Target.Email        = Source.Email
   AND Target.IsRowCurrent = 1
 WHEN MATCHED AND
     (
          Target.FirstName <> Source.FirstName
       OR Target.LastName  <> Source.LastName
       OR Target.Title     <> Source.Title
     )
     THEN UPDATE SET
        IsRowCurrent     = 0
       ,LastUpdated      = GETDATE()
       ,ValidTo          = GETDATE()
 WHEN NOT MATCHED BY TARGET
     THEN INSERT (
         FirstName
        ,LastName
        ,Title
        ,DoB
        ,Email
        ,LastUpdated
        ,IsRowCurrent
        ,ValidFrom
        ,ValidTo
       ) VALUES (
         Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,Source.Email
        ,GETDATE()      --LastUpdated
        ,1              --IsRowCurrent
        ,GETDATE()      --ValidFrom
        ,'9999-12-31'   --ValidTo
       )
 WHEN NOT MATCHED BY SOURCE
     THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated  = GETDATE()
        ,ValidTo      = GETDATE()

The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.

The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.

The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.

We then add an OUTPUT clause to the end of the statement

 OUTPUT $action AS Action
       ,Source.*

The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $Action as an extra field which will identify, for each row, whether

it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATES, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*

We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.

INSERT INTO Customer
   ( FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,LastUpdated
    ,IsRowCurrent
    ,ValidFrom
    ,ValidTo
   )
SELECT
     FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,GETDATE()    --LastUpdated
    ,1            --IsRowCurrent
    ,GETDATE()    --ValidFrom
    ,'9999-12-31' --ValidTo
FROM (
  MERGE Customer        AS [Target]
  USING StagingCustomer AS [Source]
     ON Target.Email        = Source.Email
    AND Target.IsRowCurrent = 1
  WHEN MATCHED AND
      (
           Target.FirstName <> Source.FirstName
        OR Target.LastName  <> Source.LastName
        OR Target.Title     <> Source.Title
      )
      THEN UPDATE SET
         IsRowCurrent     = 0
        ,LastUpdated      = GETDATE()
        ,ValidTo          = GETDATE()
  WHEN NOT MATCHED BY TARGET
      THEN INSERT (
          FirstName
         ,LastName
         ,Title
         ,DoB
         ,Email
         ,LastUpdated
         ,IsRowCurrent
         ,ValidFrom
         ,ValidTo
        ) VALUES (
          Source.FirstName
         ,Source.LastName
         ,Source.Title
         ,Source.DoB
         ,Source.Email
         ,GETDATE()      --LastUpdated
         ,1              --IsRowCurrent
         ,GETDATE()      --ValidFrom
         ,'9999-12-31'   --ValidTo
        )
  WHEN NOT MATCHED BY SOURCE
      THEN UPDATE SET
          IsRowCurrent = 0
         ,LastUpdated  = GETDATE()
         ,ValidTo      = GETDATE()
  OUTPUT $action AS Action
        ,[Source].*
) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
  AND Email IS NOT NULL
;

Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.

So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.

We have two options here;

  • Update all historical records to the new value
  • Update only the current record to the new value

These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.

This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.

  UPDATE C
      SET DoB         = SC.DoB
         ,LastUpdated = GETDATE()
  FROM Customer C
     INNER JOIN StagingCustomer SC
          ON C.Email        =  SC.Email
       --AND C.IsRowCurrent =  1      --Optional
         AND C.DoB          <> SC.DoB

This block of code updates the Type 1 attributes (in this case, DoB). The line 7 (the IsRowCurrent) check is optional depending on whether you only want to update current or all records.

So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.

I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.

This is now my preferred approach to loading Type 2 SCDs, slightly faster methods may be available, but as we’ll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.

Long live the Merge statement!

Frog-Blog Out

A number of Frog-Blog posts over the next couple of months are going to make heavy use of the awesome SQL Server MERGE statement, introduced in SQL Server 2008. I thought it best to write an introduction post to provide the basics and groundwork for future posts.

So what does the Merge statement do?

Merge upserts data into a table. If, like me, you find the word ‘upsert’ more than a little ridiculous, we can scrap it in favour of a slightly more sensible description; It both inserts and updates data at the same time. Much better.

How does it do it?

In a similar way in which we’d use an UPDATE statement, we need to provide a source table and a destination table. We then give it a method of matching rows between them, usually the primary key or business/natural key (this can use multiple fields).

We can then specify a number of actions depending on whether a match is found or not.

If a match is found we may want to update the existing record.

If a match isn’t found then it’s likely that we’ll want to insert a new record.

What’s the syntax?

First we set the tables and the joining key(s)

MERGE [Destination Table] AS Target
   USING [Source Table] AS Source
     ON Target.KeyField = Source.KeyField

Then we define what we want to happen if we find a match (optional)

  WHEN MATCHED

If you want you can specify other conditions, such as checking whether anything has changed

     AND  (
         Target.Field1 <> Source.Field1
         OR Target.Field2 <> Source.Field2
         )

We can then perform an action. i.e. an update, delete, etc.

  THEN UPDATE SET
        Field1 = Source.Field1
      , Field2 = Source.Field2

Then we define what we want to happen if we don’t find a matching record (optional)

   WHEN NOT MATCHED

In this case, we want to insert the new record

      THEN INSERT (
             KeyField
           , Field1
           , Field2
        ) VALUES (
             Source.KeyField
           , Source.Field1
           , Source.Field2
      );

As this is the last part of the statement we have to terminate the merge with a semi colon.

Time for an example

We’ve got a customer table, and we receive an updated set of customer details which we want to import into our table. Assuming we’ve imported the new details into a staging table, we can use merge to do the work for us.

Use this script to create the necessary tables & data.

Our existing Customer table should look like this

And our staging table (new customer data) should look like

Firstly lets define the logic that we’re looking for.

  • Our new customer data doesn’t include the CustomerID, so we can’t match on the primary key. Instead we’ll use the email address as the key.
  • If the customer already exists then we want to check whether the fields are different, if they are then update the customer record.
  • If the customer doesn’t exist then insert them as a new customer record.
  • The Customer table contains audit fields which must be maintained. DTInserted should contain the datetime when the record was first created. DTUpdated should be updated whenever any of the data in a row is changed.

This is all performed using the following statement

MERGE Customer        AS [Target]
USING StagingCustomer AS [Source]
   ON Target.Email = Source.Email
WHEN MATCHED AND
    (
         Target.FirstName <> Source.FirstName
      OR Target.LastName  <> Source.LastName
      OR Target.Title     <> Source.Title
      OR Target.DoB       <> Source.DoB
    )
    THEN UPDATE SET
       FirstName = Source.FirstName
      ,LastName  = Source.LastName
      ,Title     = Source.Title
      ,DoB       = Source.DoB
      ,DTUpdated = GETDATE()
WHEN NOT MATCHED
    THEN INSERT (
        Email
       ,FirstName
       ,LastName
       ,Title
       ,DoB
       ,DTInserted
       ,DTUpdated
      ) VALUES (
        Source.Email
       ,Source.FirstName
       ,Source.LastName
       ,Source.Title
       ,Source.DoB
       ,GETDATE()
       ,GETDATE()
      );

It’s a long statement and looks complex at first glance, but as soon as you break it down it’s remarkably simple and easy to write.

If you run this, you should see 4 rows affected – 2 new rows inserted (Jack & Mary) and 2 rows updated (Phil & Jane). The new data has been inserted/updated into the Customer table and the audit fields have been kept up to date.

Note however that (if you look at the data in the sample scripts) Sarah & Andy exist in the Customer table but are missing from the new customer list in the update file. With the code above they just stay as they are in the Customer table, we assume they haven’t changed. What if we wanted to identify these customers and flag them as inactive or even delete them?

If we look at the ‘WHEN NOT MATCHED’ line, what this is actually doing is defaulting to ‘WHEN NOT MATCHED BY TARGET’. i.e. if you don’t find the record in the target table then do something.
We can add an extra section for ‘WHEN NOT MATCHED BY SOURCE’ which will allow us to specify an action when a row in our Customer table doesn’t have a matching record in the source staging data.

We can add:

WHEN NOT MATCHED BY SOURCE
    THEN DELETE

Which will simply delete the row in the Customer table, or we could update the record and terminate it using something like this.

WHEN NOT MATCHED BY SOURCE
    THEN UPDATE SET
        IsActive = 0

Don’t forget that the entire merge statement needs to be terminated with a semi colon, and we should also update the audit field. We also only want to update the record once, so check that the customer isn’t already terminated. We therefore end up with:

MERGE Customer        AS [Target]
USING StagingCustomer AS [Source]
   ON Target.Email = Source.Email
WHEN MATCHED AND
    (
         Target.FirstName <> Source.FirstName
      OR Target.LastName  <> Source.LastName
      OR Target.Title     <> Source.Title
      OR Target.DoB       <> Source.DoB
    )
    THEN UPDATE SET
       FirstName = Source.FirstName
      ,LastName  = Source.LastName
      ,Title     = Source.Title
      ,DoB       = Source.DoB
      ,DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET
    THEN INSERT (
        Email
       ,FirstName
       ,LastName
       ,Title
       ,DoB
       ,IsActive
       ,DTInserted
       ,DTUpdated
      ) VALUES (
        Source.Email
       ,Source.FirstName
       ,Source.LastName
       ,Source.Title
       ,Source.DoB
       ,1
       ,GETDATE()
       ,GETDATE()
      )
WHEN NOT MATCHED BY SOURCE
    AND Target.IsActive=1
    THEN UPDATE SET
        IsActive = 0
       ,DTUpdated = GETDATE()
      ;

Which leaves our Customer table looking like this

There are many variations and uses for this, we’ll explore its use in data warehouse dimension loading over the next couple of posts.

For now, Frog-Blog out.

[UPDATE] The next blog post in this series is now available: using T-SQL Merge to load Type 2 SCDs into a data warehouse dimension.

Just a quick tidbit today, but a very useful one.
Ever need to do SELECT Count(*) on a very large table? It can take ages.

Instead, use sp_spaceused

EXEC sp_spaceused 'MyTable'

It returns you a useful result set like

name rows reserved data index_size unused
MyTable 19690290 13637488 KB 3001688 KB 9986416 KB 649384 KB

Not only does it also give you extra useful information about the table, but it comes back virtually instantly.

Columns:
  name – name of the object requested
  rows – the number of rows in the table (or service broker queue)
  reserved – the total reserved space for the object
  data – amount of space used by the data
  index_size – amount of spaced used by indexes in the object
  unused – total space reserved for the object but not yet used

You can also use sp_spaceused without any parameters, and it will return you the stats for the whole database.

Frog-Blog Out

PowerPivot Vs QlikView 101 – SQLBits Video

The video is now available from my PowerPivot and QlikView talk at SQLBits 9 in Liverpool in September 2011. You can download or watch the video here.


In this 1 hour session I create interactive dashboards from scratch in both PowerPivot and QlikView, showing how to set up the data model, overcome common pitfalls and build the dashboards. I create almost identical dashboards in both systems and highllight the pros and cons of each system.


PowerPivot QlikView


You can find the scripts and code to go with this video in this blog post


SQLBits 10 is coming!!

Today the SQLBits organisers have announced that SQLBits 10 will be held in London between 29th – 31st March 2012 in the Novotel London West. It’s going to be even bigger and better, so keep an eye on the website and get your tickets early.

Thank you to the SQLBits committee, speakers, helpers and attendees, for making SQLBits 9 one of the best yet. What a great example SQLBits is of the power of the SQL Server community coming together to help each other, and enjoy a few beers in the process!

QlikView PowerPivotI was delighted to have my “PowerPivot & QlikView 101″ session chosen by the committee, which I presented on Saturday afternoon. Thank you to all those who attended, great to see such a good number there. If you haven’t yet submitted your feedback for my session, and the event as a whole, please do take the time to do it. It helps everyone improve the quality of the event for next time.

For those who atttended my session, here are the slides, with a summary of the results and code etc.

Slides: Download the PowerPivot & QlikView 101 slide deck here

QlikView Script: Download the QlikView Script here

PowerPivot Workboook: Download the PowerPivot workbook here

Session Video: The session recording isn’t yet available, I’ll post a link here when it’s ready.

QlikView Expressions

Sales Amount Visual Cue for the green shading
   Upper: (SUM(TOTAL SalesAmount) / COUNT(TOTAL SalesTerritoryRegion))*1.2
   Lower: (SUM(TOTAL SalesAmount) / COUNT(TOTAL SalesTerritoryRegion))*0.8

Sales Amount blue bar Guage maximum
   max(ALL Aggr(SUM(SalesAmount),OrderMonth))

Sales Amount Year to Date
   rangesum(above(sum(SalesAmount),0,12))

Sales Amount Previous Year
   SUM({<OrderYear={‘$(=Max(OrderYear)-1)’}>} SalesAmount)

PowerPivot DAX Expressions

Sales Amount Year to Date
   =TOTALYTD(SUM(FactInternetSales[SalesAmount]), DimOrderDate[FullDateAlternateKey], ALL(DimOrderDate))

Sales Amount Previous Year
   =CALCULATE(SUM(FactInternetSales[SalesAmount]), SAMEPERIODLASTYEAR(DimOrderDate[FullDateAlternateKey]), ALL(DimOrderDate))

 

PowerPivot Results

PowerPivot Dashboard

 

QlikView Results

QlikView Dashboard

If you have any questions about any of this, please get in touch.

Frog-Blog-Out

What an exciting few weeks we’ve got to look forward to!

SQL Bits starts on Thursday this week which promises to be another fantastic event. I’m speaking on the Saturday, presenting a session on QlikView and PowerPivot. When I’m not speaking I’ve got a list of duties as long as my arm, from marshelling to room monitoring, and of course representing the Community Corner.

If you’re interested in SQL events local to you, then come and say hi to us on the Community Coener stand – a number of user group leaders in the UK will be on hand to provide any info you want regarding the SQL community, and point you towards events happening in your area.

After SQLBits, we’ve got the SQL Relay to look forward to. Every SQL Server user group in the UK is putting on an event between 3rd and 5th October 2011, with a grand finale event in London at Microsoft’s Cardinal Place on the 6th. Every event has great speakers, from MVPs to Microsoft employees, and in London we have renowned author and blogger Itzik Ben-Gan. We also have a host of fantastic prizes to give away at every event.

What makes all of this even better is that there are a number of new user groups launching during this week, including Maidenhead, Cambridge and Essex. There’s a group near you so head over to www.SQLServerFAQ.com and get registered!

Look forward to seeing you at one of the events!

Back in September 2010 I spent another evening with the Cardiff SQL Server User Group, run by Adam Morton. Having got so much out of the Cardiff group since its inception, both as an attendee and as a regular speaker, it struck me as a shame that there wasn’t a group represting the SQL community in my area of the UK, the Midlands.

Following a few conversations with Adam Morton, Tony Rogerson, Chris Testa-O’Neil and others, Hollie (my patient and supportive Wife and business partner!) and I decided to launch a new SQL Server user group for the Midlands.

With a few months of planning, we invited Allan Mitchell and Neil Hambley to speak at our first event in March 2011, held in a fabulous pub in central Birmingham, the Old Joint Stock. We’re now planning our third event to be held at the start of October as part of the national SQL Relay event, and the group is growing from strength to strength with 30 attendees at the last meeting.

Now, in August 2011, we’re delighted that we’ve been accepted as an official PASS chapter (The Professional Association of SQL Server). This will help provide global recognition of the group, increasing its visiblity and hopefully generating even more awareness and attendees.

To mark the event, we’ve launched a new website for the group www.SQLMidlands.com

We’re incredibly lucky to have such an active, diverse and friendly SQL Server community in the UK. You only have to look at the phenominal success of the SQLBits conference as well as the user groups and SQLSocial. I’ve personally gained a lot from being a part of it, in terms of learning skills, networking, presenting talks and making new friends. I’d like to take this opportunity to thank the community leaders for all their hard work, especially the organisers of SQLBits, who go above and beyond the call of duty every 6 months.

And it’s not all about SQL Server, the SQLSocial event last week was pretty much just an excuse to go to the Great British Beer Festival and celebrate Simon Sabin’s birthday!

If you’ve not got involved before, ask yourself these questions:

  1. Do you want to learn more about SQL Server?
  2. Do you want to meet other SQL Server beginners or professionals, and discuss tech, career, beer, or anything else that springs to mind?
  3. Are you new to this and just want to know what’s out there?
  4. Do you want an opportunity to present a SQL Server based talk to like minded people?

If you answer yes to any of the above, then here are some resources for you

SQLServerFaq – All SQL Server User Groups in the UK

SQL Bits – Europe’s largest and best SQL Server conference, held every 6 months

SQL Social – Does what it says on the tin!

SQL Pass – SQL Server conference held in Seattle, U.S.A.

SQL Rally Nordic – SQL Server conference in Sweeden

The above events are not only top quality, with world leading speakers, some are also free. The user groups, SQLSocial and the community day of SQLBits are totally free – where else can you get that kind of training without paying a fortune?!

You’ll even get to hear me present another talk at SQLBits, I’m talking about PowerPivot and QlikView, with a level 200 introduction into self service BI and how these two products compare.

With so much going on in the UK and abroad, and with such quality, help yourself as well as the community – get yourself registered for some of the events.

Alex

One of my favourite features of SSIS is the script component, and I know I’m not alone. Why? Because it brings the entire might of the .NET framework to SSIS, providing C# (in SQL 2008 onwards) and VB.NET extensibility for all those times where SSIS doesn’t quite have enough functionality out of the box.

Such as?

Well a problem I’ve come across a number of times is string parsing. Trying to search for and extract a specific pattern of characters from a larger text string. I’ve seen developers build crazy convoluted expressions in the derived column transform, some of which are very impressive in their complexity! This is a bad thing not good, although it shows a level of prowess in building SSIS expressions (not the most intuitive expression language!), it becomes a support nightmare for other developers.

Take for example extracting a house number from the first line of an address, we want to convert “29 Acacia Road” into “29”

Or extracting a version number from a product, converting “iPhone 4G, IOS 4.3.2, black” into “4.3.2”

Or extract the html page from a URL, converting “http://www.wibble.com/folder/page.aspx” into “page.aspx”

Regular Expressions

The .NET framework includes full support for regular expressions (Regex), in the System.Text.RegularExpressions namespace. Regex provide an incredibly powerful way of defining and finding string patterns. They take some getting used to, but once you get the hang of them you can unleash their power in your SSIS dataflow pipelines.

To find out more about regular expressions, look at the following links

 

Let’s look at an example

Let’s take our first example from above, extracting a house number, converting “29 Acacia Road” into “29”.

The first thing we need to do is define our Regex search pattern. In this case we know that it must be at the start of the string, and must be an integer, with any number of characters 0-9.

The pattern for this is “^[0-9]+”, which is broken down as
    ^ means the start of the line
    [0-9] means any number
    + means 1 or more of the preceding item.
    i.e. 1 or more integers at the start of the line.

What if we wanted this to also cope with a single letter after the number? i.e. “221b Baker Street”

We can add “[A-Za-z]?” to our pattern, in which
    [A-Za-z] means any character A-Z in either upper or lower case
    ? means 0 or 1 occurrences of this

We should also add “\b” to the end of this, which indicates a word boundary. This means that 221b should be a whole word, not part of a larger sequence “221BakerSt”. We can wrap up the [A-Za-z]\b together into brackets so that the ? applies to the combination, so that any single letter must be the end of a word or it will be ignored. In this way “221BakerSt” will return 221, as will “221 Baker St”, whereas “221B Baker St” will return “221B”.

So our new pattern is “^[0-9]+([A-Za-z]\b)?”

You’ve probably gathered by now that regular expressions can get quite complicated. I’m not going to go into any more detail about them here, but hopefully this gives you some idea of what they can do. There’s plenty of reading on the web if you want to know more. You should also make use of the Regex expression tester in the link above – it will save you lots of debugging!

How do we use Regular Expressions in SSIS?

Well it turns out this is the easy bit, with the help of the script component.

Step 1 – Add a script component into your data flow pipeline, configure it as a data transform. I’m using C#, but you can use VB.NET if you want

Step 2 – Give the script access to the input and output columns

Open the script component and select the input field from the “Input Columns” screen, in this case “Address1”. This can be ReadOnly.

Go to the “Inputs and Outputs” screen and add an output column to “Output 0”. We want to set the datatype to string, length 10. This new field will contain the results of our Regex pattern matching.

Step 3 – Create the script code

Click on “Edit Script” on the Script screen which will open up Visual Studio.

Add a reference to System.Text.RegularExpressions at the top of the script


      using System.Text.RegularExpressions;

Then place the necessary code in the Input0_ProcessInputRow function.


    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        //Replace each \ with \\ so that C# doesn't treat \ as escape character
        //Pattern: Start of string, any integers, 0 or 1 letter, end of word
        string sPattern = "^[0-9]+([A-Za-z]\\b)?";
        string sString = Row.Address1 ?? ""; //Coalesce to empty string if NULL

        //Find any matches of the pattern in the string
        Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
        //If a match is found
        if (match.Success)
            //Return the first match into the new
            //HouseNumber field
            Row.HouseNumber = match.Groups[0].Value;
        else
            //If not found, leave the HouseNumber blank
            Row.HouseNumber = "";
    }




When you save and exit the script, any component downstream of the script component will have access to the new HouseNumber field.

Flog-Blog-Out

SQLBits 9

Submit a session for SQLBits

It only seems like yesterday since the SQLBits crew put on their awesome 8th conference in Brighton. The next conference has just been announced, SQLBits will move to Liverpool for SQLBits 9 – “Query Across The Mersey” (…Really?!) between September 29th and October 1st.

Registration’s not yet open, but you can find out more at www.SQLBits.com

I’ve already submitted two sessions to present, “PowerPivot & QlikView 101″ and “Advanced data modelling for data warehousing and cubes”. Make sure you vote for them when voting opens! You can read full summaries of all sessions submitted so far here.

SQL User Groups

We’ve also got a busy time coming up with the UK SQL Server user groups. I’ll be running the next Birmingham user group the week after SQLBits, on Monday 3rd October. The agenda and speakers have yet to be set, but it should be a goodun!

I’ll also be speaking at the Southampton user group in 2 weeks time, on Wednesday 6th July. I’ll be presenting my 3 part blog post on automating SSAS cube documentation. Matt Whitfield will also be presenting, so it should be a good night. Now stop reading this and go and register!!

Automating OLAP cube documentation – SQLBits presentation

For anyone that missed my presentation at SQLBits 8 in April, the video is now available here.


In this 1 hour session I present a method of automating the creation of documentation for SSAS OLAP cubes by using DMVs (dynamic management views) and spatial data, querying the metadata of the cube in realtime.

The results include the BUS matrix, star schemas, attribute lists, hierarchies etc. and are all presented in SSRS.


The blog posts to go with this are here:


You can view the slide deck here

The Frog Blog

I'm Alex Whittles.

I specialise in designing and implementing SQL Server business intelligence solutions, and this is my blog! Just a collection of thoughts, techniques and ramblings on SQL Server, Cubes, Data Warehouses, MDX, DAX and whatever else comes to mind.

Submit a session for SQLBits

Frog Blog Out
twitter
rssicon