Posts Tagged ‘SSIS’

Recently I’ve been using LogParser a lot to import log files, such as W3C logs from an IIS web server. The distasteful format of these files is made palatable by the awesomeness that is LogParser; it just takes care of it for you. (Check out this SQLBits session on LogParser by Jonathan Allen (Twitter | Blog) for a good intro to LogParser.)

However I’ve been suffering with very poor performance with large files, so started to investigate further.

Exporting a 5Mb IIS log to a csv file was taking 0.4 seconds, whereas exporting the same log to SQL Server was taking over 15 times longer. This isn’t a problem for a small file, but processing a 0.5Gb log file is somewhat of a different matter!

Running a trace on SQL Server during an import revealed that each record being inserted was wrapped up in its own transaction. Importing 100,000 records into SQL was generating 100,000 insert statements and 100,000 transactions. In the trace below you can see the transactions either side of the insert (using exec sp_cursor…).

Log Parser Transactions

This was executed using the following command:

  • LogParser.exe “SELECT * INTO staging.test FROM Test.log” -i:IISW3C -o:sql -server:localhost -database:Staging -cleartable:ON

 

I have to admit that I’m surprised by this – I assumed that LogParser would open up a single transaction for the bulk insert. It turns out however that you can control this behaviour using LogParser’s transactionRowCount parameter.

The ‘-transactionRowCount’ parameter defaults to 0, which auto-commits every single row. It can be changed to any number you like, or to -1, which will commit all rows in a single transaction. The choice is yours.
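For example, to commit in batches of 10,000 rows rather than one row at a time (the batch size here is purely illustrative), the same import could be run as:

  • LogParser.exe “SELECT * INTO staging.test FROM Test.log” -i:IISW3C -o:sql -server:localhost -database:Staging -cleartable:ON -transactionRowCount:10000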

By changing this parameter, the performance difference can be huge on a large file.

The following is an extract of the trace on the same file, this time executed using this command:

  • LogParser.exe “SELECT * INTO staging.test FROM Test.log” -i:IISW3C -o:sql -server:localhost -database:Staging -cleartable:ON -transactionRowCount:-1

Log Parser Single Transaction

You can clearly see that only a single transaction is opened, with all inserts batched together.

For processing large files this can make a very significant difference. For an 11Mb IIS log with just over 18k records on one of my test servers:

  • Default behaviour: 38.26 seconds
  • Single transaction: 2.92 seconds

This is a performance improvement of 92% – which you can’t grumble about.

If you’re calling LogParser from C# script within SSIS (as I usually do) then you can set this same behaviour using the transactionRowCount property of the LogQueryClassClass object (before you call ExecuteBatch!).

 

Frog-Blog-Out

© Alex Whittles, Purple Frog Systems Ltd

I stumbled upon a gem of a tool this morning, written by Darren Green (blog | twitter) of SQLBits & BIDS Helper fame: the SSIS Expression Editor and Tester.

Kudos to Darren, this is a great little tool for fine tuning and testing those complex SSIS expressions without having to keep running tests within SSIS. It’s downloadable in 3 versions, for SSIS 2005, 2008 & 2012, and doesn’t even need installing; it just runs as an exe.

SSIS-tastic.

© Alex Whittles, Purple Frog Systems Ltd

Well after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.

During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions. Specifically loading Type 2 SCDs using SSIS.

For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.

I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:

The charts below show the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.

The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.

Full test results and analysis in the PDF download above.

My previous few posts show how using a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes the performance is equivalent to more traditional approaches.

Hope this is of use. If you want to know a bit more without reading the full 99 pages & 23k words (who could blame you?!), then my SQLBits talk video is now on-line here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [When this post was written] but should be up soon. Keep checking on the BI PASS Chapter website.

Frog-Blog-Out

© Alex Whittles, Purple Frog Systems Ltd

SQLBits X Video Now Available

The video of my talk at SQLBits X is now available on the SQLBits website here. The talk was focused on presenting the results of my MSc Business Intelligence dissertation, comparing the performance of different methods of using SSIS to load data warehouse dimensions, specifically type 2 SCDs.

The talk also covers a comparison of the performance between traditional hard disks and solid state storage systems such as Fusion IO.

I then present a method of using the T-SQL Merge statement to automate a significant part of the ETL process.

You can find the code behind the demos on various recent Frog-Blog posts, and there is more to come, so look back soon!

 

PASS BI Virtual Chapter Talk

For those who would rather hear the talk presented live, or want to ask any questions, please join me for a repeat of this talk over Live Meeting for the PASS BI Virtual Chapter on Thursday 28th June, at 5pm UK time, 12pm EDT (US). You can find the details on the PASS BI chapter website here.

© Alex Whittles, Purple Frog Systems Ltd

This is the 3rd post in the Frog-Blog series on the awesomeness of T-SQL Merge.

In this post we’ll be looking at how we can automate the creation of the merge statement to reduce development time and improve reliability and flexibility of the ETL process. I discussed this in the 2nd half of a talk I gave at the UK technical launch of SQL Server 2012 at SQLBits X. Thank you to the great audience who came to that talk, this post is for your benefit and is a result of the feedback and requests from you guys.

Why automate merge?

As we saw in the previous post, merge is an incredibly powerful tool when loading data into data warehouse dimensions (specifically SCDs – slowly changing dimensions). The whole process can be wrapped up into a very neat stored proc which can save a considerable amount of time writing the equivalent functionality in SSIS. In the next installment of this series I’ll be discussing the performance of it compared to other methods of loading SCDs in SSIS (take a look at the SQLBits talk video [when it's released] for a preview!). Suffice to say for now that in my [pretty comprehensive] tests it’s one of the fastest methods of loading SCDs.

If you missed the talk, you can download the slide deck here whilst you’re waiting for the video.

The problem that stops a lot of people using merge is the perceived complexity of the statement. It can be very easy to get things wrong, with pretty bad consequences on your dimension data.

The easiest way to avoid this complexity and simplify the process is to not write merge statements at all, but to let an automated procedure do it for you – Simples!

The other huge benefit is that, as we’ll see during this post, you can base the automation procedure on metadata, meaning that you can change the SCD functionality of your dimensions just by changing metadata, and not rewriting your code.

Note that in this post we’ll just be looking at Type 0 and 1 SCDs, not 2, 3 or 6. This is to keep things simple. Once you’ve mastered type 0 and 1, it’s a logical next step to expand things to deal with type 2s.

OK, so how do we do this?

First of all we need to set up two tables to use. Let’s create a simple Customer dimension. Alongside this we also need a staging table. I’m a big fan of using schemas to differentiate tables, so we’ll create dim.Customer and etl.Customer as our two tables.

CREATE SCHEMA [dim] AUTHORIZATION [dbo]
GO
CREATE SCHEMA [etl] AUTHORIZATION [dbo]
GO

CREATE TABLE [dim].[Customer](
    [CustomerKey]   [int] IDENTITY(1,1) NOT NULL,
    [Email]         [varchar](255)      NOT NULL,
    [FirstName]     [varchar](50)       NOT NULL,
    [LastName]      [varchar](50)       NOT NULL,
    [DoB]           [date]              NOT NULL,
    [Sex]           [char](1)           NOT NULL,
    [MaritalStatus] [varchar](10)       NOT NULL,
    [FirstCreated]  [date]              NOT NULL,
    [IsRowCurrent]  [bit]               NOT NULL,
    [ValidFrom]     [datetime]          NOT NULL,
    [ValidTo]       [datetime]          NOT NULL,
    [LastUpdated]   [datetime]          NOT NULL,
 CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED 
(
	[CustomerKey] ASC
))
GO

CREATE TABLE [etl].[Customer](
    [Email]         [varchar](255)  NOT NULL,
    [FirstName]     [varchar](50)   NOT NULL,
    [LastName]      [varchar](50)   NOT NULL,
    [DoB]           [date]          NOT NULL,
    [Sex]           [char](1)       NOT NULL,
    [MaritalStatus] [varchar](10)   NOT NULL,
    [FirstCreated]  [date]          NOT NULL
)

So the dim table contains our primary surrogate key, business key (email address in this case), customer details and a series of audit fields (IsRowCurrent, ValidFrom, etc.). The etl staging table only contains the business key and customer details.

We then need to store the details of each field, i.e. how each field should be interpreted – is it a primary key, business key, type 0 or 1, or an audit field? We need this so that we can put the correct fields into the correct place in the merge statement. You could create a table to store this information, however I prefer to use the extended properties of the fields.

EXEC sys.sp_addextendedproperty @level2name=N'CustomerKey',  @value=N'PK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Email',        @value=N'BK' ,    
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstName',    @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastName',     @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'DoB',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'Sex',          @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'MaritalStatus',@value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'FirstCreated', @value=N'1' ,     
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidFrom',    @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'ValidTo',      @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'IsRowCurrent', @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'
EXEC sys.sp_addextendedproperty @level2name=N'LastUpdated',  @value=N'Audit' , 
    @name=N'SCD', @level0type=N'SCHEMA',@level0name=N'Dim', 
    @level1type=N'TABLE',@level1name=N'Customer', @level2type=N'COLUMN'

Or you can obviously just enter the extended properties manually using SSMS.

The SSIS package should output all customer records into the etl table, with no regard for whether they are new customers, old customers, changed or not. The merge statement will take care of that.

Automating Merge

The first stage is to examine the structure of merge.

   MERGE   [DIMENSION TABLE]  as Target
   USING   [STAGING TABLE]    as Source
      ON   [LIST OF BUSINESS KEY FIELDS]
   WHEN MATCHED AND
         Target.[LIST OF TYPE 1 FIELDS] <> Source.[LIST OF TYPE 1 FIELDS]
      THEN UPDATE SET
         [LIST OF TYPE 1 FIELDS] = Source.[LIST OF TYPE 1 FIELDS]
   WHEN NOT MATCHED THEN INSERT
         [LIST OF ALL FIELDS]
      VALUES
         Source.[LIST OF ALL FIELDS]

The framework of keywords (MERGE, USING, ON, WHEN…) forms the skeleton of the statement, with the items in square brackets being the details specific to the dimension. It’s these bracketed items which we need to retrieve from the metadata of the dimension in order to create the full merge statement.

We can retrieve the extended properties using the sys.extended_properties catalog view. This allows us to pull out a list of all fields which have a specific extended property set, e.g. all PK fields, all BK fields, all type 2 fields and so on. If we then put a few of these queries into cursors, we can loop through them and build up a dynamic SQL query. Yes I know, dynamic SQL should be avoided and is evil etc., however… this use is an exception and does truly make the World a better place.
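As a flavour of the approach, a query along these lines pulls back the columns of dim.Customer flagged with a particular SCD value – a minimal sketch; the downloadable proc below does rather more work to build and quote the actual field lists:

    -- List the columns of dim.Customer flagged with a given SCD extended property value
    -- (swap 'BK' for 'PK', '1', 'Audit' etc. as required)
    SELECT  c.name AS ColumnName
    FROM    sys.extended_properties ep
            INNER JOIN sys.columns c
                ON  c.object_id = ep.major_id
                AND c.column_id = ep.minor_id
    WHERE   ep.class    = 1                 -- object/column extended properties
      AND   ep.name     = 'SCD'
      AND   ep.major_id = OBJECT_ID('dim.Customer')
      AND   CAST(ep.value AS varchar(20)) = 'BK'
    ORDER BY c.column_id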

I’m not going to explain the resulting proc in minute detail, so instead please just download it here and work through it yourself. I will however explain a couple of items which are pretty important:

It’s important to keep the naming convention of your dimensions consistent. This doesn’t mean that every dimension must be identical, some may need inferred member support, some may need type 2 tracking fields (e.g. IsRowCurrent) and some may not; the critical thing is that all of your fields, if they do exist, should be named consistently. The automation proc can then look for specific field names and include them in the merge statement if necessary.

There is a parameter in the proc called @Execute. This offers the possibility of either executing the resulting merge statement directly, or just printing out the statement. If you only want to use this to automate the development process then this allows you to do just that, you can then just copy and paste the resulting statement into SSIS or into a stored proc.
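For example, a call along these lines would just print the statement for review (the proc and parameter names here, other than @Execute, are purely illustrative guesses – check the downloaded script for the actual signature):

    -- Generate the MERGE statement for dim.Customer but only print it
    EXEC dbo.usp_BuildDimensionMerge @SchemaName = 'dim', @TableName = 'Customer', @Execute = 0

    -- Or generate and execute it in one go
    EXEC dbo.usp_BuildDimensionMerge @SchemaName = 'dim', @TableName = 'Customer', @Execute = 1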

Result

The automated generation of a T-SQL merge statement to handle type 0 & 1 SCDs!
Hopefully you can see how you can expand this to also cope with Type 2 SCDs, following the structure in my earlier posts.

Download the SQL scripts here
 
Frog-Blog Out

© Alex Whittles, Purple Frog Systems Ltd

In my last blog post I showed the basic concepts of using the T-SQL Merge statement, available in SQL Server 2008 onwards.

In this post we’ll take it a step further and show how we can use it for loading data warehouse dimensions, and managing the SCD (slowly changing dimension) process. Before we start, let’s have a quick catch up on what an SCD is…

What is a Slowly Changing Dimension (SCD)?

If you want a full explanation of slowly changing dimensions then you’ve come to the wrong place; I’m assuming a moderate level of experience with SCDs here, so check out Wikipedia for some background. In short, they manage the tracking of attribute history in dimensional data warehouses.

Most data warehouses contain type 0, 1 and 2 SCDs, so we’ll cope with those for now.

  • Type 0 – Ignore updates
  • Type 1 – Only keep latest version
  • Type 2 – Track history by creating a new row

Type 2 is commonly stored in a fashion similar to this:
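For illustration (the names and dates below are invented), the dimension might hold two versions of the same customer like this:

    Email            FirstName  LastName  Title  IsRowCurrent  ValidFrom   ValidTo
    jane@wibble.com  Jane       Smith     Miss   0             2010-01-15  2011-06-30
    jane@wibble.com  Jane       Jones     Mrs    1             2011-06-30  9999-12-31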

Both records show the same customer but in this case Jane got married and changed her name & title. We terminate the old record by setting IsRowCurrent=0 and create a new record with the new details. Each row also contains ValidFrom and ValidTo dates which allow us to identify the correct record for a particular point in time.

That’s enough of that, let’s get on with doing this using Merge

Using Merge to load SCD

The first stage is to save the output rows from the ETL process to a staging table. We can then use Merge to process these into the live dimension.

We saw in the previous post how to either insert or update a record depending on whether it already exists. We can start with this and enhance as we go. First let’s figure out what logic we want to perform:

  • If the record doesn’t exist, create it
  • If the record does exist
    • Type 0 fields – ignore
    • Type 1 fields – update fields
    • Type 2 fields – terminate existing record, insert a new record
  • If the record exists in the dimension, but not in the updated source file – terminate record

The last option is rarely used in my experience, as it only works when you perform a full load of the dimension every time. It’s more common to process an incremental load, but I’ve included it here for completeness.

The main difference here, over a basic upsert, is the handling of type 2s; we need to perform two separate operations on the dimension for every incoming record. First we terminate the existing row, then we insert a new row.

The T-SQL Merge statement can only update a single row per incoming row, but there’s a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement.

We’ll therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row.

Let’s look at an example. Download the code here which will create the necessary tables and data to work on.

Main Merge Statement

We’ll start with a statement very similar to the previous post, with only a couple of minor amendments:

  • We include IsRowCurrent into the joining clause. We only ever want to update the current records, not the history.
  • DoB is removed from the WHEN MATCHED clause. We’re going to treat DoB as a type 1 change: if it’s updated then we assume it’s a correction rather than a new date of birth which should be tracked. We’ll deal with this Type 1 change later on.
  • The UPDATE statement in the WHEN MATCHED clause doesn’t change the fields, it only terminates the row by setting the IsRowCurrent and ValidTo fields (as well as LastUpdated):
 MERGE Customer        AS [Target]
 USING StagingCustomer AS [Source]
    ON Target.Email        = Source.Email
   AND Target.IsRowCurrent = 1
 WHEN MATCHED AND
     (
          Target.FirstName <> Source.FirstName
       OR Target.LastName  <> Source.LastName
       OR Target.Title     <> Source.Title
     )
     THEN UPDATE SET
        IsRowCurrent     = 0
       ,LastUpdated      = GETDATE()
       ,ValidTo          = GETDATE()
 WHEN NOT MATCHED BY TARGET
     THEN INSERT (
         FirstName
        ,LastName
        ,Title
        ,DoB
        ,Email
        ,LastUpdated
        ,IsRowCurrent
        ,ValidFrom
        ,ValidTo
       ) VALUES (
         Source.FirstName
        ,Source.LastName
        ,Source.Title
        ,Source.DoB
        ,Source.Email
        ,GETDATE()      --LastUpdated
        ,1              --IsRowCurrent
        ,GETDATE()      --ValidFrom
        ,'9999-12-31'   --ValidTo
       )
 WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
     THEN UPDATE SET
         IsRowCurrent = 0
        ,LastUpdated  = GETDATE()
        ,ValidTo      = GETDATE()
 ;

The ‘When Matched’ section includes extra clauses which define which fields should be treated as Type 2.

The ‘When Not Matched By Target’ section deals with inserting the new records which didn’t previously exist.

The ‘When Not Matched By Source’ section deals with terminating records which are no longer received from the source. Usually this section can be deleted, especially if the data is received incrementally.

*** UPDATE *** Thank you to Sergey (in the comments below) for pointing out an error in this code. I’ve now corrected the ‘WHEN NOT MATCHED BY SOURCE’ line to include ‘AND Target.IsRowCurrent = 1’. If this is omitted then all historic (IsRowCurrent=0) records are always updated with today’s date. We only want to terminate current records, not already terminated records.

We then add an OUTPUT clause to the end of the statement

 OUTPUT $action AS Action
       ,Source.*

The OUTPUT clause tells MERGE to generate an output dataset. This can consist of any of the Source table’s fields or the Target table’s fields. We can also specify $action as an extra field which will identify, for each row, whether it was dealt with via an INSERT, UPDATE or DELETE. For this purpose we only care about the UPDATEs, so we’ll use this to filter the records later on. We also only need the Source data, not the Target, so we’ll return Source.*

We wrap this up within an INSERT statement which will insert the new record for the changed dimension member.

INSERT INTO Customer
   ( FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,LastUpdated
    ,IsRowCurrent
    ,ValidFrom
    ,ValidTo
   )
SELECT
     FirstName
    ,LastName
    ,Title
    ,DoB
    ,Email
    ,GETDATE()    --LastUpdated
    ,1            --IsRowCurrent
    ,GETDATE()    --ValidFrom
    ,'9999-12-31' --ValidTo
FROM (
  MERGE Customer        AS [Target]
  USING StagingCustomer AS [Source]
     ON Target.Email        = Source.Email
    AND Target.IsRowCurrent = 1
  WHEN MATCHED AND
      (
           Target.FirstName <> Source.FirstName
        OR Target.LastName  <> Source.LastName
        OR Target.Title     <> Source.Title
      )
      THEN UPDATE SET
         IsRowCurrent     = 0
        ,LastUpdated      = GETDATE()
        ,ValidTo          = GETDATE()
  WHEN NOT MATCHED BY TARGET
      THEN INSERT (
          FirstName
         ,LastName
         ,Title
         ,DoB
         ,Email
         ,LastUpdated
         ,IsRowCurrent
         ,ValidFrom
         ,ValidTo
        ) VALUES (
          Source.FirstName
         ,Source.LastName
         ,Source.Title
         ,Source.DoB
         ,Source.Email
         ,GETDATE()      --LastUpdated
         ,1              --IsRowCurrent
         ,GETDATE()      --ValidFrom
         ,'9999-12-31'   --ValidTo
        )
  WHEN NOT MATCHED BY SOURCE AND Target.IsRowCurrent = 1
      THEN UPDATE SET
          IsRowCurrent = 0
         ,LastUpdated  = GETDATE()
         ,ValidTo      = GETDATE()
  OUTPUT $action AS Action
        ,[Source].*
) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
  AND Email IS NOT NULL
;

Note that the output clause is restricted so we only return the ‘UPDATE’ rows. As we’re using the email field as the business key, we should also ensure that we only insert records which have a valid email address.

So Type 2 changes have now been dealt with, by terminating the old version of the record and inserting the new version. Type 0 fields are just left out of the entire process, so are taken care of by just ignoring them. Therefore the only thing left is to manage the Type 1 fields.

We have two options here:

  • Update all historical records to the new value
  • Update only the current record to the new value

These are obviously only valid when there is a mix of type 1 and 2 attributes. If we’re just looking at Type 1 then there will be no historical records. In a true Type 1 scenario the first option is correct. All history (of Type 1 fields) is lost. The second option can be a valid option when it would be beneficial to keep a limited history of Type 1 fields.

This would mean that historical records created by Type 2 changes also keep a record of the Type 1 attribute values that were valid at the time the record was terminated. It doesn’t keep a full history of Type 1 attributes but sometimes this can be useful.

  UPDATE C
      SET DoB         = SC.DoB
         ,LastUpdated = GETDATE()
  FROM Customer C
     INNER JOIN StagingCustomer SC
          ON C.Email        =  SC.Email
       --AND C.IsRowCurrent =  1      --Optional
         AND C.DoB          <> SC.DoB

This block of code updates the Type 1 attributes (in this case, DoB). The commented-out IsRowCurrent check is optional, depending on whether you want to update only the current record or all records.

So in one SQL statement we’ve managed the entire load process of all Type 2 SCDs, and with one more we’ve also managed all Type 1 fields.

I’ve been performing a large number of performance tests on loading Type 2s using various methods (another blog post to follow, as well as a talk that I’ll be presenting at SQL Bits X), and the performance of this method is very fast. In fact there’s very little difference in performance between using this method and using the SSIS Merge Join component.

This is now my preferred approach to loading Type 2 SCDs. Slightly faster methods may be available, but as we’ll see in later blog posts, this is such a quick method to implement, as well as being incredibly flexible as it can be controlled entirely from metadata.

Long live the Merge statement!

Frog-Blog Out

© Alex Whittles, Purple Frog Systems Ltd

One of my favourite features of SSIS is the script component, and I know I’m not alone. Why? Because it brings the entire might of the .NET framework to SSIS, providing C# (in SQL 2008 onwards) and VB.NET extensibility for all those times where SSIS doesn’t quite have enough functionality out of the box.

Such as?

Well, a problem I’ve come across a number of times is string parsing: trying to search for and extract a specific pattern of characters from a larger text string. I’ve seen developers build crazy convoluted expressions in the derived column transform, some of which are very impressive in their complexity! This is a bad thing, not a good one; although it shows a level of prowess in building SSIS expressions (not the most intuitive expression language!), it becomes a support nightmare for other developers.

Take for example extracting a house number from the first line of an address: we want to convert “29 Acacia Road” into “29”.

Or extracting a version number from a product, converting “iPhone 4G, IOS 4.3.2, black” into “4.3.2”.

Or extracting the html page from a URL, converting “http://www.wibble.com/folder/page.aspx” into “page.aspx”.

Regular Expressions

The .NET framework includes full support for regular expressions (Regex), in the System.Text.RegularExpressions namespace. Regular expressions provide an incredibly powerful way of defining and finding string patterns. They take some getting used to, but once you get the hang of them you can unleash their power in your SSIS dataflow pipelines.

To find out more about regular expressions, look at the following links

 

Let’s look at an example

Let’s take our first example from above, extracting a house number, converting “29 Acacia Road” into “29”.

The first thing we need to do is define our Regex search pattern. In this case we know that it must be at the start of the string, and must be an integer, with any number of characters 0-9.

The pattern for this is “^[0-9]+”, which is broken down as
    ^ means the start of the line
    [0-9] means any number
    + means 1 or more of the preceding item.
    i.e. 1 or more integers at the start of the line.

What if we wanted this to also cope with a single letter after the number? i.e. “221b Baker Street”

We can add “[A-Za-z]?” to our pattern, in which
    [A-Za-z] means any character A-Z in either upper or lower case
    ? means 0 or 1 occurrences of this

We should also add “\b” to the end of this, which indicates a word boundary. This means that 221b should be a whole word, not part of a larger sequence “221BakerSt”. We can wrap up the [A-Za-z]\b together into brackets so that the ? applies to the combination, so that any single letter must be the end of a word or it will be ignored. In this way “221BakerSt” will return 221, as will “221 Baker St”, whereas “221B Baker St” will return “221B”.

So our new pattern is “^[0-9]+([A-Za-z]\b)?”

You’ve probably gathered by now that regular expressions can get quite complicated. I’m not going to go into any more detail about them here, but hopefully this gives you some idea of what they can do. There’s plenty of reading on the web if you want to know more. You should also make use of the Regex expression tester in the link above – it will save you lots of debugging!

How do we use Regular Expressions in SSIS?

Well it turns out this is the easy bit, with the help of the script component.

Step 1 – Add a script component into your data flow pipeline, configure it as a data transform. I’m using C#, but you can use VB.NET if you want

Step 2 – Give the script access to the input and output columns

Open the script component and select the input field from the “Input Columns” screen, in this case “Address1”. This can be ReadOnly.

Go to the “Inputs and Outputs” screen and add an output column to “Output 0”. We want to set the datatype to string, length 10. This new field will contain the results of our Regex pattern matching.

Step 3 – Create the script code

Click on “Edit Script” on the Script screen which will open up Visual Studio.

Add a using directive for System.Text.RegularExpressions at the top of the script:


      using System.Text.RegularExpressions;

Then place the necessary code in the Input0_ProcessInputRow function.


    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        //Replace each \ with \\ so that C# doesn't treat \ as escape character
        //Pattern: Start of string, any integers, 0 or 1 letter, end of word
        string sPattern = "^[0-9]+([A-Za-z]\\b)?";
        string sString = Row.Address1_IsNull ? "" : Row.Address1; //Use an empty string if the column is NULL (reading a NULL column value directly would throw)
 
        //Find any matches of the pattern in the string
        Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
        //If a match is found
        if (match.Success)
            //Return the first match into the new
            //HouseNumber field
            Row.HouseNumber = match.Groups[0].Value;
        else
            //If not found, leave the HouseNumber blank
            Row.HouseNumber = "";
    }




When you save and exit the script, any component downstream of the script component will have access to the new HouseNumber field.

Frog-Blog-Out

© Alex Whittles, Purple Frog Systems Ltd

This isn’t a technical blog post of my own, but a shout out to Rob Farley and an excellent blog post explaining how to use SQL’s OPTION (FAST x) hint. He explains how you can speed up an SSIS data flow by slowing down the source query. It may seem illogical at first, but you’ll understand after you go and read Rob’s post!
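As a quick flavour of the idea, the hint just goes on the end of the source query used by the data flow’s source component – something like this (the table and row count are purely illustrative):

    -- OPTION (FAST 10000) asks the optimiser for a plan that returns the first
    -- 10,000 rows as quickly as possible, so the SSIS pipeline can start
    -- processing sooner instead of waiting for the whole result set
    SELECT  OrderID,
            CustomerID,
            OrderDate,
            OrderValue
    FROM    dbo.Orders
    OPTION  (FAST 10000)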

Read Rob’s post here: Speeding up SSIS using OPTION (FAST)

© Alex Whittles, Purple Frog Systems Ltd

SQL Server Integration Services (SSIS) packages are used in numerous scenarios for moving data from A to B. Often they are developed and tested against a cut-down, often static, subset of data. One of the problems with this is that, yes, you’re testing the functionality of the package as it’s being developed, but there’s no way to determine whether the performance will scale up to a full-size production environment. This level of testing is more often than not ignored, resulting in packages being deployed to live which just can’t cope with the data volume, bringing down the load process.

We can divide performance checking into two:

  1. Load testing pre deployment
  2. Continual monitoring and projections

It’s vital to undertake performance load testing of packages before they’re deployed, or at least review the source queries and SSIS components and structure to ensure there’s nothing that’s likely to cause an exponentially increasing runtime. There are loads of blog posts about SSIS performance tuning so I won’t go into that here.

What I did want to talk about here was the importance of continual monitoring. A package that runs fine today may grind to a halt in a year’s time if the live data volume continues to increase. How do you check this, and how do you project data growth into the future to predict performance problems that haven’t happened yet?

The first step is to start tracking the time taken to run each package, and store this to a table. As a rule I always build this level of logging into my template packages when I’m defining the SSIS ETL framework. Again, there are heaps of articles on different ways to do this; check out one of Jamie’s gems as a starting point. The key outcome is that you end up with a start time and end time (and hence a duration) of each package every time it runs. If you don’t have any custom logging, you can always hack together the data from the sysssislog table if you’ve enabled it (and I hope you have…!).
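If you are falling back on sysssislog, something like this will hack a start time, end time and duration out of it per package execution – a rough sketch, assuming the SQL Server 2008 dbo.sysssislog table and that the PackageStart and PackageEnd events are included in your logging configuration:

    SELECT  source                                         AS PackageName,
            executionid                                    AS ExecutionId,
            MIN(starttime)                                 AS StartTime,
            MAX(endtime)                                   AS EndTime,
            DATEDIFF(minute, MIN(starttime), MAX(endtime)) AS DurationMinutes
    FROM    dbo.sysssislog
    WHERE   event IN ('PackageStart', 'PackageEnd')
    GROUP BY source, executionid
    ORDER BY MIN(starttime)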

Once you have the raw data available, leave the package to run for a month or two and then analyse the results in Excel to perform a simple projection. Just copy the data into Excel in a format similar to this (it doesn’t matter if you have duplicate dates):

Date         Duration
18/08/2010   17
18/08/2010   16
19/08/2010   17
20/08/2010   18
21/08/2010   17

And then create a scatter chart.

Format the X axis and make sure it’s set to be a date. You should end up with a chart similar to this.

Add a trend line to the chart by right clicking on one of the data points and click ‘add trendline’. Hopefully the trendline will be linear so choose that. If your data looks exponential then you really need to re-assess your package urgently!

There’s a nifty feature of Excel trendlines that allows you to forecast the trendline forward by x periods. If you set this to 365 it will project the package duration forward for a year. The reliability of this trendline will increase as the volume of sample data increases. i.e. if you run your packages for 3 months, you’ll be able to make better predictions than if you only run them for 2 weeks.

This clearly shows that although the package is currently taking 24 minutes to run, with the current data growth it will be taking approximately an hour in a year’s time.
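If you’d rather do the projection in T-SQL straight off the logging table instead of in Excel, a simple least-squares trend line can be calculated in a query. This is a rough sketch, assuming an illustrative logging table dbo.PackageLog(RunDate, DurationMinutes) matching the Date/Duration layout above:

    WITH d AS (
        SELECT  CAST(DATEDIFF(day, '20100101', RunDate) AS float) AS x,
                CAST(DurationMinutes AS float)                    AS y
        FROM    dbo.PackageLog
    ), s AS (
        SELECT  CAST(COUNT(*) AS float) AS n,
                SUM(x)   AS sx,
                SUM(y)   AS sy,
                SUM(x*x) AS sxx,
                SUM(x*y) AS sxy,
                MAX(x)   AS mx
        FROM    d
    )
    SELECT  (n*sxy - sx*sy) / (n*sxx - sx*sx)                        AS SlopeMinutesPerDay,
            -- projected duration = intercept + slope * (latest x + 365 days)
            ((sy - sx * (n*sxy - sx*sy) / (n*sxx - sx*sx)) / n)
              + ((n*sxy - sx*sy) / (n*sxx - sx*sx)) * (mx + 365)     AS ProjectedDurationIn1Year
    FROM    s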

When you do this for each package, you can quickly build up a picture of when you’re likely to run into trouble, and use this as justification for development resource to prevent the problems before they happen.

Frog-Blog Out

© Alex Whittles, Purple Frog Systems Ltd