Integration Services
One of my favourite features of SSIS is the script component, and I know I’m not alone. Why? Because it brings the entire might of the .NET framework to SSIS, providing C# (in SQL 2008 onwards) and VB.NET extensibility for all those times where SSIS doesn’t quite have enough functionality out of the box.
Such as?
Well a problem I’ve come across a number of times is string parsing. Trying to search for and extract a specific pattern of characters from a larger text string. I’ve seen developers build crazy convoluted expressions in the derived column transform, some of which are very impressive in their complexity! This is a bad thing not good, although it shows a level of prowess in building SSIS expressions (not the most intuitive expression language!), it becomes a support nightmare for other developers.
Take for example extracting a house number from the first line of an address, we want to convert “29 Acacia Road” into “29”
Or extracting a version number from a product, converting “iPhone 4G, IOS 4.3.2, black” into “4.3.2”
Or extract the html page from a URL, converting “http://www.wibble.com/folder/page.aspx” into “page.aspx”
Regular Expressions
The .NET framework includes full support for regular expressions (Regex), in the System.Text.RegularExpressions namespace. Regex provide an incredibly powerful way of defining and finding string patterns. They take some getting used to, but once you get the hang of them you can unleash their power in your SSIS dataflow pipelines.
To find out more about regular expressions, look at the following links
Let’s look at an example
Let’s take our first example from above, extracting a house number, converting “29 Acacia Road” into “29”.
The first thing we need to do is define our Regex search pattern. In this case we know that it must be at the start of the string, and must be an integer, with any number of characters 0-9.
The pattern for this is “^[0-9]+”, which is broken down as
^ means the start of the line
[0-9] means any number
+ means 1 or more of the preceding item.
i.e. 1 or more integers at the start of the line.
What if we wanted this to also cope with a single letter after the number? i.e. “221b Baker Street”
We can add “[A-Za-z]?” to our pattern, in which
[A-Za-z] means any character A-Z in either upper or lower case
? means 0 or 1 occurrences of this
We should also add “\b” to the end of this, which indicates a word boundary. This means that 221b should be a whole word, not part of a larger sequence “221BakerSt”. We can wrap up the [A-Za-z]\b together into brackets so that the ? applies to the combination, so that any single letter must be the end of a word or it will be ignored. In this way “221BakerSt” will return 221, as will “221 Baker St”, whereas “221B Baker St” will return “221B”.
So our new pattern is “^[0-9]+([A-Za-z]\b)?”
You’ve probably gathered by now that regular expressions can get quite complicated. I’m not going to go into any more detail about them here, but hopefully this gives you some idea of what they can do. There’s plenty of reading on the web if you want to know more. You should also make use of the Regex expression tester in the link above – it will save you lots of debugging!
How do we use Regular Expressions in SSIS?
Well it turns out this is the easy bit, with the help of the script component.
Step 1 – Add a script component into your data flow pipeline, configure it as a data transform. I’m using C#, but you can use VB.NET if you want
Step 2 – Give the script access to the input and output columns
Open the script component and select the input field from the “Input Columns” screen, in this case “Address1”. This can be ReadOnly.

Go to the “Inputs and Outputs” screen and add an output column to “Output 0”. We want to set the datatype to string, length 10. This new field will contain the results of our Regex pattern matching.

Step 3 – Create the script code
Click on “Edit Script” on the Script screen which will open up Visual Studio.
Add a reference to System.Text.RegularExpressions at the top of the script
using System.Text.RegularExpressions;
Then place the necessary code in the Input0_ProcessInputRow function.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
//Replace each \ with \\ so that C# doesn't treat \ as escape character
//Pattern: Start of string, any integers, 0 or 1 letter, end of word
string sPattern = "^[0-9]+([A-Za-z]\\b)?";
string sString = Row.Address1 ?? ""; //Coalesce to empty string if NULL
//Find any matches of the pattern in the string
Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
//If a match is found
if (match.Success)
//Return the first match into the new
//HouseNumber field
Row.HouseNumber = match.Groups[0].Value;
else
//If not found, leave the HouseNumber blank
Row.HouseNumber = "";
}

When you save and exit the script, any component downstream of the script component will have access to the new HouseNumber field.

Flog-Blog-Out
This isn’t a technical blog post of my own, but a shout out to Rob Farley and an excellent blog post explaining how to use SQL’s OPTION (FAST x) hint. He explains how you can speed up an SSIS data flow by slowing down the source query. It may seem illogical at first, but you’ll understand after you go and read Rob’s post!
Read Rob’s post here: Speeding up SSIS using OPTION (FAST)
SQL Server Integration Services (SSIS) packages are used in numerous scenarios for moving data from A to B. Often they are developed and tested against a cutdown, often static, subset of data. One of the problems with this is that yes you’re testing the functionality of the package as it’s being developed, but there’s no way to determine whether the performance will scale up to a full size production environment. This level of testing is more often than not ignored, resulting in packages being deployed to live which just can’t cope with the data volume, bringing down the load process.
We can divide performance checking into two:
- Load testing pre deployment
- Continual monitoring and projections
It’s vital to undertake performance load testing of packages before they’re deployed, or at least review the source queries and SSIS components and structure to ensure there’s nothing that’s likely to cause an exponentially increasing runtime. There are loads of blog posts about SSIS performance tuning so I won’t go into that here.
What I did want to talk about here was the importance of continual monitoring. A package that runs fine today may grind to a halt in a year’s time if the live data volume continues to increase. How do you check this, and how do you project data growth into the future to predict performance problems that haven’t happened yet?
The first step is to start tracking the time taken to run each package, and store this to a table. As a rule I always build this level of logging into my template packages when I’m defining the SSIS ETL framework. Again, there are heaps of articles on different ways to do this, check out one of Jamie’s gems as a starting point. The key outcome is that you end up with a start time and end time (and hence a duration) of each package every time it runs. If you don’t have any custom logging, you can always hack together the data from the sysssislog table if you’ve enabled it (and I hope you have..!).
Once you have the raw data available, leave the package to run for a month or two and then analyse the results in Excel to perform a simple projection. Just copy the data into Excel in a format similar to this. It doesn’t matter if you have duplicate dates
| Date | Duration |
| 18/08/2010 | 17 |
| 18/08/2010 | 16 |
| 19/08/2010 | 17 |
| 20/08/2010 | 18 |
| 21/08/2010 | 17 |
| … | |
And then create a scatter chart

Format the X axis and make sure it’s set to be a date. You should end up with a chart similar to this.

Add a trend line to the chart by right clicking on one of the data points and click ‘add trendline’. Hopefully the trendline will be linear so choose that. If your data looks exponential then you really need to re-assess your package urgently!

There’s a nifty feature of Excel trendlines that allows you to forecast the trendline forward by x periods. If you set this to 365 it will project the package duration forward for a year. The reliability of this trendline will increase as the volume of sample data increases. i.e. if you run your packages for 3 months, you’ll be able to make better predictions than if you only run them for 2 weeks.

This clearly shows that although the package is currently taking 24 minutes to run, with the current data growth it will be taking approximately an hour in a year’s time.
When you do this for each package, you can quickly build up a picture of when you’re likely to run into trouble, and use this as justification for development resource to prevent the problems before they happen.
Frog-Blog Out
Purple Frog spent a very interesting day at Microsoft last week, at one of their many events promoting the launch of SQL Server 2008 R2. Rafal Lukewiecki presented an entertaining (as always!) and informative series of talks covering the release, focusing on the enhanced Business Intelligence tools available.
The primary changes to note are
- Power Pivot – An in memory, client side add-in to Excel, that allows users to create virtual cubes on their desktop and analyse over 100m records of data virtually instantly
- DAX – A new expression language, designed for non-technical (probably more realistically, semi-technical) users to extend pivot tables and power pivot tables without having to learn MDX
- Report Components – In a report consisting of a couple of tables, a chart and a few gauges (gauges, sparklines & maps are all new features of SSRS), you can save each element as a component and re-use it in different reports. This should result in much less duplication of work.
- Report Builder 3 – A thin-client tool allowing end users to create Reporting Services reports. This is a big enhancement over its predecessor s it is finally fully compatible with reports created in the Business Intelligence Development Studio (BIDS), including report components.
- Master Data Services – A centralised tool and database intended to provide governance of your organisation’s master data (centralised list of products, fiscal calendar, regions etc.).
The enhancements to Reporting Services (SSRS) are very welcome, and should be of huge benefit to anyone either currently using SSRS or considering using it. I firmly believe that there are no comparable web based reporting engines that even come close for SME organisations when looking at the whole picture including cost of implementation, ease of use, flexibility and capability.
Master Data Services as a concept has been around for a long time, but there has never been a tool available to organisations to effectively implement it. This is Microsoft’s first proper stab at delivering a workable solution, and although I’m a big fan of the concept, and have no doubt of its benefit to a SME, I’m yet to be convinced that the tool is ready for a large scale corporate environment. Time will tell how scalable and manageable the system is, and credit has to go to Microsoft for starting the ball rolling.
The most impressive addition is without a doubt PowerPivot. In a nutshell, it’s a user defined OLAP cube wrapped up within Excel 2010, running entirely in memory on a user’s workstation. If you’ve not yet played with it or seen a demo, I’ll try and elaborate for you… Think about loading Excel with 1 million rows, and then imagine sorting and filtering a number of those columns [cue going out to lunch whilst waiting for Excel to catch up]. With PowerPivot, you can sort and filter over 100 million rows of data almost in an instant – it’s very impressive indeed!
That’s the snazzy demo bit, but to reduce it to a glorified spreadsheet is very harsh indeed. It allows a user to import multiple data sources and combine them together into a single dimensional data model, PowerPivot will create your own personal cube, without you having to build a warehouse, without knowing anything about MDX, dimension hierarchies, attribute relationships, granularity etc. etc.
Microsoft’s vision and reason for creating this tool is self-service BI, allowing users to create their own cubes, data analysis environments and reporting systems. And this is where I start to have a problem…
I can’t remember the last time I designed a data warehouse, where I did not find significant data quality problems, conflicting data, missing data, duplicated data etc.. I also find it hard to think of a situation where an end user (even a power user) is sufficiently clued up about the intricacies of a source OLTP database to be able to extract the right data and know what to do with it. Or if they are, a dozen other people in different departments have a different idea about how things work, resulting in many different versions of the truth.
I’m therefore (for now!) sticking with the opinion that it is still absolutely vital for an organisation to provide a clean, consistent, dimensionally modelled data warehouse as the basis for their BI/MI infrastructure. Tools like PowerPivot then sit very nicely on top to provide an incredibly powerful and beneficial user experience, but to try and use the emergence of self-service BI tools to usher in a new ‘non-data warehouse’ era is a very dangerous route which I hope people will avoid.
In summary – this release brings with it a fantastic host of new tools, but with great power comes great responsibility…

I specialise in designing and implementing SQL Server business intelligence solutions,
and this is my blog! Just a collection of thoughts, techniques and ramblings on SQL Server, Cubes, Data Warehouses, MDX, DAX and whatever else comes to mind.
