Forecasting the Performance of SSIS packages
SQL Server Integration Services (SSIS) packages are used in numerous scenarios for moving data from A to B. Often they are developed and tested against a cutdown, often static, subset of data. One of the problems with this is that yes you’re testing the functionality of the package as it’s being developed, but there’s no way to determine whether the performance will scale up to a full size production environment. This level of testing is more often than not ignored, resulting in packages being deployed to live which just can’t cope with the data volume, bringing down the load process.
We can divide performance checking into two:
- Load testing pre deployment
- Continual monitoring and projections
It’s vital to undertake performance load testing of packages before they’re deployed, or at least review the source queries and SSIS components and structure to ensure there’s nothing that’s likely to cause an exponentially increasing runtime. There are loads of blog posts about SSIS performance tuning so I won’t go into that here.
What I did want to talk about here was the importance of continual monitoring. A package that runs fine today may grind to a halt in a year’s time if the live data volume continues to increase. How do you check this, and how do you project data growth into the future to predict performance problems that haven’t happened yet?
The first step is to start tracking the time taken to run each package, and store this to a table. As a rule I always build this level of logging into my template packages when I’m defining the SSIS ETL framework. Again, there are heaps of articles on different ways to do this, check out one of Jamie’s gems as a starting point. The key outcome is that you end up with a start time and end time (and hence a duration) of each package every time it runs. If you don’t have any custom logging, you can always hack together the data from the sysssislog table if you’ve enabled it (and I hope you have..!).
Once you have the raw data available, leave the package to run for a month or two and then analyse the results in Excel to perform a simple projection. Just copy the data into Excel in a format similar to this. It doesn’t matter if you have duplicate dates
And then create a scatter chart
Format the X axis and make sure it’s set to be a date. You should end up with a chart similar to this.
Add a trend line to the chart by right clicking on one of the data points and click ‘add trendline’. Hopefully the trendline will be linear so choose that. If your data looks exponential then you really need to re-assess your package urgently!
There’s a nifty feature of Excel trendlines that allows you to forecast the trendline forward by x periods. If you set this to 365 it will project the package duration forward for a year. The reliability of this trendline will increase as the volume of sample data increases. i.e. if you run your packages for 3 months, you’ll be able to make better predictions than if you only run them for 2 weeks.
This clearly shows that although the package is currently taking 24 minutes to run, with the current data growth it will be taking approximately an hour in a year’s time.
When you do this for each package, you can quickly build up a picture of when you’re likely to run into trouble, and use this as justification for development resource to prevent the problems before they happen.