Well after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.
During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions. Specifically loading Type 2 SCDs using SSIS.
For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.
I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:
- PDF: Performance comparison of techniques to load Type 2 slowly changing dimensions in a Kimball style data warehouse using SSIS
The charts below shows the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.
The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.
Full test results and analysis in the PDF download above.
My previous few posts show how using a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes the performance is equivalent to more traditional approaches.
Hope this is of use. If you want to know a bit more without reading the full 99 pages & 23k words (who could blame you?!), then my SQLBits talk video is now on-line here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [When this post was written] but should be up soon. Keep checking on the BI PASS Chapter website.
One comment on “MSc Dissertation – Performance of Loading SCDs in SSIS”
Pingback: Leeds User Group Slide Decks - Martin Bell UK SQL Server MVP