Can SSAS Multidimensional be faster than SSAS Tabular for distinct counts on large datasets?

[Image: MOLAP/Violin summary chart]

We’ve all seen how fast the SSAS Tabular model can work – it’s an in-memory, heavily compressed analytics database. However, you may have seen my previous posts on Tabular and NUMA, which show that at the moment Tabular is limited to a single CPU. This limits both its scalability and its maximum performance.

The situation

A large project I’m working on, for a multinational telecoms provider, has a query performance problem with its current Tabular model. The model is fine for smaller datasets but does not scale well, specifically with distinct count operations. The distinct count is the single most critical factor for performance in this project. With in the region of 50m distinct customers, this is a sizeable problem. After working with Microsoft to investigate the Tabular performance we identified a bug in SSAS, which has been resolved in SQL 2012 SP1 CU9, but the model still doesn’t run fast enough on the limited hardware.

The Multidimensional model is NUMA aware (although how well it is supported depends on your workload), so it can make better use of hardware, but we all know how poorly MOLAP scales with distinct counts, due to the storage engine IO bottleneck. So what is the answer?

What about Solid State?

The reason the multidimensional model often experiences performance degradation for distinct count operations is the heavy disk IO required. Aggregations can’t be used effectively, so the SSAS storage engine has to work through the entire dataset on disk. With the speed of normal disks, especially when on a SAN, this is not great.

There are now a number of solid state storage options available that provide enterprise grade storage with some very impressive performance.

So the question is, can a multidimensional cube running on solid state storage outperform an equivalent in-memory Tabular cube?

The tests

I ran various query performance tests with (MOLAP) cubes from 0.5TB to 2TB. The results in this post focus entirely on the 0.5TB tests, with an identical dataset and equivalent test queries on a Tabular cube.

A big thank you to the following:

  • Millicom – for the test dataset, and for allowing me to publish the tests results here
  • Violin – for the use of a storage array for the test
  • Microsoft – for the loan of a server in the MTC (Microsoft Technology Center) in Reading.

The equipment for the multidimensional cube:

  • HP DL580
  • 4 x Intel Xeon E7-4850 2GHz
  • 40 cores, hyperthreaded to 80
  • 256GB RAM, 1067MHz
  • Violin 6616 SLC, 1m IOPS, 4Gbps
  • SQL Server 2012 SP1 CU9
  • Distinct count measure groups matrix partitioned by month, and 60 non-overlapping buckets of CustomerKey
  • Basic usage based optimisation applied to aggregations. There is a high likelihood this could be improved further
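
As a rough illustration of the matrix partitioning described in the list above, the following Python sketch enumerates one partition per month and CustomerKey bucket. The key range (1 to 60,000,000), the month labels, the partition naming and the function names are all assumptions made for illustration; the real bucket boundaries used in the cube are not given in this post.

```python
from itertools import product


def customer_buckets(min_key, max_key, bucket_count):
    """Split [min_key, max_key] into contiguous, non-overlapping key ranges."""
    total = max_key - min_key + 1
    base, extra = divmod(total, bucket_count)
    buckets = []
    start = min_key
    for i in range(bucket_count):
        # Spread any remainder over the first `extra` buckets.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        buckets.append((start, end))
        start = end + 1
    return buckets


def partition_matrix(months, buckets):
    """Build one partition definition per (month, CustomerKey range) pair."""
    return [
        {
            "name": f"DistinctCount_{month}_B{idx:02d}",  # hypothetical naming
            "month": month,
            "key_from": lo,
            "key_to": hi,
        }
        for month, (idx, (lo, hi)) in product(months, enumerate(buckets, start=1))
    ]


# Hypothetical inputs: six monthly snapshots and keys 1..60,000,000.
months = ["2013-01", "2013-02", "2013-03", "2013-04", "2013-05", "2013-06"]
buckets = customer_buckets(1, 60_000_000, 60)
partitions = partition_matrix(months, buckets)

print(len(partitions), "partitions")
print(partitions[0])
```

Keeping the CustomerKey ranges non-overlapping means each partition can be scanned independently, so no customer is counted in more than one bucket for a given month.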

The equipment for the Tabular cube

Note that the different number of CPUs is fair: Tabular is not able to benefit from multiple CPUs, whereas Multidimensional is. The factors that are unequal are the threads per CPU (16 vs 20) and memory speed (666MHz vs 1067MHz); please bear this in mind when looking at the performance comparison results.

The dataset is 6 monthly snapshots of 60m customers in two primary fact tables, with 300m and 1.4bn rows respectively.

The 7 query tests cover a range of workloads from very simple to very complex. Queries 1 and 2 do not include distinct count measures; all other queries include a distinct customer count from one of the primary fact tables, with varying levels of complexity, granularity and result set size. Tests 1-4 are manually created MDX queries, and tests 5-7 are real-world Excel workbooks provided by users, each containing 1-6 large pivot tables.

Each test was run on a cold SSAS cache, with the Windows file cache also cleared before every test. Every test was run 3 times, with the average taken.
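
The averaging step can be sketched in a few lines of Python. The durations below are made-up placeholders, not the results from these tests, and the function name is an assumption; the real timings were captured with Profiler.

```python
from statistics import mean


def summarise_runs(runs_ms):
    """Return the mean duration per test from repeated cold-cache runs."""
    return {test: mean(durations) for test, durations in runs_ms.items()}


# Hypothetical durations in milliseconds, three cold-cache runs per test.
# These are NOT the measurements reported in this post.
runs_ms = {
    "Test 1": [120, 130, 125],
    "Test 4": [61000, 59000, 60000],
}

averages = summarise_runs(runs_ms)
for test, avg in averages.items():
    print(f"{test}: {avg:.0f} ms average over {len(runs_ms[test])} runs")
```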

Test scenarios:

  • Tabular, using the fastest performance measured from all previous configuration tests
  • Multidimensional, using all CPUs and local RAID 5 disk storage
  • Multidimensional, using a single CPU (single NUMA node) and Violin storage
  • Multidimensional, using all CPUs and Violin storage

The Results

[Image: results table]

And when plotted on a chart…

[Image: results chart]

The first 3 tests are so quick that they don’t show up on the above chart, but you can see these results in the table above.

Conclusion

It’s clear from these tests that solid state storage can breathe new life into an IO-constrained multidimensional cube. This applies to any case where you’re making heavy use of distinct counts over a large, high cardinality dataset. In this case the average performance gain over the same disk based cube was 2.6x.

Multidimensional on solid state storage can be on average 1.9x faster than the Tabular model, when working with large high cardinality distinct count operations.

Although other research and evidence suggest that Multidimensional cubes don’t perform well with multiple NUMA nodes, these tests show that if your workload is very heavily constrained by the storage engine, you can gain significant performance from multiple CPUs/NUMA nodes.

And Finally…

As well as basic timing tests using Profiler, I also monitored the CPU and disk throughput in Perfmon. The following two screenshots show the difference between the multidimensional cube on disk and then the same cube on Violin. This is for test 4, which showed the most significant performance improvement from disk to solid state.

[Image: Perfmon trace, multidimensional cube on disk]

[Image: Perfmon trace, multidimensional cube on Violin]

These traces show test 4, which has the following query structure:

SELECT {   [Measures].[Fully Additive Measure 1]
         , [Measures].[Fully Additive Measure 2]
         , [Measures].[Fully Additive Measure 3]
         , [Measures].[Distinct Measure 4]} ON COLUMNS ,
NON EMPTY {[Dimension 1].[Attribute 1].[Attribute 1].MEMBERS  --6 members
         * [Dimension 2].[Attribute 2].[Attribute 2].MEMBERS  --12 members
         * [Dimension 3].[Attribute 3].[Attribute 3].MEMBERS  --5 members
         * DESCENDANTS ([Date].[Calendar].[Year].&[2013], [Date].[Calendar].[Month])} ON ROWS  --6 members
FROM [Cube]

This returns 700 distinct tuples from the crossjoin, with an average of 400k distinct customers per tuple.
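
To put those numbers in context, the member counts noted in the query comments give the size of the full crossjoin before NON EMPTY is applied. The short Python sketch below is only a back-of-envelope calculation; the variable names are mine, and the final figure is a rough sense of scale (a customer can appear in more than one tuple), not a measurement.

```python
from math import prod

# Member counts taken from the comments in the test 4 query.
axis_members = {
    "Dimension 1 / Attribute 1": 6,
    "Dimension 2 / Attribute 2": 12,
    "Dimension 3 / Attribute 3": 5,
    "Date / Month (2013)": 6,
}

potential_tuples = prod(axis_members.values())  # 6 * 12 * 5 * 6 = 2160
non_empty_tuples = 700                          # returned by NON EMPTY
avg_customers_per_tuple = 400_000

fill_ratio = non_empty_tuples / potential_tuples
customer_counts_summed = non_empty_tuples * avg_customers_per_tuple

print(f"Potential tuples: {potential_tuples}")
print(f"Non-empty share:  {fill_ratio:.1%}")
print(f"Sum of per-tuple distinct counts: {customer_counts_summed:,}")
```

Roughly a third of the possible combinations contain data, and the per-tuple distinct counts add up to around 280 million, which gives an idea of how much work the storage engine has to do for this one query.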

This shows the potential benefit of maximising the storage IO throughput. In the first image, the IO throughput being constrained to under 100Mbps means that only 3% of the CPU can be utilised. In the second trace, by providing up to 1.7Gb per second throughput, the CPU is allowed to work at an average of 60% for the duration of the query, with a clear impact on the query performance.

These charts also show the reduced IO latency (IO response time). The average IO response time in the first trace is between 40 and 60ms for the disk IO, whereas for the Violin array it barely registers on the scale of the chart. I found the IO response for the Violin tests was averaging 450μsec. The SLC unit runs at ~200μsec latency at the 4k chunk size, so the 8k chunks we’re asking for fully complete in the 450μsec that we saw.

There are plenty more details and tests that I’m running; I may post further details if I get time. Until then…
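
The throughput and latency figures above can be turned into a rough comparison with a little arithmetic. This Python sketch assumes the throughput figures are in bytes per second (100 MB/s and 1.7 GB/s) and that every read is an 8k chunk; both are my assumptions, and the numbers are upper-bound estimates rather than measured request rates.

```python
CHUNK_BYTES = 8 * 1024  # 8k read size, as mentioned above


def requests_per_second(throughput_bytes_per_s, chunk_bytes=CHUNK_BYTES):
    """Upper bound on read requests per second at a given throughput."""
    return throughput_bytes_per_s / chunk_bytes


# Assumption: the traces report bytes per second, not bits per second.
disk_throughput = 100e6     # "under 100" on local disk
violin_throughput = 1.7e9   # "up to 1.7" on the Violin array

disk_rps = requests_per_second(disk_throughput)
violin_rps = requests_per_second(violin_throughput)
throughput_ratio = violin_throughput / disk_throughput

# Latency figures quoted above: 40-60 ms on disk, 450 microseconds on Violin.
violin_latency_s = 450e-6
latency_ratio_low = 0.040 / violin_latency_s
latency_ratio_high = 0.060 / violin_latency_s

print(f"Disk:   up to {disk_rps:,.0f} reads/s")
print(f"Violin: up to {violin_rps:,.0f} reads/s")
print(f"Throughput ratio: about {throughput_ratio:.0f}x")
print(f"Latency ratio: about {latency_ratio_low:.0f}x to {latency_ratio_high:.0f}x")
```

Under these assumptions the array is delivering in the region of 200,000 8k reads per second, with each read completing roughly 90 to 130 times faster than on the disk based configuration.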

Frog-Blog-Out

Update #1 06/06/2014:
In the interests of completeness and fairness, I subsequently re-ran the Tabular model performance tests on the same (Microsoft supplied) server as the MOLAP model. The only difference was that, to maximise performance, I restricted the server to a single CPU/NUMA node.
Somewhat surprisingly, the performance difference between Tabular and MOLAP/Violin was even greater. Multidimensional performed 3.7x faster, compared with 1.9x faster on the original Tabular server.

Update #2 06/06/2014:
If you’re planning on implementing a large scale Tabular model, do take the above tests into consideration. However, please be aware that there are reasons why the Tabular performance is poor here: the distinct count field has an incredibly high cardinality, and the data can’t be split up, remodelled or simplified. This will not apply to all data scenarios.
Please take a look at a recent white paper by Alberto Ferrari at SQLBI, in which he discusses a number of ways of optimising large distinct count data sets to suit the Tabular model. He’s achieved some very impressive results. Unfortunately these methods would not benefit the scenario described in this post, but it shows what can be done with Tabular if the data suits.
http://www.sqlbi.com/articles/using-tabular-models-in-a-large-scale-commercial-solution