Can SSAS Multidimensional be faster than SSAS Tabular for distinct counts on large datasets?
We’ve all seen how fast the SSAS Tabular model can be – it’s an in-memory, heavily compressed analytics database. However, you may have seen my previous posts on Tabular and NUMA, which show that at the moment Tabular is effectively limited to a single CPU. This obviously limits its scalability and maximum performance.
The situation
A large project I’m working on, for a multinational telecoms provider, has a problem with the query performance of its current Tabular model. It’s fine for smaller datasets but does not scale well, specifically for distinct count operations. The distinct count is the single biggest critical factor for performance in this project, and with in the region of 50m distinct customers it’s a sizeable problem. After working with Microsoft to investigate the Tabular performance we identified a bug in SSAS, which has been resolved in SQL 2012 SP1 CU9, but even with the fix the queries still don’t run fast enough on the limited hardware.
The Multidimensional model is NUMA aware (although how well it exploits NUMA depends on your workload), so it can make better use of the hardware, but we all know how poorly MOLAP scales with distinct counts, due to the storage engine IO bottleneck. So what is the answer?
What about Solid State?
The reason the multidimensional model often suffers performance degradation for distinct count operations is the heavy disk IO required. Aggregations can’t be used effectively, so the SSAS storage engine has to work through the entire dataset on disk. At the speed of conventional disks, especially on a SAN, this is not great.
There are now a number of solid state storage options available that provide enterprise grade storage with some very impressive performance.
So the question is, can a multidimensional cube running on solid state storage outperform an equivalent in-memory Tabular cube?
The tests
I ran various query performance tests with (MOLAP) cubes from 0.5Tb to 2Tb. The results in this post focus entirely on the 0.5Tb tests, using an identical dataset and equivalent test queries on a Tabular cube.
A big thank you to the following:
- Millicom – for the test dataset, and for allowing me to publish the tests results here
- Violin – for the use of a storage array for the test
- Microsoft – for the loan of a server in the MTC (Microsoft Technology Center) in Reading.
The equipment for the multidimensional cube:
- HP DL580
- 4 x Intel Xeon E7-4850 2GHz
- 40 cores, hyperthreaded to 80
- 256Gb RAM, 1067MHz
- Violin 6616 SLC, 1m IOPS, 4Gbps
- SQL Server 2012 SP1 CU9
- Distinct count measure groups matrix partitioned by month and by 60 non-overlapping buckets of CustomerKey (see the sketch after this list)
- Basic usage-based optimisation applied to the aggregations. There is a high likelihood this could be improved further
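For illustration only, a partition source query for this kind of matrix partitioning might look like the sketch below: one month combined with one contiguous, non-overlapping CustomerKey range, so that a given customer only ever falls into a single partition and the storage engine can scan partitions in parallel. The table, column names and bucket boundaries are invented for the example, not the real schema.

-- A hypothetical partition source query: month 2013-01, CustomerKey bucket 1 of 60
SELECT
    f.DateKey,
    f.CustomerKey,
    f.UsageAmount                    -- example measure column
FROM dbo.FactCustomerUsage AS f      -- hypothetical fact table
WHERE f.SnapshotMonthKey = 201301    -- the month slice
  AND f.CustomerKey >= 1             -- bucket 1 covers CustomerKeys 1 to 1,000,000
  AND f.CustomerKey <  1000001

The other buckets simply shift the CustomerKey range, giving 60 partitions per month in each distinct count measure group.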
The equipment for the Tabular cube
- IBM x3650 M4
- 1 x Intel Xeon E5-2650 2GHz
- 2nd CPU disabled to maximise performance (as per previous post)
- 8 cores hyperthreaded to 16
- 144Gb RAM, 666MHz
- SQL Server 2012 SP1 CU9
Note that the different number of CPUs is fair: Tabular is not able to benefit from multiple CPUs, whereas Multidimensional is. The factors that are unequal are the threads per CPU (16 vs 20) and the memory speed (666MHz vs 1067MHz), so please bear this in mind when looking at the performance comparison results.
The dataset is 6 monthly snapshots of 60m customers in two primary fact tables, with 300m and 1.4bn rows respectively.
The 7 query tests cover a range of workloads from very simple to very complex. Queries 1 & 2 do not include distinct count measures; all other queries include a distinct customer count from one of the primary fact tables, with varying levels of complexity, granularity and result set size. Tests 1-4 are manually created MDX queries; tests 5-7 are real-world Excel workbooks provided by users, each containing 1-6 large pivot tables.
Each test was run on a cold SSAS cache, with the Windows file cache also cleared before every test. Every test was run 3 times and the average taken.
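For reference, the usual way to force a cold SSAS cache between runs is the XMLA ClearCache command, along the lines of the snippet below (the database ID is a placeholder, not the real database); the Windows file cache is cleared separately.

<ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <!-- placeholder database ID, not the real database name -->
    <DatabaseID>TelcoAnalysis</DatabaseID>
  </Object>
</ClearCache>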
Test scenarios:
- Tabular, using the fastest performance measured from all previous configuration tests
- Multidimensional, using all CPUs and local RAID 5 disk storage
- Multidimensional, using a single CPU (single NUMA node) and Violin storage
- Multidimensional, using all CPUs and Violin storage
The Results
And when plotted on a chart…
The first 3 tests are so quick that they don’t show up on the above chart, but you can see these results in the table above.
Conclusion
It’s clear from these tests that solid state storage can breathe new life into an IO constrained multidimensional cube. This really applies to any case where you’re making heavy use of distinct counts over a large, high cardinality dataset. In this case the average performance gain over the same disk based cube was 2.6x.
Multidimensional on solid state storage can be on average 1.9x faster than the Tabular model, when working with large high cardinality distinct count operations.
Although other research and evidence suggests that Multidimensional cubes don’t perform well with multiple NUMA nodes, these tests show that if your workload is very heavily constrained by the storage engine, you can gain significant performance from multiple CPUs/NUMA nodes.
And Finally…
As well as basic timing tests using Profiler, I also monitored the CPU and disk throughput in Perfmon. The following two screenshots show the difference between the multidimensional cube on disk and then the same cube on Violin. This is for test 4, which showed the most significant performance improvement from disk to solid state.
These traces show test 4, which has the following query structure:
SELECT
  { [Measures].[Fully Additive Measure 1]
  , [Measures].[Fully Additive Measure 2]
  , [Measures].[Fully Additive Measure 3]
  , [Measures].[Distinct Measure 4] } ON COLUMNS
, NON EMPTY
  { [Dimension 1].[Attribute 1].[Attribute 1].MEMBERS   --6 members
  * [Dimension 2].[Attribute 2].[Attribute 2].MEMBERS   --12 members
  * [Dimension 3].[Attribute 3].[Attribute 3].MEMBERS   --5 members
  * DESCENDANTS ([Date].[Calendar].[Year].&[2013], [Date].[Calendar].[Month])   --6 members
  } ON ROWS
FROM [Cube]
This returns 700 distinct tuples from the crossjoin, with an average of 400k distinct customers per tuple.
This shows the potential benefit of maximising the storage IO throughput. In the first image, the IO throughput is constrained to under 100Mbps, which means only 3% of the CPU can be utilised. In the second trace, with the Violin array providing up to 1.7Gb per second of throughput, the CPU is able to work at an average of 60% for the duration of the query, with a clear impact on the query performance.
These charts also show the reduced IO latency (IO response time). The average IO response time in the first trace is between 40-60ms for the disk IO, whereas for the Violin array it barely registers on the scale of the chart. I found the IO response time for the Violin tests averaged around 450μsec. The SLC unit runs at ~200μsec latency for a 4k chunk size, and we’re requesting 8k chunks (roughly two 4k operations each), so the 450μsec we saw is about what you’d expect.
There are plenty more details and tests that I’m running; I may post further results if I get time. Until then…
Frog-Blog-Out
Update #1 06/06/2014:
In the interests of completeness and fairness, I subsequently re-ran the Tabular model performance tests on the same (Microsoft supplied) server as the MOLAP model. The only difference was that, to maximise performance, I restricted the server to a single CPU/NUMA node.
Somewhat surprisingly, the performance difference between Tabular and MOLAP/Violin was even greater. Multidimensional performed 3.7x faster, compared with 1.9x faster on the original Tabular server.
Update #2 06/06/2014:
If you’re planning on implementing a large scale Tabular model, do take the above tests into consideration; however, please be aware that there are specific reasons why the Tabular performance is poor here: the incredibly high cardinality of the distinct count field, and the fact that the data can’t be split up, remodelled or simplified. This will not apply to all data scenarios.
Please take a look at a recent white paper by Alberto Ferrari at SQLBI, in which he discusses a number of ways of optimising large distinct count datasets to suit the Tabular model. He’s achieved some very impressive results. Unfortunately these methods would not benefit the scenario described in this post, but it shows what can be done with Tabular if the data suits.
http://www.sqlbi.com/articles/using-tabular-models-in-a-large-scale-commercial-solution
Comments

Hi Alex,
Great stuff. I’m jealous of you getting to play with the big boy toys, and there is so much hype about Tabular being “faster” without good tests like this to back it up.
A few things which I’d love to see tested, which maybe will put Tabular in a better light:
a) Use of 1,800MHz RAM on the Tabular box. It does seem to be super sensitive to memory performance, so it’s a shame the MTC only had the 666MHz memory. If Tabular had 2.5x faster memory performance that may have swung things a bit in some of the tests.
b) Did you try a DAX query instead of an MDX query on Tabular? I’m personally seeing a much larger overhead to MDX on Tabular than I assumed there would be, so am pretty much sticking to DAX where possible (even in SSRS!)
Hi Bob
Yeah, I’m intrigued by how much memory speed impacts Tabular, so that would be a good test. The Tabular tests were on the client’s server, not the MTC box. If I get time I’ll try to set up Tabular on the MTC server and update the results.
In the original Tabular tests I did compare DAX against MDX and found virtually zero difference. However, this is because the queries at the moment are pretty simple distinct count queries; the storage engine accounts for by far the most significant load, so the formula engine and the DAX/MDX translation seem insignificant in these tests.
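To illustrate the kind of simple distinct count query being compared, a DAX version would look something like the sketch below (the table and column names are invented for the example, not the real model):

EVALUATE
SUMMARIZE (
    'Fact Usage',                                     -- hypothetical fact table
    'Date'[Calendar Month],                           -- group by month
    'Dimension 1'[Attribute 1],                       -- and by one dimension attribute
    "Distinct Customers",
        DISTINCTCOUNT ( 'Fact Usage'[CustomerKey] )   -- the distinct count itself
)
ORDER BY 'Date'[Calendar Month]

Both the MDX and the DAX versions end up driving the same VertiPaq storage engine scans, which is presumably why the difference was negligible here.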
Also, the client’s entire workload is MDX through Excel pivots, so I’m trying to replicate the real world usage as accurately as possible.
I’ll update the post with new Tabular performance stats when I can.
Regards
Alex
Hi Bob
Also worth mentioning that with 1800MHz RAM (and even 1667MHz) we’re currently very limited in how much RAM the CPUs can support, meaning we can only run small Tabular models on the fastest RAM. And small models aren’t really the problem; they run super fast all the time anyway.
However, the Xeon E7v2 chips may change the game slightly by allowing far more memory at the faster speeds. If you have an E7v2 running 1800MHz RAM to hand then please let me know how it performs!
Alex
As you say, the E7v2 is a game changer for sure: you can get 756GB of RAM at 1866MHz with 2x 8-core 3.3GHz CPUs for under 20k euro. This beast was made for SSAS!
We have a couple of older Dell RX720s in our test lab on the older E5 and E5v2 series. They only have 2.6-2.8GHz CPUs, 1,600MHz FSB and 16 internal disks rated at 1.6 GB/sec, but at a cost of under 15k it usually blows away customer environments that may have paid 400k+ for slower servers and an expensive SAN.
Happy to run some tests to see how that does with Tabular. One of the servers is sitting completely blank here at the moment.
Bob (and Alex), I think it would be very beneficial to write an article calculating the ROI of this type of server, and showing some example configurations for Tabular. This would be very useful to other customers and would help them sell Tabular “internally”, explaining to IT why they should buy different hardware for Tabular. I spent hours on this, and an “article on the Internet” would be very helpful in these fights.
I am available to contribute to that – I don’t have access to many different hardware configurations to do the tests, but I can contribute by gathering the numbers (from testers) and writing/reviewing an article.
Alex,
I have noted the same things that Bob did. I can provide you with a small benchmark (a database to load and DAX queries to run) to test hardware, especially for single threaded operations (such as the formula engine). Talking about ROI, I would also consider spending money on a faster CPU before buying external I/O hardware, which I imagine is more expensive.
That said, I know that for certain databases and queries OLAP is better, because you can rely on aggregations. But the TCO is higher because of the maintenance it requires (not counting the time to correctly set up and tune the aggregations).
I’m also worried that faster RAM might not be useful without a faster CPU – but I’m not entirely sure about that; it’s just a thought.
Just to explain why CPU and RAM matter: I have a notebook that runs Tabular 2x faster than an Azure VM, and a desktop that runs it 4x faster than an Azure VM, and neither of these machines cost more than 4,000 USD. When I did the same test with a customer’s hardware, they chose to buy another server instead of using a VM on their big baby, which had many, many cores running at 2.6GHz and RAM at 1300MHz. The new hardware with a 3.5GHz CPU and 1800MHz RAM is 2x faster. However, the CPU clock is not always a good indication; I’ve seen the X7560 performing well at 2.26GHz. It’s really hard to predict performance with Tabular without a specific test on the hardware.
Marco
Hi Marco
Yes I agree, predicting performance for Tabular is a dark art, and it’s causing some headaches 🙂 I’ve been trying to get access to a machine with faster memory to test with, but have not managed it yet. I’m going to try and find time to test Tabular on the MTC server to level the playing field, but time is always in short supply.
“I’m also worried that faster RAM might not be useful without a faster CPU” – In the tests I’ve run on other hardware, the CPU has more often than not been the bottleneck, maxed out at 100%, so I’m also doubtful how much difference faster RAM would make on its own. However, as you said, combined with more CPU grunt it will certainly help. But with Tabular limited to a single NUMA node, there’s a limit to how much CPU power it’s able to use at the moment.
I’m hopeful for better CPU utilisation in Tabular vNext…
Alex
Hi Alex,
You should also look at CPU utilization for a single thread – the formula engine is single threaded, and in those cases the CPU clock (but also RAM) might be very important. Moreover, only one thread per segment is used by Tabular, so consider the segment size (default 8 million rows) and the number of rows in a table in order to work out how many cores might be used at most during VertiPaq scans.
Marco
Hi Marco
The Tabular results I’ve used here were the fastest I clocked in quite a large set of tests, including different cores, NUMA nodes, segment size and a number of other factors. I wrote up the segment size test results in another post:
https://purplefrogsystems.com/2013/12/ssas-tabular-performance-defaultsegmentrowcount/
My expectation was that a higher segment size should improve performance, but I found (with this dataset) the optimal size was 1m or 2m, with distinct count query time increasing rapidly when segment size was raised over 8m.
Regarding CPU thread utilisation: in the Tabular tests all threads were maxed out at 100% for most of the queries, with only minor fluctuation.
With the Multidimensional tests, the vast bulk of the workload is storage engine, so it can take advantage of multiple threads. What I found in these tests was that, when using disk storage, multiple threads were used but only at an average 2-4% level. When using the Violin storage, one socket would be consistently maxed out at 100%, while the other three would fluctuate around 25-40%, giving a total CPU utilisation of 40-60%. None of these queries are formula engine constrained, so the single threaded FE is negligible in this scenario.
However just out of interest, I did also run tests on an M2M model (with a large non-compressed bridge table) which, as you’d expect, after an initial burst of high CPU and IO settled down to a single formula engine thread maxed out for a considerable duration. I may post the screenshots in another blog, to encourage people to read your M2M revolution paper.
Regards
Alex
I just commented in the related blog post that SSAS 2012 SP1 CU9 made a very important improvement to DISTINCTCOUNT calculation. In general, reducing segment size might improve parallelism on small tables, but when you have enough segments a larger segment usually provides better compression, which also results in better performance. We have seen exactly the opposite behavior, improving results by using larger segments instead of smaller ones, at least with regular measures (not distinct counts).
Unfortunately, the improvement does not help M2M performance, which still suffers when a dimension involved in an M2M relationship has more than 1-2 million rows (depending on hardware). I hope this will improve in a future update/release.
Hi Marco
Yes, this test and the NUMA testing in earlier posts used either CU9 or CU8 plus the distinct count hotfix. Even with the hotfix applied I still found better compression and faster processing speeds with the smaller segment size.
I’m not surprised that you’ve found better performance with larger segment sizes – my results came as a surprise to me and to the Microsoft support guy helping with this performance testing.
In this project I’m only really concerned with distinct counts, not normal measures, so that may explain the different behaviour. It could also just be a peculiarity of the [slightly odd] dataset I’m using, but the results are repeatable and consistent. I’ve not managed to explain it yet. If you have any results from segment size testing on large, high cardinality distinct counts I’d be interested to see them, to find out whether it’s just me!
Worth noting that the distinct count >2m fix did not make it in time to be included in SQL 2014 RTM, so be careful if you have large distinct counts in a 2014 model. From what I understand it should make it into 2014 CU1, so not long to wait.
I have a customer with a large distinct count column (20m). I will ask them whether they can do some tests and I’ll update you on the results. It will not be quick, but I will try.
Alex,
Great work, and thanks for taking the time to do the comparisons. Now, with the lower cost of MPP, such as Azure Data Warehouse, and access to a more complete set of columnstore indexes in SQL Server 2016, what would be great is if you added one more approach to the mix: ROLAP, using Multidimensional.
Potentially gone are the database size limits, especially with Azure DW, where you just add more power. And ROLAP allows us to use Multidimensional more as a virtual cube view/data mart. Gone are the data warehouse to MOLAP ETL processes. Many pluses, and gone are all those Tabular downsides. Just a thought.