How does the SSAS 2012 Tabular model performance change when you add more CPU sockets / NUMA nodes?

In my last post (SSAS Tabular NUMA and CPU Cores Performance) I presented the results of some testing I'd been doing on the scalability of the SSAS 2012 Tabular model, specifically the performance of distinct count measures over large data volumes (50-200m distinct customers).

The conclusion was that moving from one NUMA node (CPU socket) to two had no impact on query performance, so the second CPU was entirely wasted. This contradicted other advice and recommendations indicating that adding a second node would actually make performance worse.

After discussing the issue with a member of the SSAS development team, I learned that the method I was using to disable cores was flawed: Windows System Resource Manager (WSRM) shouldn't be used for this. So I re-ran the tests, disabling cores (and their associated memory) using MSConfig, which simulates physically removing a CPU from the server.

The test results were immediately different…

[Chart: Tabular NUMA cores test results - query duration vs. number of active cores]

The hardware setup was the same as before, but with a larger data set:

  • 30GB SSAS Tabular cube, running on a 2 x CPU, 32 core (Xeon E5-2650 @ 2GHz, 2 x NUMA nodes, hyperthreaded) server with 144GB RAM
  • SQL Server 2012 SP1 CU8 Enterprise (+ a further hotfix that resolves a problem with distinct counts >2m)
  • 900m rows of data in primary fact
  • 200m distinct CustomerKey values in primary fact
  • No cube partitioning
  • DefaultSegmentRowCount: 2097152
  • ProcessingTimeboxSecPerMRow: 0
  • CPU cores and associated memory disabled using MSConfig
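
For reference, the two VertiPaq settings above are advanced SSAS server properties. Here's a minimal sketch of checking them from PowerShell with AMO; the instance name is hypothetical, and the exact property paths are my assumption, so verify them against the advanced server properties list in SSMS:

    # Rough sketch: read the two VertiPaq server properties via AMO.
    # Assumes the SQL Server 2012 AMO assembly is installed.
    [void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.AnalysisServices")

    $srv = New-Object Microsoft.AnalysisServices.Server
    $srv.Connect("localhost\TABULAR")   # hypothetical instance name

    $srv.ServerProperties["VertiPaq\DefaultSegmentRowCount"].Value
    $srv.ServerProperties["VertiPaq\ProcessingTimeboxSecPerMRow"].Value

    $srv.Disconnect()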

The two test queries were:

  • Query 1: Simple, single value result of the total distinct customer count
  • Query 2: More complex distinct count query, sliced by a number of different attributes to give approx 600 result cells
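
For illustration, a cold-cache test run of this kind can be sketched as follows: clear the cache via XMLA, then run the MDX query. The database, cube and measure names below are hypothetical, not the actual ones from my model:

    # Sketch of a cold-cache test run (object names are hypothetical).
    Import-Module SQLASCMDLETS   # Analysis Services cmdlets, ship with SQL Server 2012

    # 1) Clear the cache for the test database via XMLA
    $clearCache = '<ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"><Object><DatabaseID>CustomerModel</DatabaseID></Object></ClearCache>'
    Invoke-ASCmd -Server "localhost\TABULAR" -Query $clearCache

    # 2) Run the test query (Query 1 style: a single total distinct count)
    $mdx = "SELECT [Measures].[Customer Distinct Count] ON COLUMNS FROM [Model]"
    Invoke-ASCmd -Server "localhost\TABULAR" -Database "CustomerModel" -Query $mdx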

As soon as the number of cores is increased above 16 (i.e. the 2nd CPU is introduced), query performance drops significantly: the complex query takes 1.45x the time to run, and the simple query almost exactly 2x.

These results now support the other theories floating around the blogosphere: adding extra CPUs not only fails to help Tabular performance, it actively hinders it.

As before, the DefaultSegmentRowCount setting gave the best performance at 2m and 4m rows; raising it further seemed to degrade performance.

Frog Blog out

© Alex Whittles, Purple Frog Systems Ltd

8 Responses to “SSAS Tabular performance – NUMA update”

  • AK:

    Hi Alex,
    I have read both your articles but I'm stuck, unable to determine why.

    Server Layout/Details:
    ****QA****
    1 node cluster
    2 SSAS (2012) instances – TAB & AS
    4 procs x 12 cores = 48 logical CPUs (0-47)
    Dell PowerEdge M910 – E7540 @ 2.00GHz with 225GB RAM
    OS: Windows Server 2008 R2 Enterprise

    History:
    The BI team came to a conclusion, after talking to Marco Russo (thank you very much sir for your time), that the MDX query would need to run on one proc, i.e. NUMA node 0. They saw a huge difference in query execution, from 7.3 sec to ~4 sec. This was achieved by setting the processor affinity, as in Test B below.

    Test:
    A) Baseline Test
    1) Clear cache (XMLA)
    2) Ran MDX => ~7+ sec

    B) Change Affinity Test
    1) Connected to TAB instance (RDP)
    2) Started Task Manager
    3) Identified the process for TAB; right click, Set Affinity
    4) Unchecked all processors
    5) Selected CPUs 0-11 (Node 0)
    6) Cleared cache (XMLA)
    7) Ran MDX => ~3.6-4.2 sec

    C) WSRM Test (http://blogs.msdn.com/b/sqlcat/archive/2013/11/05/forcing-numa-node-affinity-for-analysis-services-tabular-databases.aspx)
    1) Followed steps 2-4
    2) Brought the TAB resource down, then back up (offline/online)
    3) Cleared cache (XMLA)
    4) Ran MDX => ~7+ sec

    Any direction/help would be appreciated
    My next question would be: how would we use MSConfig in this scenario?

    Thanks in advance!
    AK

    • Alex:

      Hi AK

      Firstly, the NUMA problem cannot be resolved just by setting the processor affinity, because the real issue is memory. With a NUMA architecture, each CPU is allocated its own share of RAM. If you set the processor affinity of a process, performance will still degrade significantly whenever that CPU has to access RAM allocated to a different CPU.

      Secondly, there are reports of people getting WSRM to work, but in my experience it was not successful. It looked like it worked, but performance tests showed that it didn't – the same as you're seeing. The only methods advised by the SSAS development team members I spoke to were to physically remove the CPU, or to use MSConfig to disable all CPUs (apart from the first) and their associated memory. This is as close as you can get to physical CPU removal.
      There seems to be no reliable, supported way of setting the NUMA affinity in the Tabular model, so you should only run Tabular on a single-socket server. I expect this to change in the future, as there are plenty of core affinity settings in SSAS Tabular, they're just not implemented yet. And hopefully Tabular will become NUMA aware before too long.

      To use MSConfig.exe, click on the Boot tab, then click 'Advanced Options'. Set the 'Number of Processors' to the number of cores (threads) available in CPU0, and the 'Maximum memory' to the RAM allocated to that CPU, i.e. total RAM divided by the number of CPUs (on my server, 144GB / 2 = 72GB).
      Then reboot, and Windows will only see what you’ve allowed it to see.
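
      Under the covers those MSConfig options just write boot configuration entries, so the same thing can be sketched with bcdedit from an elevated prompt. The values below assume a server like mine (keep CPU0's 16 threads and its 72GB share of the 144GB RAM visible); adjust them for your own hardware:

        # Sketch only - the equivalent of MSConfig's Advanced Boot Options.
        bcdedit /set "{current}" numproc 16                    # only CPU0's 16 logical cores
        bcdedit /set "{current}" truncatememory 77309411328    # 72GB in bytes
        # Reboot to apply; undo later with:
        #   bcdedit /deletevalue "{current}" numproc
        #   bcdedit /deletevalue "{current}" truncatememory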

      Alex

  • Marco Russo:

    Alex is right, you cannot *guarantee* that NUMA affinity will work. It usually works because SSAS is a Windows process, and Windows tries to allocate memory on the same NUMA node where the thread requesting the allocation is running. If you have enough free memory it works, but it is important that you set the affinity mask *before* the process starts allocating memory.
    So you should restart the SSAS service and, *before* anyone connects, set the affinity mask of the process – something like the sketch below.
    Using a virtual machine with Hyper-V seems a more reliable way to guarantee you allocate memory on the same node.
    Reading AK's test, I suspect that test C was done without restarting the service, just detaching/attaching the database. This could be the reason why memory was used on another node.
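
    As a rough sketch (the service and process names below assume a default instance; a named TAB instance will have a different service name):

      # Restart the service, then pin msmdsrv to node 0's CPUs *before*
      # any query causes memory to be allocated.
      Restart-Service "MSSQLServerOLAPService"   # "MSOLAP$TAB" for a named instance
      $p = Get-Process "msmdsrv"
      $p.ProcessorAffinity = [IntPtr]0xFFF       # bitmask for CPUs 0-11 (NUMA node 0)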

    Marco

  • Chris:

    I’m a NUMA-Noob.

    I’m wondering why the option of going to the BIOS and disabling NUMA is not tried / discussed. Before I ask my IT department to do that for me, I’d like to know if there is some ‘obvious’ reason not to do that as one of the test scenarios.

    I have a server* dedicated to one instance of SSAS Tabular: one socket, Intel E5-2680; 8 cores, 16 threads. Task Manager shows it as having 8 NUMA nodes. My understanding was that each *socket* represents one NUMA node, not each *core*…so there’s my first confusion.

    I’d also like opinions as to whether/when there might be a benefit to disabling NUMA, so I don’t open a useless Helpdesk ticket.

    Thank you in advance for any insights or clarifications.

    Chris

    * Hosted in VMware
    Windows Server 2012 R2
    48 GB RAM
    1 x Intel E5-2680

    • Alex:

      Hi Chris.
      Good question. I left it out of the post as the test results were not very revealing: I found an average 8% drop in performance when disabling the NUMA setting.
      Task Manager, by default, shows one chart per core/thread, not one per NUMA node. You can change it to a per-NUMA-node display by right clicking on the chart.
      If you are genuinely seeing 8 NUMA nodes then someone may have been playing around with your Kgroup settings…
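
      You can also double check what Windows actually sees, independent of Task Manager, with a quick sketch like this (the Coreinfo line assumes you've downloaded the Sysinternals tool):

        # Sockets, cores and threads as reported by WMI:
        Get-WmiObject Win32_Processor |
            Select-Object SocketDesignation, NumberOfCores, NumberOfLogicalProcessors
        # Sysinternals Coreinfo dumps the NUMA node to CPU mapping:
        #   coreinfo -n
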
      Alex

    • Alex:

      I should add though, that your environment, test data, cube structure and test cases etc. will be different to mine, therefore please do your own performance testing!
      If you find anything interesting, please do post a follow up comment here.

  • Marco Russo:

    Chris,

    maybe the VMware hosting is the root of the 8-node representation in Task Manager. However, I remember a customer hosted on VMware who didn't see that. I don't know any more because I'm no longer in touch with that customer.

    Marco

  • Chris:

    Alex – good advice. All I need now is admin rights to the VMware server…
    Marco – Correct! Apparently VMware’s vNUMA, by default, gives every *core* a virtual socket.

    http://blogs.vmware.com/vsphere/2013/10/does-corespersocket-affect-performance.html?utm_source=longwhiteclouds

    Don’t know if this is good or bad. Testing to follow.

    Thank you to both of you!

    Chris
