
Alex Whittles

Why you should start presenting at conferences

Have you presented a session at a user group or a conference? No? Then read on to find out why you need to change that.

Many of the readers of this blog will be active members of the SQL Server community, reading/writing blogs themselves, going to conferences or user groups, participating in webinars, etc. We’re fortunate in the UK to have the most active SQL Server community in the world. Yes there are physically more user groups and conferences in the US, but per capita the UK had over twice as many SQL Server conferences as the US last year. So if you’re in the UK you don’t have any excuse not to get involved. That doesn’t mean the rest of the world misses out – there were 85 SQL Saturdays around the world in 2013, half of them outside the US. And then there are user groups, other conferences, webinars etc.

Attending these events is one thing, and will without question help your career, both through increased knowledge and through networking (the who’s who of SQL Server are all at these events).

But to really benefit the most you need to consider speaking.

“Speaking?!?” I hear you cry, “No chance” – followed by a number of excuses:

  • Speaking is only for the experts, I don’t know enough
  • It’s far too scary – I couldn’t stand up and talk in front of 500 people for an hour
  • Why would anyone select me to speak? No-one knows me
  • What if I was asked a question I couldn’t answer? I’d look stupid, they’d laugh me off the stage
  • Why should I bother, I wouldn’t get anything from it

OK, so let me work through these one by one.

Speaking is only for the experts, I don’t know enough

*Wrong*

Yes, the top experts in the world come and speak at conferences, but not everyone wants to have their brain fried by a Brent Ozar, or be dazzled by system internals from MVPs and MCMs. Some people want to learn the basics. Conferences always feature a selection of levels from 100 through 500 (‘introduction’ through to ‘WTF?!’). If no-one presented the basic beginners’ introduction to topics, how would anyone start to learn new skills?

There will ALWAYS be people that know more than you, but conversely there will ALWAYS be people who know less than you about a particular topic.

Therefore even if you’re not an expert, your experience will be of interest to someone, and you could help them learn.

It’s far too scary – I couldn’t stand up and talk in front of 500 people for an hour

*Yes – I get that*

You’re absolutely on the money here: it is scary. In the same way that diving off the 10m board at your local swimming baths is pretty terrifying. At first. Your first go on a roller coaster was a little daunting too. But you loved it, right?

The rest of the community is here to help you. You wouldn’t dive straight off the 10m board, would you? You’d start on the 1m, then work up to 3m, then 5m before trying the big one.

It’s the same with speaking. Start off doing an informal talk to 10-20 people at your local user group, or a 2-5 minute lightning talk at a bigger event. Use those sessions to find your feet and get used to being on stage, to refine material, get constructive feedback, and improve your slides, demos and patter.

Before you know it you’ll be desperate to get on the big stage!

Why would anyone select me to speak? No-one knows me

*Because you’re different*

When trying to select speakers for an event, it’s great to have new faces, new topics, or even new perspectives on old topics – it adds variety to the line-up. In your work you’ve picked up different skills from someone else in a different job using the same tools, so your personal experience is new to an audience; they haven’t heard about the problems that YOU have solved and how YOU solved them. Your personal experience is what makes you different, and interesting.

It’s also handy if you’ve presented the session before to a test audience – it will give the selectors confidence that you can do it. See the earlier point about practising at a user group… Being an active part of the user group scene is a good way of getting to know conference organisers, and of course other like-minded people in your area – always good for networking.

What if I was asked a question I couldn’t answer? I’d look stupid, they’d laugh me off the stage

*Wrong*

Yes there’s a good chance there will be someone in the audience who knows more than you about a specific topic, or who asks a question that you can’t answer. But the SQL community is a VERY friendly place. We’re all there to learn, not to humiliate you on stage. If you don’t know the answer to a question, just be honest; say “I’m sorry, I don’t know the answer to that, but let me look into it afterwards and I’ll get back to you.” That then feeds back into your own knowledge, and you can incorporate the answer into your slides for the next time you give the talk.

No-one will mind if you don’t know the answer – you will NOT be laughed off stage!

Why should I bother, I wouldn’t get anything from it

*Wrong*

This is actually why I started to think about writing this post. I was watching a SciShow YouTube video (presented by the awesome Hank Green) which talked about how you learn more when you teach than by just reading/revising/etc. (If you watch the video, the segment starts at 1:49). It’s based on research from Washington University, which found that you can learn something up to 25-50% better when you’re planning on teaching the material to someone else.

This really struck a chord with me. One of the reasons that I write this blog and present at conferences is that it helps me learn. If I find a new and interesting item or topic I read about it, I investigate it and play around. If time allows I then write a blog, if the blog gains traction then I turn it into a talk. This process helps me focus on the topic, and helps structure my own investigation and learning. Every time I present on a topic I get feedback and questions, this then helps me learn even more by investigating the answers to those questions – it’s a great learning cycle.

In the process it helps others learn, and everyone wins. Smashing.

Going back to the second point, yes I found it a little daunting at first. It took me a few user group talks to find my rhythm, but when I got selected for my first talk at SQLBits I was ready, and I got such an immense buzz from it – and I haven’t looked back since.

Conclusion

If you don’t currently attend SQL community events (in fact, not just SQL – any tech!) then get involved.

If you attend but don’t speak – do give it some serious thought – you may just surprise yourself!

If you don’t know where to look for community activities, here are some starting points for you:

UK

  • UK SQL Server User Groups – in your local town/city, every month or two, 3 hours in an evening, free.
  • SQL Relay – full day SQL conferences, touring the UK annually (8 events, October 13-31 2014), free.
  • SQLBits – 3 day mega conference in the UK annually, paid/free

WorldWide

  • SQL Saturday – 2 day local conferences, all over the world, paid/free
  • PASS – Professional Association of SQL Server, SQL user groups worldwide, free
  • PASS Summit – the largest SQL conference in the world – in the US, paid

 

Frog-Blog-Out

SSAS Tabular String Imported as Integer

Let me start by saying that I think the SSAS Tabular model is great. But…. there are a number of problems that Microsoft still needs to iron out.

Not least of which is being able to import data properly directly from CSV/Text files. Yes you can import directly from csv, but you are given absolutely no control over the process, and this can lead to some serious problems.

One of these issues is the rather odd automated data type selection that is used by the Tabular import process. A column consisting of a combination of alpha and numerical text is more often than not treated as an integer, with all text information stripped out. And the Tabular model designer provides absolutely no way of changing this behaviour.

For example, take the following csv file, containing just three columns: ID, ItemCode and ItemValue.

[Screenshot: ChangeTabularDataType01]
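The screenshot showed a file along these lines (values invented for illustration – note the non-numeric ItemCode in the 5th row):

ID,ItemCode,ItemValue
1,1001,26.99
2,1002,13.75
3,1003,9.99
4,1004,45.00
5,AB12,12.50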

The second column should clearly be interpreted as text, as the 5th row contains a non-numerical value. However when this is imported into the Tabular model, it treats the column as an integer. As you can see from the screenshot below, because the 5th row doesn’t contain a valid integer, the value is just ignored.

[Screenshot: ChangeTabularDataType02]

One would expect that we could simply go into the table in the designer, and update the Data Type property for the column. No. This just takes the numerical value and formats it as text. But any non-numerical values are still stripped out. The problem is that when the Tabular model reloads the file, it detects that the data type is an integer, and there’s nothing we can do to override it.

The only way of getting around this is by wrapping the strings in quotes within our csv.

[Screenshot: ChangeTabularDataType03]
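In other words, the same illustrative file becomes:

ID,ItemCode,ItemValue
1,"1001",26.99
2,"1002",13.75
3,"1003",9.99
4,"1004",45.00
5,"AB12",12.50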

This is irritating, but we really don’t have a choice; just remember to be explicit in any csv definition that is to be imported directly into a Tabular model.

However, what if we find this out too late? If we’ve already built the model, added all of our DAX calculations in, set up the relationships etc., how do we change the format of an already created column?

Firstly we have to get the csv updated to wrap every string column in quotes.

If we try and just reimport this we’ll get an error “Unable to convert a value to the data type requested for table ‘xxx’ column ‘xxx’. The current operation was cancelled because another operation in the transaction failed.”

[Screenshot: ChangeTabularDataType04]

So first you have to change the column data type for the table in the designer. Click on the column, then in the column properties, change Data Type to Text.

[Screenshot: ChangeTabularDataType05]

Once this is done, you can reprocess the table and import the actual text.

This is all well and good, and works most of the time. However, I recently encountered a Tabular model which had this problem, and the above process wouldn’t work. So the only solution I found was to go rooting around in the Tabular model’s xml source code, and force it to change. Hacking it manually worked a treat, so I thought I’d share the process here. Just be careful – always keep a backup of your files before you change anything!

To do this, open up your .bim file in a suitable text editor. I highly recommend Notepad++, as it works great for XML.

We need to change three things:

1) Change the data type for the table column key & name, within the dimensions

<Alter><ObjectDefinition><Database><Dimensions><Dimension><Attributes><Attribute><KeyColumns><KeyColumn>
change <DataType>BigInt</DataType> to <DataType>WChar</DataType>     (Note this may be Int or BigInt)
change <DataSize>-1</DataSize> to <DataSize>32768</DataSize>

Then do the same for the <NameColumn>

[Screenshot: ChangeTabularDataType06]

2) Change the data type for the table column key & name in the corresponding cube

<Alter><ObjectDefinition><Database><Cubes><Cube><MeasureGroups><MeasureGroup><Dimensions><Dimension><Attributes><Attribute><KeyColumns><KeyColumn>
change <DataType>BigInt</DataType> to <DataType>WChar</DataType>     (Note this may be Int or BigInt)
change <DataSize>-1</DataSize> to <DataSize>32768</DataSize>

Then do the same for the <NameColumn>

3) Change the definition of the csv datasource

<Alter><ObjectDefinition><Database><DataSourceViews><DataSourceView><Schema><xs:schema><xs:element><xs:complexType><xs:choice><xs:element><xs:complexType><xs:sequence><xs:element>
delete 'type="xs:int"' from the element, and replace it with a SimpleType and restriction defining the string:

<xs:simpleType><xs:restriction base="xs:string"><xs:maxLength value="32768" /></xs:restriction></xs:simpleType>

[Screenshot: ChangeTabularDataType07]

 

Then save the .bim file, reload your Tabular model, and reprocess the table. Problem solved.

Frog-Blog-Out

 

 

Excel doesn’t open CSV files correctly from sp_send_dbmail

A nice little nugget today for anyone who uses SQL dbmail to send data in csv attachments.

Sometimes, depending on your language settings, CSV files don’t open correctly in Excel. Excel ignores the ‘,’ and puts all data in column A. You then have to go through the hassle of closing the file and opening it through the flat file wizard.

 

[Screenshot: excelcsv1]

There’s a very nice solution to this that I found in this thread: insert a new row at the top of the file containing “sep=,”. This forces Excel to understand that it is a comma delimited file, and ensures that it will open correctly.
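So the first few lines of the csv need to end up looking something like this (illustrative data):

sep=,
Column1,Column2,Column3
1,Widget,2014-01-01
2,Gadget,2014-01-02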

The only problem is: how do you insert a header line when using dbmail?…

[Caution – ugly hack alert…]

You can alter the name of the first column to include this header text. We simply rename “Column1” to “sep=,{CR}{LF}Column1”. Then when dbmail prints out the column headers in the file, the Column1 name will be split over two lines, preceded by “sep=,”.

[Screenshot: excelcsv2]

Excel treats this first row as an instruction and does not display it; it just uses it to make sure it formats the data correctly.

So how do we do this in SQL?

DECLARE @Column1Name VARCHAR(255)
DECLARE @Query VARCHAR(2048)
SET @Column1Name = '[sep=,' + CHAR(13) + CHAR(10) + 'Column1]'
SET @Query = 'SELECT Column1 AS ' + @Column1Name + ', Column2, Column3 FROM myTable'

EXEC msdb.dbo.sp_send_dbmail
@profile_name='XXX' --your dbmail profile name
,[Other parameters as required]
,@query=@Query
,@attach_query_result_as_file=1
,@query_attachment_filename='QueryResults.csv'
,@query_result_separator=',' --enforce csv
,@query_result_no_padding=1 --trim
,@query_result_width=32767  --stop wordwrap

And there you go. It’s a bit of a hack, but works well.

Frog-Blog-Out

Analysis Services Tabular or Multidimensional? A performance comparison

[Screenshot: molapviolinsummarychart]

Can SSAS Multidimensional be faster than SSAS Tabular for distinct counts on large datasets?

We’ve all seen how fast the SSAS Tabular model can work – it’s an in-memory, heavily compressed analytics database. However you may have seen my previous posts on Tabular and NUMA, which show that at the moment Tabular is limited to a single CPU. This obviously limits its scalability and maximum performance.

The situation

A large project I’m working on, for a multinational telecoms provider, has a problem with the query performance of their current Tabular model. It’s fine for smaller datasets but does not scale well, specifically with distinct count operations. The distinct count is the single most critical performance factor for this project. With in the region of 50m distinct customers, this is a sizeable problem. After working with Microsoft to investigate the Tabular performance we identified a bug in SSAS, which has been resolved in SQL 2012 SP1 CU9, but performance still wasn’t fast enough on the limited hardware.

The Multidimensional model is NUMA aware (although how well it is supported depends on your workload), so it can make better use of the hardware, but we all know how poorly MOLAP scales with distinct counts, due to the storage engine IO bottleneck. So what is the answer?

What about Solid State?

The reason the multidimensional model often experiences performance degradation for distinct count operations is the heavy disk IO required. Aggregations can’t be used effectively, so the SSAS storage engine has to work through the entire dataset on disk. With the speed of normal disks, especially when on a SAN, this is not great.

There are now a number of solid state storage options available that provide enterprise grade storage with some very impressive performance.

So the question is, can a multidimensional cube running on solid state storage outperform an equivalent in-memory Tabular cube?

The tests

I ran various query performance tests with (MOLAP) cubes from 0.5Tb to 2Tb. The results in this post are focusing entirely on the 0.5Tb tests, with an identical dataset and equivalent test queries on a Tabular cube.

A big thank you to the following:

  • Millicom – for the test dataset, and for allowing me to publish the tests results here
  • Violin – for the use of a storage array for the test
  • Microsoft – for the loan of a server in the MTC (Microsoft Technology Center) in Reading.

The equipment for the multidimensional cube:

[Screenshot: violinarray]

  • HP DL580
  • 4 x Intel Xeon E7-4850 2Ghz
  • 40 cores, hyperthreaded to 80
  • 256Gb RAM, 1067Mhz
  • Violin 6616 SLC, 1m IOPS, 4Gbps
  • SQL Server 2012 SP1 CU9
  • Distinct count measure groups matrix partitioned by month, and 60 non-overlapping buckets of CustomerKey
  • Basic usage-based optimisation applied to aggregations. There is a high likelihood this could be improved further

[Screenshot: sqlserver]

The equipment for the Tabular cube:

Note that the different number of CPUs is fair: Tabular is not able to benefit from multiple CPUs, whereas Multidimensional is. The factors that are unequal are the threads per CPU (16 vs 20) and memory speed (666Mhz vs 1067Mhz), so please bear this in mind when looking at the performance comparison results.

The dataset is 6 monthly snapshots of 60m customers in two primary fact tables with 300m and 1.4bn rows in each.

The 7 query tests cover a range of workloads from very simple to very complex. Queries 1 & 2 do not include distinct count measures; all other queries include a distinct customer count from one of the primary fact tables, with varying levels of complexity, granularity and result set size. Tests 1-4 are manually created MDX queries; 5-7 are real-world Excel workbooks provided by users, each containing 1-6 large pivot tables.

Each test was run on a cold SSAS cache, and with the windows file cache also cleared before every test. Every test was run 3 times, with the average taken.
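For anyone repeating the methodology: the SSAS cache can be cleared between runs with an XMLA ClearCache command along these lines (the DatabaseID is illustrative), while the Windows file cache needs clearing separately, for example with Sysinternals RAMMap:

<ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>MyTestDatabase</DatabaseID>
  </Object>
</ClearCache>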

Test scenarios:

  • Tabular, using the fastest performance measured from all previous configuration tests
  • Multidimensional, using all CPUs and local Raid 5 disk storage
  • Multidimensional, using a single CPU (single NUMA node) and Violin storage
  • Multidimensional, using all CPUs and Violin storage

The Results

[Screenshot: molapviolinresulttable]

And when plotted on a chart…

[Screenshot: molapviolinresultchart]

The first 3 tests are so quick that they don’t show up on the above chart, but you can see these results in the table above.

Conclusion

It’s clear from these tests that solid state storage can breathe new life into an IO constrained multidimensional cube. This really applies to any case where you’re making heavy use of distinct counts over a large, high cardinality dataset. In this case the average performance gain over the same disk based cube was 2.6x.

Multidimensional on solid state storage can be on average 1.9x faster than the Tabular model, when working with large high cardinality distinct count operations.

Although other research and evidence suggests that Multidimensional cubes don’t perform well with multiple NUMA nodes, these tests show that if your workload is very heavily constrained by the storage engine, you can gain significant performance from multiple CPUs/NUMA nodes.

And Finally…

As well as basic timing tests using Profiler, I also monitored the CPU and disk throughput in Perfmon. The following two screenshots show the difference between the multidimensional cube on disk and then the same cube on Violin. This is for test 4, which showed the most significant performance improvement from disk to solid state.

[Screenshot: perfmondisk]

[Screenshot: perfmonviolin]

These traces show test 4, which has the following query structure:

SELECT {   [Measures].[Fully Additive Measure 1]
         , [Measures].[Fully Additive Measure 2]
         , [Measures].[Fully Additive Measure 3]
         , [Measures].[Distinct Measure 4]} ON COLUMNS ,
NON EMPTY {[Dimension 1].[Attribute 1].[Attribute 1].MEMBERS  --6 members
         * [Dimension 2].[Attribute 2].[Attribute 2].MEMBERS  --12 members
         * [Dimension 3].[Attribute 3].[Attribute 3].MEMBERS  --5 members
         * DESCENDANTS ([Date].[Calendar].[Year].&[2013], [Date].[Calendar].[Month])} ON ROWS  --6 members
FROM [Cube]

This returns 700 distinct tuples from the crossjoin, with an average of 400k distinct customers per tuple.

This shows the potential benefit of maximising the storage IO throughput. In the first trace, the IO throughput being constrained to under 100Mbps means that only 3% of the CPU can be utilised. In the second trace, by providing up to 1.7Gb per second of throughput, the CPU is able to work at an average of 60% for the duration of the query, with a clear impact on the query performance.

These charts also show the reduced IO latency (IO response time). The average IO response time in the first trace is between 40-60ms for the disk IO, while the Violin array barely registers on the scale of the chart. I found the IO response for the Violin tests was averaging 450μsec. The SLC unit runs at ~200μsec latency at the 4k chunk size, so the 8k chunks we’re asking for fully complete in the 450μsec that we saw.

There are plenty more details and tests that I’m running; I may post further details if I get time. Until then…

Frog-Blog-Out

Update #1 06/06/2014:
In the interests of completeness and fairness, I subsequently re-ran the Tabular model performance tests on the same (Microsoft supplied) server as the MOLAP model. The only difference being that to maximise the performance I restricted the server to a single CPU/NUMA node.
Somewhat surprisingly, the performance difference between Tabular and MOLAP/Violin was even greater. Multidimensional performed 3.7x faster, compared with 1.9x faster on the original Tabular server.

Update #2 06/06/2014:
If you’re planning on implementing a large scale Tabular model, do take the above tests into consideration, however please be aware that there are reasons why the Tabular performance is poor here: the distinct count field has incredibly high cardinality, and the data can’t be split up, remodelled or simplified. This will not apply to all data scenarios.
Please take a look at a recent white paper by Alberto Ferrari at SQLBI, in which he discusses a number of ways of optimising large distinct count data sets to suit the Tabular model. He’s achieved some very impressive results. Unfortunately these methods would not benefit the scenario described in this post, but it shows what can be done with Tabular if the data suits.
http://www.sqlbi.com/articles/using-tabular-models-in-a-large-scale-commercial-solution

SSAS Tabular performance – NUMA update

How does the SSAS 2012 Tabular model performance change when you add more CPU sockets / NUMA nodes?

In my last post (SSAS Tabular NUMA and CPU Cores Performance) I presented the results of some testing I’d been doing on the scalability of the SSAS 2012 Tabular model. Specifically with the performance of distinct count measures over large data volumes (50-200m distinct customers).

The conclusion was that moving from 1 NUMA node (CPU socket) to 2 had no impact on query performance, so the 2nd CPU is entirely wasted. This actually contradicted other advice and recommendations that indicated that adding a second node would actually make the performance worse.

After discussing the issue with a member of the SSAS development team, they advised that the method I was using to disable cores was flawed, and that we shouldn’t be using Windows System Resource Manager. So I re-ran the tests disabling cores (and associated memory) using MSConfig, simulating a physical core removal from the server.
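If you want to reproduce this kind of test, the equivalent changes can also be made with bcdedit from an elevated command prompt – MSConfig edits the same boot options. The values below are illustrative for restricting this server to one 16-thread NUMA node; note your original settings before changing anything:

REM Limit Windows to 16 logical processors (one CPU/NUMA node)
bcdedit /set numproc 16

REM Optionally cap the usable memory to match (value in bytes)
bcdedit /set truncatememory 77309411328

REM To revert:
bcdedit /deletevalue numproc
bcdedit /deletevalue truncatememory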

The test results were immediately different…

[Screenshot: TabularNUMACoresTest]

The hardware setup was the same as before, but with a larger data set:

  • 30Gb SSAS tabular cube, running on a 2 x CPU 32 core (Xeon E5-2650 2Ghz, 2 x NUMA nodes, hyperthreaded) server with 144Gb RAM
  • SQL Server 2012 SP1 CU8 Enterprise (+ a further hotfix that resolves a problem with distinct counts >2m)
  • 900m rows of data in primary fact
  • 200m distinct CustomerKey values in primary fact
  • No cube partitioning
  • DefaultSegmentRowCount: 2097152
  • ProcessingTimeboxSecPerMRow: 0
  • CPU cores and associated memory disabled using MSConfig

The two test queries were

  • Query 1: Simple, single value result of the total distinct customer count
  • Query 2: More complex distinct count query, sliced by a number of different attributes to give approx 600 result cells

As soon as the cores are increased above 16 (i.e. the 2nd CPU is introduced), the queries take 1.45x and 2x the time to run. Query performance drops significantly. The simple query takes almost exactly double the time.

These results now support other theories floating around the blogosphere, that adding extra CPUs not only doesn’t help the tabular performance, it actually significantly hinders it.

As before, the default segment count setting gave the best performance at 2m and 4m. Raising it seemed to degrade performance.

Frog Blog out

Renaming an SSAS Tabular Model

I came across a frustrating problem today. I’d just finished processing a large tabular cube (SQL Server 2012), which had taken 11 hours in total.

On trying to connect to the cube to test it, I realised I’d made a schoolboy error; the database was named correctly, but the model inside it was named MyCubeName_Test instead of MyCubeName. No problem, I’ll just right click the cube in SSMS and rename it. Well, no, there is no option to rename a model, just the database. I didn’t fancy doing a full reprocess, but luckily a little digging in the xml files presented a solution.

  1. Detach the cube
  2. Open up the cube’s data folder in explorer (x:\xx\OLAP\data\MyCubeName.0.db, or whatever it happens to be in your case)
  3. Find the Model.xx.cub.xml file, and open it in Notepad++ (other text editors are available…)
  4. Search for the <Name> tag, and just change the name inside it (see the fragment after this list)
  5. Save the file and close it
  6. Re-attach the cube
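For step 4, the fragment you’re hunting for looks something like this (names illustrative):

<Cube>
  <ID>MyCubeName_Test</ID>
  <Name>MyCubeName_Test</Name>  <!-- change just this value to MyCubeName; leave the ID alone -->
  <!-- ...rest of the cube definition... -->
</Cube>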

Simples

Frog-Blog-Out

SSAS Tabular – NUMA and CPU Cores Performance

[UPDATE] After further investigation, I found that the tests in this post were inaccurate and the results unreliable. Updated NUMA test results here

In my last post (SSAS Tabular Performance – DefaultSegmentRowCount) I presented some analysis of the query performance impact of changing the DefaultSegmentRowCount setting. This post describes the next tests that I ran on the same system, investigating the impact of restricting SSAS to just 1 NUMA node instead of the 2 available on the server.

It’s well known that SSAS Tabular is not NUMA aware, so it’s common to see advice recommending affiliating SSAS to a single NUMA node to improve performance.

From what I’d read, I was expecting that by affiliating SSAS to a single NUMA node that the query performance would improve slightly, maybe 10-30%.

Recap of the setup:

  • 7.6Gb SSAS tabular cube, running on a 2 x CPU 32 core (Xeon E5-2650 2Ghz, 2 x NUMA nodes) server with 144Gb RAM
  • SQL Server 2012 SP1 CU7 Enterprise
  • 167m rows of data in primary fact
  • 80m distinct CustomerKey values in primary fact
  • No cube partitioning
  • DefaultSegmentRowCount: 2097152
  • ProcessingTimeboxSecPerMRow: 0
  • CPU core affinity configured using Windows System Resource Manager (see John Sirman’s great guide to using WSRM with SSAS)

I ran profiler, checking the ‘Query End’ duration on a simple distinct count of CustomerKey, with no other filters or attributes involved.
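For reference, the shape of the test query, as a DAX query you could run from SSMS (table and column names are illustrative):

EVALUATE
ROW ( "Distinct Customers", DISTINCTCOUNT ( 'Fact'[CustomerKey] ) )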

[Screenshot: TabularQueryTimeByCores]

You can see that dropping from 32 cores across 2 NUMA nodes down to 16 cores on a single node had almost no impact at all.

Within a single NUMA node, the performance dramatically improved as the number of cores increased, but as soon as a second NUMA node is added, the performance flat lines, with no further significant improvement no matter how many cores are added.

As per my last post, I’m sure there are other things afoot with this server, so this behaviour may not be representative of other setups. However it again reinforces advice you will have already seen elsewhere: with SSAS Tabular, avoid NUMA hardware…

Frog-Blog out

SSAS Tabular performance – DefaultSegmentRowCount

I’m currently investigating a poorly performing Tabular model, and came across some interesting test results which seem to contradict the advice in Microsoft’s Performance Tuning of Tabular Models white paper.

Some background:

  • 7.6Gb SSAS tabular cube, running on a 2 x CPU 32 core (Xeon E5-2650 2Ghz, 2 x NUMA nodes) server with 144Gb RAM
  • SQL Server 2012 SP1 CU7 Enterprise
  • 167m rows of data in primary fact
  • 80m distinct CustomerKey values in primary fact
  • No cube partitioning

A simple distinct count in DAX of the CustomerKey, with no filtering, is taking 42 seconds on a cold cache. Far too slow for a tabular model. Hence the investigation.

p88 of the Performance Tuning of Tabular Models white paper discusses the DefaultSegmentRowCount, explaining that it defaults to 8m, and that there should be a correlation between the number of cores and the number of segments. [The number of segments is calculated as the number of rows divided by the segment size.]

It also indicates that a higher segment size may increase compression, and consequently query performance.

Calculating the number of segments for our data set (167,000,000 rows) gives us the following options:

Segment Size          # Segments
1,048,576             160
2,097,152             80
4,194,304             40
8,388,608 [default]   20
16,777,216            10
33,554,432            5
67,108,864            3

So, with 32 cores to play with, we should be looking at the default segment size (8m) or maybe reduce it to 4m to get 40 segments. But the extra compression with 16m segment size may be of benefit. So I ran some timing tests on the distinct count measure, and the results are quite interesting.

[Screenshot: DefaultSegmentRowSize]

It clearly shows that in this environment, reducing the DefaultSegmentRowCount property down to 2m improved the query performance (on a cold cache) from 42s down to 27s – a 36% improvement. As well as this, processing time was reduced, as was compression.

This setting creates 80 segments, 2.5 times the number of cores available, but achieved the best performance. Note that the server’s ProcessingTimeboxSecPerMRow setting has been set to 0 to allow for maximum compression.

There’s more to this system’s performance problems than just this (NUMA for a start), but I thought I’d throw this out there in case anyone else is blindly following the performance tuning white paper without doing their own experimentation.

Each environment, data set and server spec is different, so if you need to eke out the last ounce of performance, run your own tests on the SSAS settings and see for yourself.

Frog-Blog Out

[Update: Follow up post exploring the performance impact of NUMA on this server]

Dimension ProcessAdd in SSAS

When you have very large dimensions in SQL Server Analysis Services (SSAS) cubes, the time taken to process them can cause a problem. This post builds upon an excellent walkthrough of the ProcessAdd option written by Daniel Calbimonte on MSSQLTips.com, and shows how to automate the identification of new data to be added.

What are the main options for processing a dimension?  (Technet details here)

  • Process Full – the whole dimension is dropped and rebuilt
  • Process Update – a new copy of the dimension is processed in full and the results compared, with changes being copied to the original dimension
  • Process Add – just add new rows to the dimension, don’t do anything with existing rows

There are lots of posts around the internet that discuss the pros and cons of each of these, so I won’t go into the details. But as the title suggests, we’ll focus on the ProcessAdd option.
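As an aside before the walkthrough: a dimension ProcessAdd can also be issued directly as an XMLA command, along these lines (the IDs are illustrative). Issued like this it simply re-runs the dimension’s existing source query, so it doesn’t by itself solve the problem of identifying which rows are new – which is what the rest of this post is about:

<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Type>ProcessAdd</Type>
  <Object>
    <DatabaseID>AdventureWorksDW2012</DatabaseID>
    <DimensionID>Dim Currency</DimensionID>
  </Object>
</Process>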

How do you implement ProcessAdd?

Daniel provides a fantastic explanation and code walkthrough of how to use SSIS (Integration Services) to implement a ProcessAdd solution, please read through his post before continuing. He talks us through using an SSIS data flow to create a source query to retrieve the new rows, and then the Dimension Processing component as a destination to receive the new rows.

[Screenshot: ProcessAdd01]

And this works really well. The problem then becomes how to identify which rows need to be added. This is a problem that Daniel overcomes by having a separate table of records to be added. The source query just selects the entire table. This table could be populated by the ETL logic, a trigger, or any number of other processes. But can we improve upon this and make the process more seamless and transparent?

Getting all MDX

For me, the best way of enhancing this is to build a solution which self maintains. So, we need to look inside the cube dimension, find the last record that has already been added, and then use that to find any new records from the warehouse.

Step 1 – Find the maximum key for a dimension

 WITH MEMBER [Measures].[MaxKey] AS
   MAX([SOURCE Currency].[SOURCE Currency Code].ALLMEMBERS
   , StrToValue([SOURCE Currency].[SOURCE Currency Code].currentmember.MEMBER_KEY))
 SELECT
   {[Measures].[MaxKey]} ON 0
 FROM
   [Adventure Works]

The MDX above creates a calculated member called MaxKey, which finds the maximum Key (the surrogate key identifier for the dimension) across all members of the dimension. Note that we have to use StrToValue() to convert the key from a string to an integer so that it sorts numerically not alphabetically. If you run this in SSMS against the AdventureWorks2012DW database the result should be 105. You can verify this by querying the DimCurrency table in the warehouse and you’ll find that 105 is the largest CurrencyKey value.
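That check is a one-liner against the warehouse:

SELECT MAX(CurrencyKey) AS MaxCurrencyKey
FROM dbo.DimCurrency;  -- returns 105 in AdventureWorksDW2012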

[Screenshot: ProcessAdd02]

Step 2 – How do we use MDX within SSIS?

In the Control Flow of the SSIS package, add an ‘Execute SQL’ task. Connect it to a new OLE DB connection to Analysis Services, as per the following screenshot.

[Screenshot: ProcessAdd03]

 

Set the SQL Statement of the Data Source to the MDX query, and set the ResultSet property to ‘Single Row’.

[Screenshot: ProcessAdd04]

On the Result Set tab, set the [Measures].[MaxKey] result to be saved to a new variable, ‘User::MaxKey’ (an Int32).

[Screenshot: ProcessAdd05]

The Execute SQL task should precede the existing Data Flow Task.

[Screenshot: ProcessAdd08]

Step 3 – Dynamically create the SQL Query from the MDX results

Now that we have the maximum Key value stored in a variable, we can create another variable, called ‘SQL’ (a string), with an expression which includes the Key value.

[Screenshot: ProcessAdd06]

The expression should be

 "SELECT  CurrencyKey, CurrencyAlternateKey, CurrencyName
  FROM    DimCurrency
  WHERE   CurrencyKey>" +  (DT_WSTR,20)@[User::MaxKey]

This builds a SQL statement dynamically, based upon the maximum key identified from the MDX statement.

We can then set the source query in the Data Flow to use the resulting SQL Query, by setting its ‘Data Access Mode’ to ‘SQL Command From Variable’, and the ‘Variable Name’ to ‘User::SQL’.

[Screenshot: ProcessAdd07]

And that’s it. This process will automatically find any new dimension members in the warehouse table, and add them into the cube dimension.

This process does rely on a number of things – so watch out…!

  • The dimension must only contain inserts, not updates or deletes (you’ll need to perform a ProcessUpdate for those)
  • The Key must always be incrementing – new records with a smaller key will not get picked up!
  • You’ll need to ignore duplicate key errors during processing. Plenty of discussion around why in other blog posts…
  • SSIS is not great at handling MDX connections – unless you want to go down the linked server route, only ever have a single MDX query per SSIS package.

And finally, another shout out to Daniel Calbimonte and MSSQLTips for providing the basis of this post.

Frog-Blog-Out

 

LogParser Performance Inserting into SQL Server

Recently I’ve been using LogParser a lot to import log files, such as W3C logs from an IIS web server. The distasteful format of these files is made palatable by the awesomeness that is LogParser; it just takes care of it for you. (Check out this SQLBits session on LogParser by Jonathan Allen (Twitter | Blog) for a good intro to LogParser)

However I’ve been suffering with very poor performance with large files, so started to investigate further.

Exporting a 5Mb IIS log to a csv file was taking 0.4 seconds, whereas exporting the same log to SQL Server was taking over 15 times longer. This isn’t a problem for a small file, but processing a 0.5Gb log file is somewhat of a different matter!

Performing a trace on SQL Server during an import revealed that each record being inserted was wrapped up in its own transaction. Importing 100,000 records into SQL was generating 100,000 insert statements, and 100,000 transactions. In the trace below you can see the transactions either side of the insert (using exec sp_cursor…).

[Screenshot: Log Parser Transactions]

This was executed using the following command

  • LogParser.exe "SELECT * INTO staging.test FROM Test.log" -i:IISW3C -o:sql -server:localhost -database:Staging -cleartable:ON

 

I have to admit that I’m surprised by this – I assumed that LogParser would open up a single transaction for the bulk insert. It turns out however that you can control this behaviour using LogParser’s transactionRowCount parameter.

The ‘-transactionRowCount’ parameter defaults to 0, which auto commits every single row. It can be changed to any number you like, or to -1, which will commit all rows in a single transaction. The choice is yours.

By changing this parameter, the performance difference can be huge on a large file.

The following is an extract of the trace on the same file when executed using the following command

  • LogParser.exe "SELECT * INTO staging.test FROM Test.log" -i:IISW3C -o:sql -server:localhost -database:Staging -cleartable:ON -transactionRowCount:-1

[Screenshot: Log Parser Single Transaction]

You can clearly see that only a single transaction is opened, with all inserts batched together.

For processing large files this can make a very significant difference. For an 11Mb IIS log with just over 18k records on one of my test servers:

  • Default behaviour: 38.26 seconds
  • Single transaction: 2.92 seconds

This is a performance improvement of 92% – which you can’t grumble about.

If you’re calling LogParser from a C# script within SSIS (as I usually do) then you can set this same behaviour using the transactionRowCount property of the SQL output context object (before you call ExecuteBatch!).
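A minimal sketch of that pattern, assuming a COM reference to the LogParser type library (Interop.MSUtil) – the class names below are the ones the type library importer generates, and the property names mirror the command line parameters, so verify the exact casing against your interop assembly:

using MSUtil;

class LogImport
{
    static void Main()
    {
        // IIS W3C input, SQL Server output - the COM equivalents of -i:IISW3C -o:sql
        var input = new COMIISW3CInputContextClassClass();
        var output = new COMSQLOutputContextClassClass();
        output.server = "localhost";
        output.database = "Staging";
        output.clearTable = true;
        output.transactionRowCount = -1; // commit all rows in a single transaction

        var query = new LogQueryClassClass();
        query.ExecuteBatch("SELECT * INTO staging.test FROM Test.log", input, output);
    }
}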

 

Frog-Blog-Out
