
My experience in obtaining an MCSE

Background

I have recently completed my MCSE in Data Management and Analytics, and I wanted to share my experience of working towards and passing the exams that led to this MCSE. In this post I will cover some of the challenges I faced and provide some tips and advice; hopefully others following the same certification path, or other paths, will find this useful.

I am a business intelligence developer at Purple Frog with several years of experience working with data. I started off as a data analyst, then moved into reporting, and more recently have been working on ETL, data warehousing and cubes. I have been working with SQL throughout my various roles, so the best place for me to start my certification path was with exam 70-761: Querying Data with Transact-SQL. The path I’ve taken up to this point is:

MCSA: SQL 2016 Database Development
– 70-761: Querying Data with Transact-SQL
– 70-762: Developing SQL Databases

MCSE: Data Management & Analytics
– 70-767: Implementing a SQL Data Warehouse (from the Elective Exam Pool)

The learning material

Although the first exam (70-761) fitted in quite well with my SQL background (I probably knew about 75% of the material beforehand), there was still some work required to get to the stage where I felt confident of passing the exam. For me, the best and primary resource for learning the material has been the Exam Ref books, for example “Exam Ref 70-761 Querying Data with Transact-SQL”. These books are structured around the sections covered by the exam, with a chapter for every skill measured.

The one downside to the Exam Ref books is that they can feel quite wordy if you’re relying on the book alone, so what I found really useful was to supplement them with videos and demos on the topics where I needed a greater understanding. In addition to this, practice exercises helped me to further understand the different concepts, as I was able to try out what I had learnt and see where I was going wrong.

The final resource that I found useful was Microsoft Docs (https://docs.microsoft.com/en-us/). It is a really good reference point for key facts; for example, I found the page on Change Data Capture (CDC) really useful for my latest exam (70-767).

The exam

There are the obvious tips such as sleep early the night before, get to the exam centre with time to spare and so on, but I wanted to share some of the exam techniques I found useful while doing these exams.

My top tip is to check that you have covered, and are comfortable with, all the skills measured in the exam. The skills measured for each exam can be found in the “Browse Certifications and Exams” section of the Microsoft Learning website (example shown below for exam 70-761). The skills are also stated in the Exam Ref books and, as mentioned before, the chapters in each book are based on the skills measured in the exam.

[Skills measured screenshot, taken from https://docs.microsoft.com/en-us/learn/certifications/exams/70-761]

What’s useful about the skills measured list is that it shows the weighting of questions per skill in the exam, so you can work out whether a weaker area needs extra focus because it makes up a big part of the exam.

Time shouldn’t be an issue in the exam if you’ve prepared well; however, some questions are not worded in the best way and can catch you out, so do take the time to read each question properly, and keep an eye on the time remaining every 5-10 questions.

You have the option to flag questions and review them again later (note that some questions cannot be flagged); make use of these flags for questions you are unsure of. This can be particularly useful if a later question gives you a clue or reminds you of the answer to a question you flagged earlier. Alternatively, you should be provided with a pen and a wipeable board for notes, so note down the question number and topic so that you can come back to it later.

Next steps

I am currently studying towards exam 70-768: Developing SQL Data Models. This will develop my understanding and knowledge of data modelling and working with cubes, and will also earn me the MCSA: SQL 2016 BI Development certification. With these current certifications being retired in the near future, the plan after that is to work towards the Microsoft Certified: Azure Data Engineer Associate certification.

I hope you have found this blog useful and that you can use some of these tips in your study plans. All the best!

SQL LAST_VALUE() wrong results

You may have come across the following scenario when using the LAST_VALUE() function.

You want to find the first or last value in a result set, so you use the FIRST_VALUE() and LAST_VALUE() functions, like below:
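A minimal sketch, using a hypothetical #Letters table holding the values A to E (the table name, Id column and sample values are made up for illustration):

-- Hypothetical example table and data
CREATE TABLE #Letters (Id INT, Letter CHAR(1));
INSERT INTO #Letters (Id, Letter)
VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E');

-- FIRST_VALUE and LAST_VALUE with only an ORDER BY in the OVER clause
SELECT Id,
       Letter,
       FIRST_VALUE(Letter) OVER (ORDER BY Id) AS FirstValue,
       LAST_VALUE(Letter)  OVER (ORDER BY Id) AS LastValue
FROM #Letters;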

You expect to get “A” for every record in the FirstValue column and “E” for every record in the LastValue column.
However, you see the following output:
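With the hypothetical #Letters table above, the output looks like this:

Id  Letter  FirstValue  LastValue
1   A       A           A
2   B       A           B
3   C       A           C
4   D       A           D
5   E       A           E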

This happens because the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so LAST_VALUE() on its own only looks from the first row up to the current row. For example:

  • Row 1 – The only value in scope is A.
    Hence, FIRST_VALUE() & LAST_VALUE() both return A.
  • Row 2 – The values now in scope are A & B.
    Hence, FIRST_VALUE() returns A and LAST_VALUE() returns B.
  • Row 3 – The values now in scope are A, B, C.
    Hence, FIRST_VALUE() returns A and LAST_VALUE() returns C.
  • And so on…

To return the actual last value, add the additional clause ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. This ensures that for each row the LAST_VALUE() function looks at all rows, from the very first row to the very last row. Now you have the following code (the original code with an extra column using the additional clause):
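Continuing with the hypothetical #Letters table:

SELECT Id,
       Letter,
       FIRST_VALUE(Letter) OVER (ORDER BY Id) AS FirstValue,
       LAST_VALUE(Letter)  OVER (ORDER BY Id) AS LastValue,
       -- Explicit frame so the window covers the whole result set
       LAST_VALUE(Letter)  OVER (ORDER BY Id
                                 ROWS BETWEEN UNBOUNDED PRECEDING
                                          AND UNBOUNDED FOLLOWING) AS LastValue2
FROM #Letters;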

This gives you the following output, where LastValue2 shows the true last value for the result set:
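Again using the hypothetical #Letters table:

Id  Letter  FirstValue  LastValue  LastValue2
1   A       A           A          E
2   B       A           B          E
3   C       A           C          E
4   D       A           D          E
5   E       A           E          E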

SQL Server 2016 Licensing Changes, SP1, v.Next CTP1, etc.

By now you’ve probably heard the exciting news announced at today’s Microsoft Connect conference: SQL Server licensing is changing, significantly, from SQL 2016 SP1.

The functionality of the SQL Server Standard, Web & Express editions is being brought up to [almost, 99%] match that of Enterprise Edition. This has a number of critical impacts:

  1. Developers can write applications using a single set of SQL Server functionality, and can now make use of previously Enterprise only features without imposing the cost on their customers.
  2. The customer now decides whether to use Express, Web, Std or Ent based on scalability, not functionality.
  3. It follows the pricing model already used by most cloud services; pay for usage rather than features.
  4. This brings security to the masses – possibly one of the biggest impacts is that now we can use Row Level Security, Dynamic Data Masking and Always Encrypted in ALL editions of the engine – huge news.

There are some limitations to this though:

  1. It currently only applies to the SQL engine, not to SSIS, SSAS or SSRS – Please Microsoft, follow suit with the BI tools asap!!!
  2. Some features are not [yet] available in the Express and/or Local DB editions, including Change Data Capture, In Memory OLTP and Polybase to name a few.
  3. The trigger to move to Enterprise will now be [almost] purely driven by size and performance requirements.

As if that isn’t enough, other huge announcements include:

  • SQL Server 2016 SP1 is now available
  • New functionality in 2016 SP1, including
    • CREATE OR ALTER functionality for Stored Procs, Views, Functions, etc.
    • DBCC CLONEDATABASE – To easily clone a database schema without the data (love this!!) – see the quick sketch after this list
    • NUMA support for Analysis Services Tabular – I’ll be checking this out in more detail soon and will blog my findings
  • SQL Server v.Next CTP1 (Community Technology Preview) is now available, which includes SQL Server running on Linux! Note that only the engine is available on Linux at this time, not the BI tools, but watch this space; the plan is to offer matching functionality on both
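As a quick sketch of the two T-SQL additions mentioned in the list above (the procedure, table and database names are made up for illustration):

-- CREATE OR ALTER removes the need for IF EXISTS ... DROP boilerplate on procs, views and functions
CREATE OR ALTER PROCEDURE dbo.usp_GetRowCount
AS
BEGIN
    SELECT COUNT(*) AS RowsInTable FROM dbo.MyTable;
END;
GO

-- DBCC CLONEDATABASE creates a schema-and-statistics-only copy of a database, with no data
DBCC CLONEDATABASE (Staging, Staging_Clone);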

 

For full details, check out the Microsoft blog post here

/Frog-Blog-Out

Excel doesn’t open CSV files correctly from sp_send_dbmail

A nice little nugget today for anyone who uses SQL dbmail to send data in csv attachments.

Sometimes, depending on your language settings, CSV files don’t open correctly in Excel. Excel ignores the ‘,’ and puts all data in column A. You then have to go through the hassle of closing the file and opening it through the flat file wizard.

 


There’s a very nice solution to this that I found in this thread. It’s to insert a new row at the top of the file containing “sep=,”. This forces Excel to understand that it is a comma delimited file, and ensures that it will open correctly.
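For illustration, with made-up column names and values, the top of the CSV file would then look something like this:

sep=,
Column1,Column2,Column3
1,abc,def
2,ghi,jkl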

The only problem: how do you insert a header line when using dbmail?…

[Caution – ugly hack alert…]

You can alter the name of the first column to include this header text. We simply rename “Column1” to “sep=,{CR}{LF}Column1”. Then when dbmail prints out the column headers in the file, the Column1 name will be split over two lines, preceded by “sep=,”.


Excel treats this first row as an instruction and does not display it; it just uses it to make sure the data is formatted correctly.

So how do we do this in SQL?

DECLARE @Column1Name VARCHAR(255)
DECLARE @Query VARCHAR(2048)
SET @Column1Name = '[sep=,' + CHAR(13) + CHAR(10) + 'Column1]'
SET @Query = 'SELECT Column1 AS ' + @Column1Name + ', Column2, Column3 FROM myTable'

EXEC msdb.dbo.sp_send_dbmail
@profile_name='XXX'
,[Other parameters as required]
,@query=@Query
,@attach_query_result_as_file=1
,@query_attachment_filename='QueryResults.csv'
,@query_result_separator=',' --enforce csv
,@query_result_no_padding=1 --trim
,@query_result_width=32767  --stop wordwrap

And there you go. It’s a bit of a hack, but works well.

Frog-Blog-Out

Analysis Services Tabular or Multidimensional? A performance comparison

Can SSAS Multidimensional be faster than SSAS Tabular for distinct counts on large datasets?

We’ve all seen how fast the SSAS Tabular model can work – it’s an in-memory, heavily compressed analytics database. However you may have seen my previous posts on Tabular and NUMA, which show that at the moment Tabular is limited to a single CPU. This obviously limits its scalability and maximum performance.

The situation

A large project I’m working on, for a multinational telecoms provider, has a problem with the query performance of its current Tabular model. It’s fine for smaller datasets but does not scale well, specifically with distinct count operations. The distinct count is the single biggest critical factor for performance in this project. With in the region of 50m distinct customers, this is a sizeable problem. After working with Microsoft to investigate the Tabular performance we identified a bug in SSAS, which has been resolved in SQL 2012 SP1 CU9, but it still doesn’t run fast enough on the limited hardware.

The Multidimensional model is NUMA aware (although how well it is supported depends on your workload), so it can make better use of the hardware, but we all know how poorly MOLAP scales with distinct counts due to the storage engine IO bottleneck. So what is the answer?

What about Solid State?

The reason the multidimensional model often experiences performance degradation for distinct count operations is the heavy disk IO required. Aggregations can’t be used effectively, so the SSAS storage engine has to work through the entire dataset on disk. With the speed of normal disks, especially when on a SAN, this is not great.

There are now a number of solid state storage options available that provide enterprise grade storage with some very impressive performance.

So the question is, can a multidimensional cube running on solid state storage outperform an equivalent in-memory Tabular cube?

The tests

I ran various query performance tests with (MOLAP) cubes from 0.5Tb to 2Tb. The results in this post focus entirely on the 0.5Tb tests, with an identical dataset and equivalent test queries on a Tabular cube.

A big thank you to the following:

  • Millicom – for the test dataset, and for allowing me to publish the tests results here
  • Violin – for the use of a storage array for the test
  • Microsoft – for the loan of a server in the MTC (Microsoft Technology Center) in Reading.

The equipment for the multidimensional cube:

  • HP DL580
  • 4 x Intel Xeon E7-4850 2Ghz
  • 40 cores, hyperthreaded to 80
  • 256Gb RAM, 1067Mhz
  • Violin 6616 SLC, 1m IOPS, 4Gbps
  • SQL Server 2012 SP1 CU9
  • Distinct count measure groups matrix partitioned by month, and 60 non-overlapping buckets of CustomerKey
  • Basic usage based optimisation applied to aggregations. There is a high likelihood this could be improved further

The equipment for the Tabular cube:

Note that the different number of CPUs is fair; Tabular is not able to benefit from multiple CPUs, whereas Multidimensional is. The factors that are unequal are the threads per CPU (16 vs 20) and memory speed (666Mhz vs 1067Mhz), so please bear this in mind when looking at the performance comparison results.

The dataset is 6 monthly snapshots of 60m customers in two primary fact tables with 300m and 1.4bn rows in each.

The 7 query tests cover a range of workloads from very simple to very complex. Queries 1&2 do not include distinct count measures, all other queries include a distinct customer count from one of the primary fact tables, with varying levels of complexity, granularity and result set size. Tests 1-4 are manually created MDX queries, 5-7 are real-world Excel workbooks provided by users, each containing 1-6 large pivot tables.

Each test was run on a cold SSAS cache, with the Windows file cache also cleared before every test. Every test was run 3 times, with the average taken.

Test scenarios:

  • Tabular, using the fastest performance measured from all previous configuration tests
  • Multidimensional, using all CPUs and local Raid 5 disk storage
  • Multidimensional, using a single CPU (single NUMA node) and Violin storage
  • Multidimensional, using all CPUs and Violin storage

The Results

[Results table: query timings for each test scenario]

And when plotted on a chart…

[Chart: query timings for each test scenario]

The first 3 tests are so quick that they don’t show up on the above chart, but you can see these results in the table above.

Conclusion

It’s clear from these tests that solid state storage can breathe new life into an IO constrained multidimensional cube. This really applies to any case where you’re making heavy use of distinct counts over a large, high cardinality dataset. In this case the average performance gain over the same disk based cube was 2.6x.

Multidimensional on solid state storage can be on average 1.9x faster than the Tabular model, when working with large high cardinality distinct count operations.

Although other research and evidence suggests that Multidimensional cubes don’t perform well with multiple NUMA nodes, these tests show that if your workload is very heavily constrained by the storage engine, you can gain significant performance from multiple CPUs/NUMA nodes.

And Finally…

As well as basic timing tests using Profiler, I also monitored the CPU and disk throughput in Perfmon. The following two screenshots show the difference between the multidimensional cube on disk and then the same cube on Violin. This is for test 4, which showed the most significant performance improvement from disk to solid state.

[Perfmon trace: multidimensional cube on local disk]

[Perfmon trace: multidimensional cube on the Violin array]

These traces show test 4, which has the following query structure:

SELECT {   [Measures].[Fully Additive Measure 1]
         , [Measures].[Fully Additive Measure 2]
         , [Measures].[Fully Additive Measure 3]
         , [Measures].[Distinct Measure 4]} ON COLUMNS ,
NON EMPTY {[Dimension 1].[Attribute 1].[Attribute 1].MEMBERS  --6 members
         * [Dimension 2].[Attribute 2].[Attribute 2].MEMBERS  --12 members
         * [Dimension 3].[Attribute 3].[Attribute 3].MEMBERS  --5 members
         * DESCENDANTS ([Date].[Calendar].[Year].&[2013], [Date].[Calendar].[Month])} ON ROWS  --6 members
FROM [Cube]

This returns 700 distinct tuples from the crossjoin, with an average of 400k distinct customers per tuple.

This shows the potential benefit of maximising the storage IO throughput. In the first image, the IO throughput being constrained to under 100Mbps means that only 3% of the CPU can be utilised. In the second trace, by providing up to 1.7Gb per second throughput, the CPU is allowed to work at an average of 60% for the duration of the query, with a clear impact on the query performance.

These charts also show the reduced IO latency (IO response time). The average IO response time in the first trace is between 40-60ms for the disk IO, whereas it barely registers on the scale of the chart for the Violin array. I found the IO response for the Violin tests averaged around 450μsec. The SLC unit runs at ~200μsec latency at the 4k chunk size, so the 8k chunks we’re asking for fully complete within the 450μsec that we saw.

There are plenty more details and tests that I’m running; I may post further details if I get time. Until then…

Frog-Blog-Out

Update #1 06/06/2014:
In the interests of completeness and fairness, I subsequently re-ran the Tabular model performance tests on the same (Microsoft supplied) server as the MOLAP model. The only difference was that, to maximise performance, I restricted the server to a single CPU/NUMA node.
Somewhat surprisingly, the performance difference between Tabular and MOLAP/Violin was even greater. Multidimensional performed 3.7x faster, compared with 1.9x faster on the original Tabular server.

Update #2 06/06/2014:
If you’re planning on implementing a large scale Tabular model, do take the above tests into consideration; however, please be aware that there are reasons why the Tabular performance is poor here: the distinct count field has incredibly high cardinality, and the data can’t be split up, remodelled or simplified. This will not apply to all data scenarios.
Please take a look at a recent white paper by Alberto Ferrari at SQLBI, in which he discusses a number of ways of optimising large distinct count data sets to suit the Tabular model. He’s achieved some very impressive results. Unfortunately these methods would not benefit the scenario described in this post, but it shows what can be done with Tabular if the data suits.
http://www.sqlbi.com/articles/using-tabular-models-in-a-large-scale-commercial-solution

LogParser Performance Inserting into SQL Server

Recently I’ve been using LogParser a lot to import log files, such as W3C logs from an IIS web server. The distasteful format of these files is made palatable by the awesomeness that is LogParser; it just takes care of it for you. (Check out this SQLBits session on LogParser by Jonathan Allen (Twitter | Blog) for a good intro to LogParser)

However I’ve been suffering with very poor performance with large files, so started to investigate further.

Exporting a 5Mb IIS log to a csv file was taking 0.4 seconds, whereas exporting the same log to SQL Server was taking over 15 times longer. This isn’t a problem for a small file, but processing a 0.5Gb log file is somewhat of a different matter!

Performing a trace on SQL Server during an import revealed that each record being inserted was wrapped up in its own transaction. Importing 100,000 records into SQL was generating 100,000 insert statements, and 100,000 transactions. In the trace below you can see the transactions either side of the insert (using exec sp_cursor…).

Log Parser Transactions

This was executed using the following command

  • LogParser.exe "SELECT * INTO staging.test FROM Test.log" -i:IISW3C -o:sql -server:localhost -database:Staging -cleartable:ON

 

I have to admit that I’m surprised by this – I assumed that LogParser would open up a single transaction for the bulk insert. It turns out however that you can control this behaviour using LogParser’s transactionRowCount parameter.

The ‘-transactionRowCount’ parameter defaults to 0, which auto commits every single row. It can be changed to any number you like, or to -1, which will commit all rows in a single transaction. The choice is yours.

By changing this parameter, the performance difference can be huge on a large file.

The following is an extract of the trace on the same file when executed using the following command

  • LogParser.exe "SELECT * INTO staging.test FROM Test.log" -i:IISW3C -o:sql -server:localhost -database:Staging -cleartable:ON -transactionRowCount:-1

Log Parser Single Transaction

You can clearly see that only a single transaction is opened, with all inserts batched together.

For processing large files this can make a very significant difference. For an 11Mb IIS log with just over 18k records on one of my test servers:

  • Default behaviour: 38.26 seconds
  • Single transaction: 2.92 seconds

This is a performance improvement of 92% – which you can’t grumble about.

If you’re calling LogParser from C# script within SSIS (as I usually do) then you can set this same behaviour using the transactionRowCount property of the LogQueryClassClass object (before you call ExecuteBatch!).

 

Frog-Blog-Out

Database Management PhD survey

If you’re involved in database management, can you spare 20 minutes to complete a survey?

Victoria Holt is studying for her PhD, and would very much appreciate it if you could help her out with her survey.

Go on… Fill it in whilst you have a break over a cup of tea & a mince pie…

Fill in the survey here

On behalf of Victoria, thanks

Rock Paper Scissors Lizard Spock

(Image courtesy of GraphJam.com)

I want to share with you some genius level work by Phil Quinn (blog | twitter), who is pushing the boundaries of SQL Server.

 

Phil has taken the spatial drawing code from an earlier Frog-blog and used it to create a fully functional Rock Paper Scissors Lizard Spock game, just using the power of SQL Server. Not even the brightest boffins at Microsoft could ever have imagined that SQL Server would be able to provide such an awesome service to the World.

 

Check out Phil’s post and download the query here

 

If you want to hear Phil speak, he’s doing a talk for SQL Midlands on 22nd November in Birmingham. His talk will be on using XML in SQL Server, but I’m hoping that we’ll also be treated to a bit of Rock Paper Scissors Lizard Spock action…

 

 

Frog-Blog-Out

MSc Dissertation – Performance of Loading SCDs in SSIS

Well, after 3.5 years, I’ve finally completed my MSc Business Intelligence – hoorah! And to reward the time, effort and increased grey hair, they saw fit to give me a merit as well.

During the last year I’ve been writing a thesis investigating the performance characteristics of loading data into data warehouse dimensions, specifically loading Type 2 SCDs using SSIS.

For those who have followed the previous posts and my conference talks on using T-SQL Merge for the purpose, you won’t be surprised at the direction of the dissertation, but it provides a useful performance comparison between T-SQL Merge, SSIS Merge Join, SSIS Lookup and the SSIS SCD Wizard.

I won’t go into the full details here of the project or results, but will show a couple of the summary charts which are of most interest. You can download the full project here:

The charts below show the duration taken for the Lookup, Merge and Merge-Join methods (SCD Wizard excluded for obvious reasons!).
The top chart shows the performance on a Raid 10 array of traditional hard disks.
The second chart shows the same tests run on a Fusion IO NAND flash card.

The charts clearly show that the Lookup method is the least favoured. Of the other two, Merge is [just] preferred when using solid state, although statistically they are equivalent. On HDDs, Merge and Merge-Join are equivalent until you’re loading 2-3m rows per batch, at which point Merge-Join becomes the preferred option.

Full test results and analysis in the PDF download above.

My previous few posts show how using a T-SQL approach like Merge can provide huge development benefits by automating the code. This research now shows that unless you’re loading very large data volumes the performance is equivalent to more traditional approaches.

Hope this is of use. If you want to know a bit more without reading the full 99 pages & 23k words (who could blame you?!), then my SQLBits talk video is now on-line here. This talk is slightly out of date as it was presented before I’d finished the research and analysis, but it’s largely accurate. I presented a more up to date version on a webinar for the PASS Virtual BI chapter. The recording isn’t currently available [When this post was written] but should be up soon. Keep checking on the BI PASS Chapter website.

Frog-Blog-Out

SQLBits X Video available

The video of my talk at SQLBits X is now available on the SQLBits website here. The talk was focused on presenting the results of my MSc Business Intelligence dissertation, comparing the performance of different methods of using SSIS to load data warehouse dimensions, specifically type 2 SCDs.

The talk also covers a comparison of the performance between traditional hard disks and solid state storage systems such as Fusion IO.

I then present a method of using the T-SQL Merge statement to automate a significant part of the ETL process.

You can find the code behind the demos on various recent Frog-Blog posts, and there is more to come, so look back soon!

 

PASS BI Virtual Chapter Talk

For those that would rather hear the talk presented live, or want to ask any questions, please join me at a repeat of this talk over Live Meeting for the PASS BI Virtual Chapter on Thursday 28th June, at 5pm UK time, 12pm EDT (US). You can find the details on the PASS BI chapter website here
