
Capturing Insert and Update Counts from Merge

This post shows how you can capture and store the number of records inserted, updated or deleted by a T-SQL MERGE statement.

This is in response to a question on an earlier post about using Merge to load SCDs in a Data Warehouse.

You can achieve this by using the OUTPUT clause of a merge statement, including the $Action column that OUTPUT returns.

The basic syntax is:

 
INSERT INTO XXXX
SELECT [Action]
 FROM
 (
 MERGE      XXXX AS Target
      USING XXXX AS Source
         ON XXXX=XXXX
      WHEN MATCHED
         AND XXXX <> XXXX
     THEN UPDATE SET
         XXXX=XXXX
     WHEN NOT MATCHED THEN
           INSERT (
              XXXX
           ) VALUES (
              XXXX
           )
     OUTPUT $action AS [Action]
     ) MergeOutput

You wrap the MERGE statement up as a sub-query, adding the OUTPUT clause to return details about what happened in the merge. Note that you can't just SELECT from this sub-query; it has to feed an INSERT INTO statement.

One row will be returned for each row touched by the merge process.

The $action column will contain either INSERT, UPDATE or DELETE, to indicate what happened to that row.

You can also include Source.* in order to include the source column values in the output dataset.

You can also include DELETED.*, which returns the values of any updated records before they were updated, and INSERTED.* to show the values after the update. In reality the records are not deleted or inserted, just updated, but DELETED/INSERTED is used as the terminology for the old/new values either side of the update. When inserting a new record, the DELETED values will be NULL.

 ... OUTPUT $action AS [Action], Source.*, DELETED.*, INSERTED.*
     ) MergeOutput 

You can then refer to this 'MergeOutput' result set in the outer query by selecting from the sub-query.

There is a limitation though: you can't aggregate this output directly. So if we want to summarise the actions into a single row of insert, update and delete counts, we have to use a temporary table, as in the sample code below.

 
CREATE TABLE #MergeActions ([Action] VARCHAR(10)) 

 INSERT INTO #MergeActions ([Action])
 SELECT [Action]
 FROM
 (
 MERGE      [dim].[Date] AS Target
      USING [etl].[DimDate] AS Source
         ON ISNULL(Target.[DateKey],0) = ISNULL(Source.[DateKey],0)
      WHEN MATCHED
         AND (Target.[Date] <> Source.[Date]
          OR Target.[Month] <> Source.[Month]
          OR Target.[Quarter] <> Source.[Quarter]
          OR Target.[Year] <> Source.[Year]
         )
     THEN UPDATE SET
         [Date] = Source.[Date]
        ,[Month] = Source.[Month]
        ,[Quarter] = Source.[Quarter]
        ,[Year] = Source.[Year]
        ,LastUpdated = GetDate()
     WHEN NOT MATCHED THEN
           INSERT (
              [DateKey]
             ,[Date]
             ,[Month]
             ,[Quarter]
             ,[Year]
             ,LastUpdated
           ) VALUES (
              Source.[DateKey]
             ,Source.[Date]
             ,Source.[Month]
             ,Source.[Quarter]
             ,Source.[Year]
             ,GetDate()
           )
     OUTPUT $action AS [Action]
     ) MergeOutput
 ;

 SELECT
      SUM(CASE WHEN [Action]='INSERT' THEN 1 ELSE 0 END) AS InsertCount
     ,SUM(CASE WHEN [Action]='UPDATE' THEN 1 ELSE 0 END) AS UpdateCount
     ,SUM(CASE WHEN [Action]='DELETE' THEN 1 ELSE 0 END) AS DeleteCount
 FROM #MergeActions
 -- no GROUP BY needed, as we want a single summary row of counts

 DROP TABLE #MergeActions 
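If you'd rather keep these counts for ETL auditing than just return them, the final SELECT can feed a logging table instead (run it before the temp table is dropped). A minimal sketch, assuming a hypothetical [etl].[MergeLog] table – adjust the names to suit your own audit schema:

 CREATE TABLE [etl].[MergeLog]
 (
      LoadDate    DATETIME DEFAULT GETDATE()
     ,TableName   VARCHAR(128)
     ,InsertCount INT
     ,UpdateCount INT
     ,DeleteCount INT
 )

 -- Summarise the captured actions into a single audit row
 INSERT INTO [etl].[MergeLog] (TableName, InsertCount, UpdateCount, DeleteCount)
 SELECT
      '[dim].[Date]'
     ,SUM(CASE WHEN [Action]='INSERT' THEN 1 ELSE 0 END)
     ,SUM(CASE WHEN [Action]='UPDATE' THEN 1 ELSE 0 END)
     ,SUM(CASE WHEN [Action]='DELETE' THEN 1 ELSE 0 END)
 FROM #MergeActions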

</Frog-Blog Out>

Power BI Sentinel

We all know and love Power BI because of its power and ease of use; there's a good reason why it's number 1 in the Gartner Magic Quadrant. But how do we avoid 'Power BI Hell', the new version of our old nemesis 'Excel Hell'?

How do we keep track of which data sources are being used by which datasets and reports? How do we perform an impact assessment of updating or changing a dataset? What happens if someone deletes a Power BI report? There has been no easy solution to this, until now.

Step forward a new SaaS tool: Power BI Sentinel.

Sentinel is a new Software as a Service (i.e. hosted in the cloud) tool which sits alongside your PowerBI.com account, helping you manage things better. Core features include:

  • Data Lineage documentation – visibility of which reports are using which datasets, and which data sources. Plus showing which downstream datasets and reports will be impacted by deleting or changing a data source or dataset
  • Automated Backups – backing up your PBIX files directly to your own Azure storage. Allowing you to roll back, recover deleted files, and see what the report/data looked like in the past for debug/investigation
  • Automated Dataset Documentation – Generate online or PDF documentation of your Power BI datasets, the model, tables, DAX measures, calculated columns, etc.
  • Report Change Tracking – If someone’s broken a report by adding/removing a filter, changing a visual, etc. the change tracking will show you what has changed, and when, allowing you to roll back to the correct version prior to the change
Visualise data lineage from your Power BI tenant

The diagram above shows a sample Data Lineage diagram of a Power BI Dataset, showing which data sources (databases, files, websites, etc.) are used by the dataset, and which reports then consume this dataset.

Scenario 1: Power BI Governance, Compliance, GDPR

You have a data source, let's say an Oracle database, containing sensitive customer details (PII). When auditing your organisation's usage of this data, which of your 600 Power BI reports do you need to review to assess whether that PII data is exposed or not, and to whom?

The Power BI Data Lineage explorer allows you to select a data source and immediately see which datasets and reports consume data from that source, including those hidden away in users' personal 'My Workspace' folders, straight away refining the list of sensitive reports down to a handful.

Scenario 2: Power BI Disaster Recovery

Your well behaved development teams obviously follow good DevOps practice for all their development work (hopefully!), but Power BI is a self service tool, allowing your business users to create their own reports. And you’re almost certainly not going to get business users using source control and good testing practices on their reports.

Self service is a very powerful feature which significantly enhances the benefit of Power BI to your organisation. But if 'Mary from Finance' creates a report that becomes business critical and that report then gets corrupted, broken, deleted or changed, there is no recovery or rollback to the previous version of that report.

Power BI Sentinel not only gives you visibility of whether these critical reports are hidden in personal workspaces, but also performs regular automated backups of the reports, allowing you to recover and roll back if required.

Summary

Power BI is a very powerful tool, but with great power comes great responsibility.

Power BI Sentinel is a great way of providing transparency over your Power BI tenant, and protecting you if anything should ever go wrong.

Sign up for a licence at PowerBISentinel.com

SSAS Tabular Deployment Wizard fails with a Newtonsoft.Json error

The Scenario

Deploying an Analysis Services Tabular model to SSAS Azure using the Analysis Services Deployment Wizard. Both Visual Studio 2017 & SQL Server 2017 installed on the client.

Trying to click on the ellipsis to change the data source connection string or impersonation information results in a Newtonsoft.Json error:

“Could not load file or assembly ‘Newtonsoft.Json, Version 6.0.0.0, Culture=neutral, ……”

 

The Solution

If you look at the folder containing Newtonsoft.Json.dll ("C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\ManagementStudio\Extensions\Application" – substitute 140 for your version of SQL Server) and right-click > Properties on the Newtonsoft.Json.dll file, you'll probably find that it's version 10.0.3 or later, but the error message shows that the deployment wizard is looking for version 6.0.x. We therefore have to tell the SSAS Deployment Wizard to use the newer version instead.

Use Notepad(++) to open the Microsoft.AnalysisServices.Deployment.exe.config file, which you should find in “C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\ManagementStudio” (substitute 140 as above, for your correct version).

Find the section for Newtonsoft.Json and make a couple of changes, as shown in the example below (back up the file first!):

  • Add in a bindingRedirect line, redirecting 6.0.0.0 to 10.0.0.0
  • Change the codeBase version number to match the actual file version, 10.0.0.0
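After the edit, the Newtonsoft.Json entry should end up looking something like this. This is a sketch only – keep the publicKeyToken and codeBase href values that are already in your own config file:

<dependentAssembly>
  <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
  <!-- added: redirect references to the old version onto the installed version -->
  <bindingRedirect oldVersion="6.0.0.0" newVersion="10.0.0.0" />
  <!-- changed: version bumped to 10.0.0.0; leave the href exactly as it was in your file -->
  <codeBase version="10.0.0.0" href="Newtonsoft.Json.dll" />
</dependentAssembly>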

Save and close the file, relaunch the deployment wizard, and you should find it all works ok.

Frog-Blog Out

 

Lookups and If Conditions in Azure Data Factory v2 (ADFv2)

Azure Data Factory v2 (ADFv2) has some significant improvements over v1, and we now consider ADF as a viable platform for most of our cloud based projects. But things aren’t always as straightforward as they could be. I’m sure this will improve over time, but don’t let that stop you from getting started now.

This post provides a walk through of using the ‘Lookup’ and ‘If Condition’ activities to do some basic conditional logic depending on the results of a database query.

Assumptions: You already have an ADF pipeline created. If you want to hook into SSIS then you'll also need the SSIS Integration Runtime set up, although that's not needed just for the If Condition.

Scene setting: We want to execute an SSIS package, but only if the package is enabled in a config table, and in a ‘Waiting’ state. We can determine this state by calling a stored proc in a database.

Solution: Use a 'Lookup' activity to call the stored proc, with the single row output saved to a Dataset. Then an 'If Condition' activity checks the result of an expression, followed by whatever activities you want to call when the expression evaluates to True or False. The overall pipeline will look something like this:

Now we’ll look at some of the configuration of the activities. Before we create the pipeline we need to add a Dataset (actually a data source), pointing to the Azure SQL DB that holds our stored proc. Click the + to add a Factory Resource, and select Dataset, followed by Azure SQL Database.

In the ‘Connection’ tab, click ‘+ New’ to create a Linked Service, and enter the connection string details to your Azure SQL DB. You don’t need to select a table, as this dataset will just be for calling a stored proc, so leave it unselected. I’ve called my dataset ‘ComponentState’.

Now add a pipeline by clicking the + in the Factory Resources pane but this time selecting ‘Pipeline’.

Firstly add a Lookup activity, and set its Source Dataset to the one you’ve just created, and then select the stored proc from the dropdown list. Or you could write a query here instead. If you’ve selected a Stored Proc, you can click on ‘Import Parameter’ to pre-populate the parameters into the section below.

Note that the ‘First row only’ checkbox is selected by default. This is correct if you only want a single result row. If you want the full resultset then untick this. (Note that the remaining steps here would require different expressions).
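For reference, if you do untick it, the Lookup output comes back as an array under output.value rather than a single output.firstRow, so a check against the first returned row would look something along these lines (a sketch only, assuming the activity keeps its default name of Lookup1):

@{equals(activity('Lookup1').output.value[0].ComponentState,'Waiting')}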

Then add the ‘If Condition’ activity, and link the two by dragging the green box from Lookup onto the If Condition activity:


In the Settings tab, we need to specify the condition expression to use. We want to compare the ‘ComponentState’ field of the proc to the string ‘Waiting’. So we use the equals() function, although we could also use ==.

The expression is:

@{equals(activity('Lookup1').output.firstRow.ComponentState,'Waiting')}

The @{ } denotes an expression that will be evaluated. Anything outside these brackets will not get evaluated, just left as a string. The first parameter to the equals function is "activity('Lookup1').output.firstRow.ComponentState", or the ComponentState field of the first row of the output of the Lookup1 activity. We then compare this to the hard coded string 'Waiting'.

This will either return True or False, the outcomes of which we can configure in the 'Activities' tab.

In the 'If true Activities' section, just add the ADFv2 activities that you want to happen if your expression evaluates to True, and the same for False. Note that when you edit this you get a complete canvas to add as many activities as you want; I've just added one here, to call an SSIS package.

Hope that’s been useful.

<Frog-Blog Out>

Sampling data in Data Lake U-SQL for Power BI

Being able to hook Power BI directly into Azure Data Lake Storage (ADLS) is very powerful (and it will be even more so when you can link to ADLS files that are in a different Azure account – not yet available as at January 2017). However there is a problem: Data Lake is designed to scale to petabytes of data, whereas Power BI has a 10GB limit. Yes this is compressed, so we'd expect around 100GB of raw data, but to load that you'd need 100GB+ of RAM available on your PC, so it's hard to actually reach the limit with a single dataset.

There's obviously a disconnect in scalability here. In some datasets we can just use U-SQL to aggregate the data and pre-summarise by the list of fields that we actually want to analyse, and this is fine for additive transactional data. However if we need a many-to-many link or the granular details of individual rows of data then there's an issue: how do we get this data into Power BI?

The answer is sampling: we don't bring in 100% of the data, but maybe 10%, 1%, or even 0.01%, depending on how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain a level of accuracy in your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you've been given is pre-sorted by product or region? You'd end up with only data from products starting with 'a', which would give you some wildly unpredictable results.

Option 2: Take a random % sample
Now we're talking. This option will take, for example, 1 in every 100 rows of data, so it's picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?

— a) Use ROW_NUMBER() and Modulus
One option would be to include a ROW_NUMBER() windowing function in a U-SQL query that allocates each row a unique number.
ROW_NUMBER() OVER (ORDER BY id) AS rn
We then apply a modulus function to the result, and only take those rows that return a 0
WHERE rn % 100 == 0;

This filters to only 1 in every 100 rows.

This method works in T-SQL, and just as well in U-SQL.
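For reference, the T-SQL version of the same technique looks like this (a sketch, assuming a hypothetical dbo.Names table with an id column):

;WITH Sequenced AS
(
    SELECT id, name
          ,ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM dbo.Names
)
SELECT id, name
FROM Sequenced
WHERE rn % 100 = 0;   -- keep 1 in every 100 rows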

— b) U-SQL SAMPLE

However, there is an easier way. U-SQL contains the ‘SAMPLE’ clause that automates this process. Thanks to Paul (T|B) for spotting this beauty.
SAMPLE Syntax:
SELECT xx FROM xx [SAMPLE [ANY (number of rows) | UNIFORM (fraction of rows, e.g. 0.01 for 1%)]]

There are two sampling options here, ANY and UNIFORM.

After not being able to find anything on the tinterwebs about them, I ran some tests to see what they do and how well the two methods compare to each other. The following code runs some U-SQL over a simple two column csv file containing an arbitrary id and a name. The 640MB file contains 400 names, each repeated a number of times to build 40m rows. Names were repeated using a normal frequency distribution pattern to make the data more representative of real world data.

To assess the output we can look at the distribution of the sampled data to see how closely it correlates to the distribution of the original dataset.

The U-SQL code looks like this:

DECLARE @INPUT_FILE string = @"/AWPF_Demo/Data/names.csv" ;   //40601850 ROWS
DECLARE @OUTPUT_FILE string = @"/AWPF_Demo/Data/names_results.csv" ;
 
@rawdata = 
    EXTRACT 
        id INT,
        name string
    FROM @INPUT_FILE 
    USING Extractors.Text();
 
//--Allocate each row a row number, and take 1 in every 100
@sequenceddata = 
    SELECT *
        , ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM @rawdata;
 
 @sampleddata1 = 
    SELECT id, name
    FROM @sequenceddata
    WHERE rn % 100 == 0;
 
 //--Use the SAMPLE ANY clause in U-SQL
 @sampleddata2 = 
    SELECT id, name
    FROM @rawdata
    SAMPLE ANY(406018); //--manually calculated as 1% of the input row count
 
 //--Use the SAMPLE UNIFORM clause in U-SQL
 @sampleddata3 = 
    SELECT id, name
    FROM @rawdata
    SAMPLE UNIFORM(0.01); //--1%
 
//--Find the name distribution of the original data, and sampled datasets
@outputbaseline = 
    SELECT name, COUNT(*) AS Qty
    FROM @rawdata
    GROUP BY name;
 
@outputdata1 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata1
    GROUP BY name;
 
@outputdata2 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata2
    GROUP BY name;
 
@outputdata3 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata3
    GROUP BY name;
 
//--Join all datasets together for analysis
@Output = 
    SELECT b.name
        , b.Qty AS QtyOrig
        , o1.Qty AS QtyMod
        , o2.Qty AS QtyANY
        , o3.Qty AS QtyUNIFORM
    FROM @outputbaseline AS b
        LEFT JOIN @outputdata1 AS o1 ON o1.name==b.name
        LEFT JOIN @outputdata2 AS o2 ON o2.name==b.name
        LEFT JOIN @outputdata3 AS o3 ON o3.name==b.name;
 
//--Output the data
OUTPUT @Output
TO @OUTPUT_FILE
ORDER BY QtyOrig DESC
USING Outputters.Text(quoting:true);

So what happens when we take the resulting data and plot the sampled distributions against each other?

  • The Blue line is the number of times each name appears in the original dataset (on the right axis).
  • The Orange line is the distribution from the ROW_NUMBER() and Modulus.
  • The Yellow line is using U-SQL’s SAMPLE UNIFORM.
  • The Grey line is using U-SQL’s SAMPLE ANY.

As you can see, SAMPLE ANY is a terrible option for maintaining data accuracy. In effect it looks like it just takes the top x rows from the file and discards the rest, which as I explained earlier is a bad idea.
However the ROW_NUMBER/Mod and the SAMPLE UNIFORM approaches are both staggeringly close to the original, with variances of +/-2% for each name. This isn't any good for exact numerical calculations (total sales £ for example), but for looking at trends over very large datasets this sampling approach is a good option.

So, should you use ROW_NUMBER/Mod or SAMPLE UNIFORM? Obviously SAMPLE UNIFORM is simpler code, but how do they perform compared with each other?

  • The ROW_NUMBER/Mod approach, using the above dataset used a single vertex, with a total compute time of 29s, read 640MB and wrote 5KB.
  • The SAMPLE ANY approach used two vertices, with a combined compute time of 2s, read 34MB and wrote 5KB.
  • The SAMPLE UNIFORM approach used four vertices, with a combined compute time of 26s, read 766MB and wrote 5KB.

So SAMPLE ANY, although poor for data consistency, allows much faster execution by only reading a small section of the data.
The ROW_NUMBER/Mod and SAMPLE UNIFORM approaches are very comparable in terms of performance, so it wouldn't surprise me if they were doing something similar under the hood. However, for simplicity I'd recommend the SAMPLE UNIFORM method.

Frog-Blog-Out

Showing December to December to see opening values for a year in SSAS/MDX

I came across an interesting MDX challenge this week: within a cube's Date dimension, how to show December twice, once where it should be and again as the opening value for the following year. i.e. for each year I need to show Dec (prev yr), Jan, …, Nov, Dec.

Why? Well if you consider the following pivot chart, you can clearly see growth from Jan to Feb, Feb to Mar, etc., but it’s very difficult to see the growth between Dec and Jan.

So to make things easier for a user we want to repeat the Dec value and insert it before Jan as an opening value for the year. Now we can't duplicate December's data in the DW or Cube for obvious reasons, so we need to create a virtual copy of December using some MDX.

Step 1 – Create a new virtual month member called ‘Open’

CREATE MEMBER CURRENTCUBE.[Date].[Month Name].[Open] AS null;

Step 2 – Create a named set ‘Dec to Dec’ that includes Open as well as Jan-Dec

CREATE STATIC SET CURRENTCUBE.[Dec to Dec]
AS {[Date].[Month Name].[Open]
,[Date].[Month Name].[Month Name].[Jan]
:[Date].[Month Name].[Month Name].[Dec]};

Step 3 – Set the ‘Open’ member to show the value for ‘Dec’ in the previous year

SCOPE ([Date].[Month Name].[Open]);
THIS = ([Date].[Year].PREVMEMBER, [Date].[Month Name].[Dec]);
END SCOPE;

 

When you then select the ‘Dec to Dec’ named set in Excel we do indeed see 13 months per year, including our new ‘Open’ month. There are two problems however:

  1. Open appears at the end not the start. To fix this open up the ‘Field Settings’ of the ‘Dec to Dec’ field in the pivot table, select the ‘Layout and Print’ tab and untick ‘Automatically order and remove duplicates from the set’
  2. If you filter the pivot to only show a single year, Excel decides to hide the 'Open' member (it is Excel at fault, as the MDX works fine!). To fix this go to 'PivotTable Options', then on the 'Totals & Filters' tab untick 'Include filtered items in set totals'.

Hey presto, your pivot table will now look something like this:

You can now clearly see the growth in Dec to Jan, and everyone’s happy.

<Frog-Blog Out>

SQL Server 2016 Licensing Changes, SP1, v.Next CTP1, etc.

By now you've probably heard the exciting news announced at today's Microsoft Connect conference: SQL Server licensing is changing, significantly, from SQL Server 2016 SP1.

The functionality of the SQL Server Standard, Web & Express editions is being brought up to [almost, 99%] match that of Enterprise Edition. This has a number of critical impacts:

  1. Developers can write applications using a single set of SQL Server functionality, and can now make use of previously Enterprise only features without imposing the cost on their customers.
  2. The customer now decides whether to use Express, Web, Std or Ent based on scalability, not functionality.
  3. It follows the pricing model already used by most cloud services; pay for usage rather than features.
  4. This brings security to the masses – possibly one of the biggest impacts is that now we can use Row Level Security, Dynamic Data Masking and Always Encrypted in ALL editions of the engine – huge news.

There are some limitations to this though:

  1. It currently only applies to the SQL engine, not to SSIS, SSAS or SSRS – Please Microsoft, follow suit with the BI tools asap!!!
  2. Some features are not [yet] available in the Express and/or Local DB editions, including Change Data Capture, In Memory OLTP and Polybase to name a few.
  3. The trigger to move to Enterprise will now be [almost] purely driven by size and performance requirements.

As if that isn’t enough, other huge announcements include:

  • SQL Server 2016 SP1 is now available
  • New functionality in 2016 SP1, including
    • CREATE OR ALTER functionality for Stored Procs, Views, Functions, etc. – see the examples below
    • DBCC CLONEDATABASE – To easily clone a database schema without the data (love this!!)
    • NUMA support for Analysis Services Tabular – I'll be checking this out in more detail soon and will blog my findings
  • SQL Server v.Next CTP1 (Community Technology Preview) is now available, which includes SQL Server running on Linux! Note that only the engine is available on Linux at this time, not the BI tools, but watch this space: the plan is to offer matching functionality on both
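To illustrate the two SP1 engine additions mentioned in the list above, the new syntax looks like this (the object and database names here are hypothetical):

-- CREATE OR ALTER: no more IF EXISTS ... DROP ... CREATE boilerplate
CREATE OR ALTER PROCEDURE dbo.MyProc
AS
BEGIN
    SELECT 1 AS Result;
END;

-- DBCC CLONEDATABASE: schema-only copy of a database, without the data
DBCC CLONEDATABASE (MyDatabase, MyDatabase_Clone);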

 

For full details, check out the Microsoft blog post here

/Frog-Blog-Out

What is U-SQL?

By now you may have heard about U-SQL, the new kid on the query language block. But what is U-SQL? Where is it? What’s it for?

I was lucky enough to be at the 2015 MVP Summit in Redmond, at which one of the sessions was hosted by Michael Rys (@MikeDoesBigData), talking about U-SQL. As its creator, there's no-one better to learn the ropes from. I was pretty much blown away by what it can do and the ease of access, so I've spent the last couple of months playing around with it, and doing a few talks at conferences about it.

But there’s a lot of confusion over what it is, so here’s my high level summary of what it is and where it fits, so you can judge whether it’s relevant to you.
U-SQL

What Does U-SQL Look Like?

It's a hybrid of T-SQL and C#. Think of it as writing a SQL query, but being able to embed the power of C# and the .Net framework.

So you could write a query that looks like this:

@CustomersClean =
  SELECT *
     ,((CustomerGender=="M" || CustomerGender=="F") ? CustomerGender : "-") AS CustomerGender_Clean
     ,MyUSQLApp.Udfs.ValidateDate(CustomerDoB) AS CustomerDoB_Clean
  FROM @Customers;

So the basic SELECT * FROM Table syntax is familiar to anyone used to T-SQL, but with the addition of some C# to clean up the Gender and DoB columns. For the DoB we're calling a user defined C# function (ValidateDate) that sits in an assembly behind the query.
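For context, that ValidateDate function would live in the query's code-behind (or a registered assembly). The original implementation isn't shown here, but a hypothetical sketch might look like this:

using System;

namespace MyUSQLApp
{
    public static class Udfs
    {
        // Returns a cleaned-up date string, or "-" if the input can't be parsed
        public static string ValidateDate(string input)
        {
            DateTime parsed;
            return DateTime.TryParse(input, out parsed)
                ? parsed.ToString("yyyy-MM-dd")
                : "-";
        }
    }
}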

How do I use U-SQL?

Data Lake U-SQL is a part of Azure Data Lake. Data Lake is a core component in the Cortana Analytics Suite, Microsoft's cloud/Azure data platform. Think of Data Lake Storage as a massively scalable and high performance staging and permanent storage area for all of your structured, semi-structured and non-structured data, based on the Hadoop File System (HDFS). It can store both tables of data and files (csv, text, images, whatever you want to store).

Once your data is held in Data Lake Storage, you can use Data Lake Analytics to do cool stuff with it. This could include ‘Big Data’ processing and analytics, i.e. running large and complex parallel analytics queries over your data, whether that’s MB, GB, TB or even PB. Or it could be transforming and manipulating your data into different structures.

Either way, Data Lake Analytics uses U-SQL as its language. At the moment this is the only place that U-SQL exists, but given its power and flexibility I'm hoping it will also have a bright future outside of Azure Data Lake.

You can write and execute U-SQL directly in the Azure Portal, or for a much richer development environment you can install some plugins into Visual Studio 2013 or 2015. I recommend the latter.

Download the Azure SDK for .Net (2.7.2+) here

Download the Data Lake Tools for Visual Studio 2013 or 2015 here

So it’s a Big Data tool?

Yes, and no.

Yes it has the scalability, performance and functionality for Big Data, and this is what it’s designed to do. When you write a simple U-SQL script, the Data Lake Analytics engine takes the query and parallelises it, scaling it up to as many nodes as you want. So it’s scalable. Very scalable. It’s being used now for complex big data analytics jobs with TB of data, and can scale way beyond that.

But it's more than this; I also see it as a data preparation and transformation tool. If you want an Azure based BI solution, maybe using Azure SQL DW, or even a smaller scale system using Azure SQL DB, then we don't yet have a PaaS ETL tool in Azure. But now we do! This provides a very powerful and comprehensive platform for all your data transform needs.

What about Azure Data Factory (ADF)? – No, ADF isn't an ETL tool; it's an orchestration and coordination tool, and isn't realistic as a data transformation tool.

What about ELT? – Yes that's fair, to a point. But in my experience the ELT approach actually needs to be ETLT to work properly. The 2nd 'T' is taken care of inside the SQL engine, but a lot of source data has to be pre-processed before it can even be landed in a SQL Server table; U-SQL gives us that power, which T-SQL cannot provide.

Summary

I’m not going to go into any more detail here with how to use or write U-SQL, there will be more posts to come on that.

But I hope this provides some clarity on what it is and where it fits in our ever expanding data platform world.

Data Lake went into public preview in October 2015, so it’s ready and waiting for you to have a go!

 

Frog-blog-out

 

PASS Summit and MVP Summit 2015

I’m currently sitting in Seattle-Tacoma airport waiting for BA048 to take me home, and coming to terms with what an amazing couple of weeks I’ve been fortunate to have.

The PASS Summit

I had the honour of being selected to speak at the PASS Summit last week, the biggest SQL Server conference in the World, with I believe in the region of 5,500 attendees. I've been a regular speaker at lots of the European events (SQL Bits, SQL Relay, SQL Saturday, SQL Server Days, user groups, etc.) for years and I love speaking at them all. But this was different. The sheer scale of it, the enthusiasm and interaction of the audience, the chance to visit Seattle for the first time, and the parties. Wow the parties. It's going to take me some time to sober up after this.

If you want to get a feel for the size, this is a photo of the keynote on day 1. It's big.

What's important to note is that during the day the focus is on learning, and there are a very large number of sessions to choose from. At night it's a chance to network, to meet old friends and new, to talk shop but also to get to know people better. Don't underestimate the value in going to the parties.


My session went very well, with some great feedback and questions from the audience, thanks to all those that came and were part of my first Summit talk. I hope you left with a renewed love for MDX, and I hope you enjoyed it even half as much as I did – I had a blast!

The MVP Summit

On October 1st, something awesome happened – I was made a SQL Server MVP. Having spent my professional life looking up to and learning from the amazing group of MVPs we have in the UK and worldwide, it was a thrilling, humbling and daunting prospect to be invited to join their lofty ranks. I’ve been meaning to blog about it since then but haven’t been able to articulate or even understand what this actually means, apart from a cool blue badge.

All of that changed this week at the MVP Summit in Redmond. As I was already heading out to Seattle for the PASS Summit, it was easy to extend my stay by a week, and boy am I glad that I did.

I can’t talk about the technical content due to NDAs etc., but I will do my best to describe what it’s like.

Do you remember the first time you met a SQL Superstar? I do, it was at SQL Bits and I bumped into Jamie Thompson of SSIS guru fame. I didn’t even recognise him as he didn’t resemble his twitter profile pic at the time, but after a short chat I carried on up the stairs with a huge grin on my face, only to walk into a room with Allan Mitchell, Simon Sabin and countless others. I sat there for a while just soaking it all up.

Walking into the MVP Summit this week brought it all back, a thousand times over. I’m sat in a room next to Ola Hallengren, Paul Randel, Kimberly Tripp, Stacia Misner, Itzik Ben-gan, Brent Ozar, JRJ, Kevin Kline (and the list goes on and on and on). Listening to the likes of Joseph Sirosh, Shawn Bice, Mark Souza, etc. – the decision makers of the whole data platform. Then technical sessions from the likes of Michael Rys (U-SQL inventor) and Kasper de Jonge (SSAS), etc.. Not only are they presenting, but they’re asking; asking for input, asking for feedback, asking for details of what we see in the field, what we need in the product suite. I really felt that the whole team opened up and wanted a two way discussion. There’s a reason that SQL Server now tops the Gartner magic quadrant, and listening to these folks talk it’s easy to understand why; their vision and passion is inspiring.

And then, yes, more beers in the evening to continue the discussions.

More discussions on the product suite, discussions on sport, life, beer, cars, etc. etc. Getting to know each other better, whilst continuously learning. More and more learning.

I have to say a huge thank you to the whole product team for giving us so much time, and putting up with all our questions. It was an invaluable experience that I’ll never forget.

Also a big thanks to the other MVPs at the summit for being so welcoming. Especially Chris Testa-O’Neill, for being my guide for the week and helping it feel less daunting!

But the biggest thank you must go to Jennifer Moser for coordinating and planning it all so well. Not just the summit, but the activities around it; the scheduling, the transport, the pub crawl, access to the product group, and of course the inaugural MVPs Vs Microsoft American Football match.

Yes that’s right. On the Saturday before the summit, we braved the elements and headed to a local sports field for a game of flag football. Or at least ‘intentional’ flag football, but Mark Souza’s two broken ribs (and Simon Sabin’s subsequent reallocation to the Access MVP programme) are evidence that it was not entirely as contact-less as it should have been!

Mark’s team well and truly beat us, but we put up a valiant fight. Secretly we think that Mark had employed a few college football pros onto the SQL Server team just for this match. I couldn’t walk for a few days after, but we all had an awesome time.

What next?

My brain is well and truly fried. I need time to digest it all, and then I’ll be writing a few blog posts about what I can share.

But for now I’m just hoping to get some much needed sleep on the overnight flight back to Heathrow, and really looking forward to seeing my wife, Hollie, who has been amazing in supporting me and enabling me to get out here. Thanks Hol x.

So how would I sum it all up? I’m tired, drained, jet-lagged, brain-fried, hungover and my feet are falling off. But I loved every second of it.

Frog-Blog-Out

MDX Compare or Rank Similar Members

Or should this be called: “Recreating DAX’s EARLIER function in MDX” – either way, a useful technique that solves a problem…

MDX makes it very easy for us to compare one member against others, using functions such as RANK() etc. But how do we dynamically compare a member against a subset of other members? I came across a customer requirement recently where we had to rank a member against other members that had similar properties, based on other measures. This will probably make more sense with an example…

Taking AdventureWorks, we can easily calculate the sales rank of each product using something like this:

WITH MEMBER [Measures].[Rank] AS
   RANK([Product].[Product].CURRENTMEMBER
       ,[Product].[Product].[Product].MEMBERS
       ,[Measures].[Internet Sales Amount])
SELECT {[Measures].[Internet Sales Amount], [Measures].[Rank]} ON 0
      ,NONEMPTY({[Product].[Product].[Product].MEMBERS}
                ,[Measures].[Internet Sales Amount]) ON 1
FROM [Adventure Works]
WHERE [Date].[Calendar Year].&[2008]


This ranks each product against all other products. But is that a fair comparison? Comparing sales of a water bottle against a top of the range racing bike. Not really. So how about we instead rank each product against all products within +/-20% of the same cost. So if the water bottle costs £20, then we would rank its sales against all products with a cost between £16 and £24. This gives a more accurate idea of how well each product is performing compared to its peers.

Although to keep the MDX simple here, let’s just say any product within £20.

In AdventureWorks we have [Measures].[Internet Average Unit Price], which can be used to determine comparable products. So how do we go about achieving this?

If we look at the RANK function, it takes three parameters; the member being ranked, the set over which to rank, and the measure to be used.

All we have to do is filter the second parameter, the set over which to rank, to include similar members. So maybe something like this:

WITH MEMBER [Measures].[Rank] AS
   RANK([Product].[Product].CURRENTMEMBER
       ,FILTER([Product].[Product].[Product].MEMBERS
              ,ABS([Measures].[Internet Average Unit Price]
               -([Measures].[Internet Average Unit Price]
                ,[Product].[Product].CURRENTMEMBER))
               <=20
               )
       ,[Measures].[Internet Sales Amount])

If we break this down, we're just changing the 2nd parameter to be a filtered set, where the unit price is within £20 of the unit price of the current member. This should work, right?

Unfortunately, wrong. The results look exactly the same as the original rank – nothing has changed.


The problem here is that CURRENTMEMBER is within the filter function, so it changes context to refer to whatever row is being considered at the time by the filter function. So [Measures].[Internet Average Unit Price] and ([Measures].[Internet Average Unit Price],[Product].[Product].CURRENTMEMBER) are always the same product, and no rows are filtered out. CURRENTMEMBER does NOT refer to the current member being considered by the RANK function, but by the FILTER function.

In DAX we have the EARLIER and EARLIEST functions, which would be great here, and would allow us to step out of the current context into the previous calculation layer. But unfortunately we haven’t been blessed with an MDX EARLIER function. So how do we fix this in MDX?

The trick here is dynamic sets, using the STRTOSET function. This allows us to grab the member being ranked, and treat it as a fixed member within the FILTER function.

WITH MEMBER [Measures].[Rank] AS
   RANK([Product].[Product].CURRENTMEMBER
       ,STRTOSET('
          FILTER([Product].[Product].[Product].MEMBERS
                ,ABS([Measures].[Internet Average Unit Price]
                 -' + CSTR([Measures].[Internet Average Unit Price]) + ')
                 <=20
                )'
       )
       ,[Measures].[Internet Sales Amount])

We're building up a string which will fix the value of [Measures].[Internet Average Unit Price] to that of the product being ranked, and will then dynamically compare it to the value of [Measures].[Internet Average Unit Price] for all other products. Those within £20 will be included in the resulting set, and will be used to rank the original product.

You can see the result in the screenshot above, where the ranking is dependent on the average unit price.

Frog-Blog Out
