

Sampling data in Data Lake U-SQL for Power BI

Being able to hook Power BI directly into Azure Data Lake Storage (ADLS) is very powerful (and it will be even more so when you can link to ADLS files in a different Azure account! – not yet available as at January 2017). However, there's a problem: Data Lake is designed to scale to petabytes of data, whereas Power BI has a 10GB limit per dataset. Yes, that 10GB is compressed, so it equates to roughly 100GB of raw data, but to load that much you need 100GB+ of RAM available on your PC, so in practice it's hard to even reach the limit with a single dataset.

There's obviously a disconnect in scalability here. For some datasets we can just use U-SQL to aggregate the data, pre-summarising by the fields we actually want to analyse, and this is fine for additive transactional data. However, if we need a many-to-many link or the granular detail of individual rows then there's an issue: how do we get this data into Power BI?
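
For example, a pre-summarising query might look something like this (a minimal sketch; the file paths, column names and grain are all assumptions, not from a real dataset):

DECLARE @INPUT_FILE string = @"/MyLake/Data/sales.csv";          //--hypothetical source file
DECLARE @OUTPUT_FILE string = @"/MyLake/Data/sales_summary.csv"; //--summary for Power BI

@sales =
    EXTRACT ProductKey int,
            RegionKey int,
            SalesAmount decimal
    FROM @INPUT_FILE
    USING Extractors.Csv();

//--Pre-summarise to one row per product and region before loading into Power BI
@summary =
    SELECT ProductKey,
           RegionKey,
           SUM(SalesAmount) AS TotalSales,
           COUNT(*) AS TransactionCount
    FROM @sales
    GROUP BY ProductKey, RegionKey;

OUTPUT @summary
TO @OUTPUT_FILE
USING Outputters.Csv();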

The answer is sampling: we don't bring in 100% of the data, but maybe 10%, 1%, or even 0.01%, depending on how much you need to reduce your dataset. It is, however, critical to sample the data correctly in order to maintain the accuracy of your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you've been given is pre-sorted by product or region? You'd end up with only data from products starting with 'a', which would give you some wildly unpredictable results.
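
For illustration only, this is roughly what that anti-pattern looks like in U-SQL (a sketch assuming the @rawdata rowset used in the script later in this post); it simply truncates the ordered rowset:

//--Anti-pattern: take the first n rows and discard the rest
@topdata =
    SELECT id, name
    FROM @rawdata
    ORDER BY id
    FETCH 406018 ROWS;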

Option 2: Take a random % sample
Now we're talking. This option will take, for example, 1 in every 100 rows of data, so it picks up an even distribution throughout the dataset. This seems a much better option, so how do we do it?

— a) Use ROW_NUMBER() and Modulus
One option would be to include a ROW_NUMBER() windowing function in a U-SQL query that allocates each row a unique number.
ROW_NUMBER() OVER (ORDER BY id) AS rn
We then apply a modulus function to the result, and only keep the rows where it returns 0:
WHERE rn % 100 == 0;

This filters to only 1 in every 100 rows.

This method works in T-SQL, and just as well in U-SQL.

— b) U-SQL SAMPLE

However, there is an easier way: U-SQL has a 'SAMPLE' clause that automates this process. Thanks to Paul (T|B) for spotting this beauty.
SAMPLE Syntax:
SELECT xx FROM xx [SAMPLE [ANY (number of rows) | UNIFORM (fraction of rows, e.g. 0.01 for 1%)]]

There are two sampling options here, ANY and UNIFORM.

After not being able to find anything on the tinterwebs about them, I ran some tests to see what they do and how well the methods compare to each other. The following code runs some U-SQL over a simple two-column csv file containing an arbitrary id and a name. The 640MB file contains 400 names, each repeated a number of times to build 40m rows. The names were repeated using a normal frequency distribution to make the data more representative of real-world data.

To assess the output we can look at the distribution of the sampled data to see how closely it correlates to the distribution of the original dataset.

The U-SQL code looks like this:

DECLARE @INPUT_FILE string = @"/AWPF_Demo/Data/names.csv" ;   //40601850 ROWS
DECLARE @OUTPUT_FILE string = @"/AWPF_Demo/Data/names_results.csv" ;
 
@rawdata = 
    EXTRACT 
        id INT,
        name string
    FROM @INPUT_FILE 
    USING Extractors.Text();
 
//--Allocate each row a row number, and take 1 in every 100
@sequenceddata = 
    SELECT *
        , ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM @rawdata;
 
@sampleddata1 = 
    SELECT id, name
    FROM @sequenceddata
    WHERE rn % 100 == 0;
 
//--Use the SAMPLE ANY clause in U-SQL
@sampleddata2 = 
    SELECT id, name
    FROM @rawdata
    SAMPLE ANY(406018); //--manually calculated as 1% of the input row count
 
//--Use the SAMPLE UNIFORM clause in U-SQL
@sampleddata3 = 
    SELECT id, name
    FROM @rawdata
    SAMPLE UNIFORM(0.01); //--1%
 
//--Find the name distribution of the original data, and sampled datasets
@outputbaseline = 
    SELECT name, COUNT(*) AS Qty
    FROM @rawdata
    GROUP BY name;
 
@outputdata1 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata1
    GROUP BY name;
 
@outputdata2 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata2
    GROUP BY name;
 
@outputdata3 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata3
    GROUP BY name;
 
//--Join all datasets together for analysis
@Output = 
    SELECT b.name
        , b.Qty AS QtyOrig
        , o1.Qty AS QtyMod
        , o2.Qty AS QtyANY
        , o3.Qty AS QtyUNIFORM
    FROM @outputbaseline AS b
        LEFT JOIN @outputdata1 AS o1 ON o1.name==b.name
        LEFT JOIN @outputdata2 AS o2 ON o2.name==b.name
        LEFT JOIN @outputdata3 AS o3 ON o3.name==b.name;
 
//--Output the data
OUTPUT @Output
TO @OUTPUT_FILE
ORDER BY QtyOrig DESC
USING Outputters.Text(quoting : true);

So what happens when we take the resulting data and plot the sampled distributions against each other?

  • The Blue line is the number of times each name appears in the original dataset (on the right axis).
  • The Orange line is the distribution from the ROW_NUMBER() and Modulus.
  • The Yellow line is using U-SQL’s SAMPLE UNIFORM.
  • The Grey line is using U-SQL’s SAMPLE ANY.

As you can see, SAMPLE ANY is a terrible option for maintaining data accuracy. In effect it looks like it just takes the top x rows from the file and discards the rest, which, as explained earlier, is a bad idea.
However, the ROW_NUMBER/Mod and SAMPLE UNIFORM approaches are both staggeringly accurate to the original, with variances of +/-2% for each name. That isn't good enough for exact numerical calculations (total sales £, for example), but for looking at trends over very large datasets this sampling approach is a good option.
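
One way to put a number on that variance is to scale the 1% sampled counts back up and compare them with the original counts. A minimal sketch, reusing the @outputbaseline and @outputdata3 rowsets from the script above (the calculation shown is just one way of measuring it):

//--Scale the 1% sampled counts back up (x100) and express the difference as a % of the original
@variance =
    SELECT b.name,
           b.Qty AS QtyOrig,
           ((double) o3.Qty * 100.0 - (double) b.Qty) / (double) b.Qty * 100.0 AS UniformVariancePct
    FROM @outputbaseline AS b
         INNER JOIN @outputdata3 AS o3 ON o3.name == b.name;

You'd then OUTPUT @variance to a file in the same way as the main results.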

So, should you use ROW_NUMBER/Mod or SAMPLE UNIFORM? Obviously SAMPLE UNIFORM is the simpler code, but how do the two compare in terms of performance?

  • The ROW_NUMBER/Mod approach, using the above dataset, used a single vertex with a total compute time of 29s, read 640MB and wrote 5KB.
  • The SAMPLE ANY approach used two vertices, with a combined compute time of 2s, read 34MB and wrote 5KB.
  • The SAMPLE UNIFORM approach used four vertices, with a combined compute time of 26s, read 766MB and wrote 5KB.

So SAMPLE ANY, although poor for data accuracy, allows much faster execution by only reading a small section of the data.
The ROW_NUMBER/Mod and SAMPLE UNIFORM approaches are very comparable in terms of performance, so it wouldn't surprise me if they were doing something similar under the hood. However, for simplicity I'd recommend the SAMPLE UNIFORM method.
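
To actually use the sample in Power BI, you'd just OUTPUT the sampled rowset to a file in the lake and connect Power BI to that file instead of the full dataset. A minimal sketch (the output path is an assumption):

//--Write the 1% UNIFORM sample out as a csv for Power BI to connect to
OUTPUT @sampleddata3
TO @"/AWPF_Demo/Data/names_sample.csv"
USING Outputters.Csv();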

Frog-Blog-Out

What is U-SQL?

By now you may have heard about U-SQL, the new kid on the query language block. But what is U-SQL? Where is it? What’s it for?

I was lucky enough to be at the 2015 MVP Summit in Redmond, at which one of the sessions was hosted by Michael Rys (@MikeDoesBigData), talking about U-SQL. As its creator, there's no-one better to learn the ropes from. I was pretty much blown away by what it can do and the ease of access, so I've spent the last couple of months playing around with it and doing a few talks about it at conferences.

But there's a lot of confusion, so here's my high-level summary of what it is and where it fits, so you can judge whether it's relevant to you.

What Does U-SQL Look Like?

It's a hybrid of T-SQL and C#. Think of it as writing a SQL query, but being able to embed the power of C# and the .Net framework.

So you could write a query that looks like this:

@CustomersClean =
  SELECT *
     ,((CustomerGender=="M" || CustomerGender=="F") ? CustomerGender : "-") AS CustomerGender_Clean
     ,MyUSQLApp.Udfs.ValidateDate(CustomerDoB) AS CustomerDoB_Clean
  FROM @Customers;

The basic SELECT * FROM Table syntax is familiar to anyone used to T-SQL, but with the addition of some C# to clean up the Gender and DoB columns. For the DoB, we're calling a user-defined C# function (ValidateDate) that sits in an assembly behind the query.
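
The ValidateDate function itself isn't shown here, but as a hedged sketch, the C# code-behind might look something like this (the exact signature and validation rule are assumptions; only the MyUSQLApp.Udfs.ValidateDate name comes from the query above):

using System;

namespace MyUSQLApp
{
    public static class Udfs
    {
        // Return the date of birth in a consistent format, or "-" if it can't be parsed
        public static string ValidateDate(string dob)
        {
            DateTime parsed;
            return DateTime.TryParse(dob, out parsed)
                ? parsed.ToString("yyyy-MM-dd")
                : "-";
        }
    }
}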

How do I use U-SQL?

U-SQL is part of Azure Data Lake. Data Lake is a core component of the Cortana Analytics Suite, Microsoft's cloud/Azure data platform. Think of Data Lake Storage as a massively scalable, high-performance staging and permanent storage area for all of your structured, semi-structured and unstructured data, based on the Hadoop File System (HDFS). It can store both tables of data and files (csv, text, images, whatever you want to store).

Once your data is held in Data Lake Storage, you can use Data Lake Analytics to do cool stuff with it. This could include ‘Big Data’ processing and analytics, i.e. running large and complex parallel analytics queries over your data, whether that’s MB, GB, TB or even PB. Or it could be transforming and manipulating your data into different structures.

Either way, Data Lake Analytics uses U-SQL as its language. At the moment this is the only place that U-SQL exists, but given its power and flexibility I'm hoping it will also have a bright future outside of Azure Data Lake.

You can write and execute U-SQL directly in the Azure Portal, or for a much richer development environment you can install some plugins into Visual Studio 2013 or 2015. I recommend the latter.

Download the Azure SDK for .Net (2.7.2+) here

Download the Data Lake Tools for Visual Studio 2013 or 2015 here

So it’s a Big Data tool?

Yes, and no.

Yes it has the scalability, performance and functionality for Big Data, and this is what it’s designed to do. When you write a simple U-SQL script, the Data Lake Analytics engine takes the query and parallelises it, scaling it up to as many nodes as you want. So it’s scalable. Very scalable. It’s being used now for complex big data analytics jobs with TB of data, and can scale way beyond that.

But it's more than this: I also see it as a data preparation and transformation tool. If you want an Azure-based BI solution, maybe using Azure SQL DW, or even a smaller-scale system using Azure SQL DB, until now we haven't had a PaaS ETL tool in Azure. But now we do! U-SQL provides a very powerful and comprehensive platform for all your data transformation needs.

What about Azure Data Factory (ADF)? – No, ADF isn't an ETL tool; it's an orchestration and coordination tool, and isn't realistic as a data transformation tool.

What about ELT? – Yes, that's fair, to a point. But in my experience the ELT approach actually needs to be ETLT to work properly. The second 'T' is taken care of inside the SQL engine, but a lot of source data has to be pre-processed before it can even be landed in a SQL Server table, and U-SQL gives us power there that T-SQL can't.
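
As a rough illustration of that pre-processing step, U-SQL can clean and reshape a raw file before it ever gets near a SQL Server table. A minimal sketch (file paths, columns and rules are all assumptions):

DECLARE @RAW_FILE string = @"/Landing/customers_raw.csv";      //--hypothetical raw extract
DECLARE @CLEAN_FILE string = @"/Staging/customers_clean.csv";  //--cleaned file ready to load

@raw =
    EXTRACT CustomerId string,
            CustomerGender string,
            CustomerDoB string
    FROM @RAW_FILE
    USING Extractors.Csv(silent : true);   //--skip rows that don't match the schema

//--Standardise values and drop rows with no usable id before landing the data
@clean =
    SELECT CustomerId,
           ((CustomerGender == "M" || CustomerGender == "F") ? CustomerGender : "-") AS CustomerGender,
           CustomerDoB
    FROM @raw
    WHERE CustomerId != "";

OUTPUT @clean
TO @CLEAN_FILE
USING Outputters.Csv();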

Summary

I'm not going to go into any more detail here on how to use or write U-SQL; there will be more posts to come on that.

But I hope this provides some clarity on what it is and where it fits in our ever expanding data platform world.

Data Lake went into public preview in October 2015, so it’s ready and waiting for you to have a go!

 

Frog-blog-out

 
