What is U-SQL?
By now you may have heard about U-SQL, the new kid on the query language block. But what is U-SQL? Where is it? What’s it for?
I was lucky enough to be at the 2015 MVP Summit in Redmond, at which one of the sessions was hosted by Michael Rys (@MikeDoesBigData), talking about U-SQL. As it’s creator, there’s no-one better to learn the ropes from. I was pretty much blown away by what it can do and the ease of access, so I’ve spent the last couple of months playing around with it, and doing a few talks at conferences about it.
But there’s a lot of confusion over what it is, so here’s my high level summary of what it is and where it fits, so you can judge whether it’s relevant to you.
What Does U-SQL Look Like?
It’s a hybrid of T-SQL and C#. Think if it as writing a SQL query, but being able to embed the power of C# and the .Net framework.
So you could write a query that looks like this:
1 2 3 4 5
@CustomersClean = SELECT * ,((CustomerGender=="M" || CustomerGender=="F") ? CustomerGender : "-") AS CustomerGender_Clean, ,MyUSQLApp.Udfs.ValidateDate(CustomerDoB) AS CustomerDoB_Clean FROM @Customers;
So the basic SELECT * FROM Table syntax is familiar to anyone used to T-SQL. But with the addition of some C# to clean up the Gender and DoB columns. For the DoB, we’re calling a user defined c# function (ValidateDate) that sits in an assembly behind the query.
How do I use U-SQL?
U-SQL is a part of Azure Data Lake. Data Lake is a core component in the Cortana Analytics Suite, Microsoft’s cloud/Azure data platform. Think of Data Lake Storage as a massively scaleable and high performance staging and permanent storage area for all of your structured, semi-structured and non-structured data, based on the Hadoop File System (HDFS). It can store both tables of data and files (csv, text, images, whatever you want to store).
Once your data is held in Data Lake Storage, you can use Data Lake Analytics to do cool stuff with it. This could include ‘Big Data’ processing and analytics, i.e. running large and complex parallel analytics queries over your data, whether that’s MB, GB, TB or even PB. Or it could be transforming and manipulating your data into different structures.
Either way, Data Lake Analytics uses U-SQL as it’s language. At the moment this is the only place that U-SQL exists, but given its power and flexibility I’m hoping it will also have a bright future outside of Azure Data Lake.
You can write and execute U-SQL directly in the Azure Portal, or for a much richer development environment you can install some plugins into Visual Studio 2013 or 2015. I recommend the latter.
So it’s a Big Data tool?
Yes, and no.
Yes it has the scalability, performance and functionality for Big Data, and this is what it’s designed to do. When you write a simple U-SQL script, the Data Lake Analytics engine takes the query and parallelises it, scaling it up to as many nodes as you want. So it’s scalable. Very scalable. It’s being used now for complex big data analytics jobs with TB of data, and can scale way beyond that.
But it’s more than this, I also see it as a data preparation and transformation tool. If you want an Azure based BI solution, maybe using Azure SQL DW, or even a smaller scale system using Azure SQL DB, then we don’t yet have a PAAS ETL tool in Azure. But now we do! This provides a very powerful and comprehensive platform for all your data transform needs.
What about Azure Data Factory (ADF)? – No, ADF an ETL tool, it’s an orchestration and coordination tool, and isn’t realistic as a data transformation tool.
What about ELT? – Yes that’s fair, to a point. But in my experience the ELT approach actually needs to be ETLT to work properly. The 2nd ‘T’ is taken care of inside the SQL engine, but a lot of source data has to be pre-processed before it can even be landed in a SQL Server table, U-SQL provides that power to us that T-SQL cannot provide.
I’m not going to go into any more detail here with how to use or write U-SQL, there will be more posts to come on that.
But I hope this provides some clarity on what it is and where it fits in our ever expanding data platform world.
Data Lake went into public preview in October 2015, so it’s ready and waiting for you to have a go!