U-SQL

Business Intelligence in Azure – SQLBits 2018 Precon

What can you expect from my SQLBits pre-conference training day in February 2018 at the London Olympia?

Well my friends, in short, we are going to take a whirlwind tour of the entire business intelligence stack of services in Azure. No stone will be left unturned. No service will be left without scalability. We’ll cover them all and we certainly aren’t going to check with the Azure bill payer before turning up the compute on our data transforms.



What will we actually cover?

With new cloud services and advancements in locally hosted platforms, developing a lambda architecture is becoming the new normal. In this full day of high-level training we’ll learn how to architect hybrid business intelligence solutions using Microsoft Azure offerings. We’ll explore the roles of these cloud data services and how to make them work for you in this complete overview of business intelligence on the Microsoft cloud data platform.

Here’s how we’ll break that down during the day…

Module 1 – Getting Started with Azure

Using platform-as-a-service products is great, but let’s take a step back. To kick off we’ll cover the basics for deploying and managing your Azure services. Navigating the Azure portal and building dashboards isn’t always as intuitive as we’d like. What’s a resource group? And why is it important to understand your Azure Active Directory tenant?

Module 2 – An Overview of BI in Azure

What’s available for the business intelligence architect in the cloud, and how might these services relate to traditional on premises ETL and cube data flows? Is ETL enough for our big unstructured data sources or do we need to mix things up and add some more letters to the acronym in the cloud?

Module 3 – Databases in Azure (SQL DB, SQL DW, Cosmos DB, SQL MI)

It’s SQL Server Jim, but not as we know it. Check out the PaaS flavours of our long term on premises friends. Can we trade the agent and an operating system for that sliding bar of scalable compute? DTU and DWU are here to stay with new SLAs relating to throughput. Who’s on ACID and as BI people do we care?

Module 4 – The Azure Machines are here to Learn

Data scientist or developer? Azure Machine Learning was designed for applied machine learning. Use best-in-class algorithms in a simple drag-and-drop interface. We’ll go from idea to deployment in a matter of clicks. Without a terminator in sight!

Module 5 – Swimming in the Data Lake with U-SQL

Let’s understand the role of this hyper-scale two tier big data technology and how to harness its power with U-SQL, the offspring of T-SQL and C#. We’ll cover everything you need to know to get started developing solutions with Azure Data Lake.

Module 6 – IoT, Event Hubs and Azure Stream Analytics

Real-time data is everywhere. We need to use it and unlock it as a rich source of information that can be channelled to react to events, produce alerts from sensor values or in 9000 other scenarios. In this module, we’ll learn how, using Azure messaging hubs and Azure Stream Analytics.

Module 7 – Power BI, our Semantic Layer, is it All Things to All People?

Combining all our data sources in one place with rich visuals and a flexible data modelling tool, Power BI takes it all: small data, big data, streaming data, website content and more. But we really need a Venn diagram to decide when/where it’s needed.

Module 8 – Data Integration with Azure Data Factory and SSIS

The new integration runtime is here. But how do we unlock the scale out potential of our control flow and data flow? Let’s learn to create the perfect dependency driven pipeline for our data flows. Plus, how to work with the Azure Batch Service should you need that extensibility.

 

Finally, we’ll wrap up the day by playing the Azure icon game, which you’ll all now be familiar with and able to complete with a perfect score having completed this training day 🙂

Many thanks for reading and I hope to see you in February, it’s going to be magic 😉

Register now: https://www.regonline.com/registration/Checkin.aspx?EventID=2023328

All training day content is subject to change, dependent on timings and the will of the demo gods!


 

Azure Data Lake – The Services. The U-SQL. The C# (Reference Guide)

This post is a reference guide to support an event talk or webinar. The content is intended to assist the audience only. Thank you.

Abstract

How do we implement Azure Data Lake? How does a lake fit into our data platform architecture? Is Data Lake going to run in isolation or be part of a larger pipeline? How do we use and work with U-SQL? Does size matter?! The answers to all these questions and more in this session as we immerse ourselves in the lake, that’s in a cloud. We’ll take an end-to-end look at the components and understand why the compute and storage are separate services. For the developers, what tools should we be using and where should we deploy our U-SQL scripts? Also, what options are available for handling our C# code behind and supporting assemblies. We’ll cover everything you need to know to get started developing data solutions with Azure Data Lake. Finally, let’s extend the U-SQL capabilities with the Microsoft Cognitive Services!

Links

What is Azure Data Lake? The Microsoft version.
https://azure.microsoft.com/en-gb/solutions/data-lake/

Understanding the ADL Analytics Unit
https://blogs.msdn.microsoft.com/azuredatalake/2016/10/12/understanding-adl-analytics-unit/

Why use Azure Data Lake? The Microsoft version.
https://azure.microsoft.com/en-gb/solutions/data-lake/

Consuming Data Lake with Power BI – Cross tenant data refreshes.
https://www.purplefrogsystems.com/paul/2017/06/connecting-power-bi-to-azure-data-lake-store-across-tenants/

U-SQL String Data Type 128KB Limit
https://feedback.azure.com/forums/327234-data-lake/suggestions/13416093-usql-string-data-type-has-a-size-limit-of-128kb

Creating a U-SQL Merge Statement
https://www.purplefrogsystems.com/paul/2016/12/writing-a-u-sql-merge-statement/

U-SQL Looping
https://www.purplefrogsystems.com/paul/2017/05/recursive-u-sql-with-powershell-u-sql-looping/

U-SQL Date Dimension
https://www.purplefrogsystems.com/paul/2017/02/creating-a-u-sql-date-dimension-numbers-table-in-azure-data-lake/

Further Reading

Microsoft Blog – An Introduction to U-SQL in Azure Data Lake
https://blogs.msdn.microsoft.com/robinlester/2016/01/04/an-introduction-to-u-sql-in-azure-data-lake/

Microsoft Documentation – U-SQL Programmability Guide
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide

Microsoft MSDN – U-SQL Language Reference
https://msdn.microsoft.com/en-US/library/azure/mt591959(Azure.100).aspx

SQL Server Central – Stairway to U-SQL
http://www.sqlservercentral.com/stairway/142480/

Stack Overflow – U-SQL Tag
http://stackoverflow.com/questions/tagged/u-sql

 

Cognitive services with U-SQL in Azure Data Lake

https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive


Recursive U-SQL With PowerShell (U-SQL Looping)

In its natural form U-SQL does not support recursive operations, and for good reason. This is a big data, scale out, declarative language where the inclusion of procedural, iterative code would be very unnatural. That said, if you must pervert things, PowerShell can assist with the looping and, dare I say, the possibility of dynamic U-SQL.

A couple of caveats…

  • From the outset, I accept this abstraction with PowerShell to achieve some iterative process in U-SQL is a bit of a hack and very inefficient, certainly in the below example.
  • The scenario here is not perfect and was created using a simple situation for the purposes of explanation only. I know the same data results could be achieved just by extending the aggregate grouping!

Hopefully that sets the scene. As I’m writing, I’m wondering if this blog post will be hated by the purists out there. Or loved by the abstraction hackers. I think I’m both 🙂

Scenario

As with most of my U-SQL blog posts I’m going to start with the Visual Studio project available as part of the Data Lake Tools extension called ‘U-SQL Sample Application’. This gives us all the basic start-up code and sample data to get going.

Input: within the solution (Samples > Data > Ambulance Data) we have some CSV files for vehicles. These are separated into 16 source datasets covering 4 different vehicle IDs across 4 days.

Output: let’s say we have a requirement to find out the average speed of each vehicle per day. Easy enough with a U-SQL wildcard on the extractor. But we also want to output a single file for each day of data. Not so easy, unless we write one query for each day of data. Fine with samples only covering 4 days, not so fine with 2 years of records split by vehicle.

Scenario set, let’s look at how we might do this.

The U-SQL

To produce the required daily outputs I’m going to use a U-SQL query in isolation to return a distinct list of dates across the 16 input datasets, plus a parameterised U-SQL stored procedure to do the aggregation and output a single day of data.

First, getting the dates. The query below simply returns a text file containing a distinct list of all the dates in our source data.

DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";
 
@DATA =
    EXTRACT 
        [vehicle_id] int,
        [entry_id] long,
        [event_date] DateTime,
        [latitude] float,
        [longitude] float,
        [speed] int,
        [direction] string,
        [trip_id] int?,
        [filename] string
    FROM 
        @InputPath
    USING 
        Extractors.Csv();
 
@DateList =
    SELECT DISTINCT 
        [event_date].ToString("yyyyMMdd") AS EventDateList
    FROM 
        @DATA;
 
OUTPUT @DateList
TO "/output/AmbulanceDataDateList.txt"
USING Outputters.Csv(quoting : false, outputHeader : false);

Next, the stored procedure below. This uses the same input files, but does the required aggregation and outputs a single daily file matching the date parameter passed in.

CREATE PROCEDURE IF NOT EXISTS [dbo].[usp_OutputDailyAvgSpeed]
    (
    @OutputDate string
    )
AS
BEGIN
 
    //DECLARE @OutputDate string = "20140914"; //for dev testing
    DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";
    DECLARE @OutputPath string = "/output/DailyRecords/VehicleAvgSpeed_" + @OutputDate + ".csv";
 
    @DATA =
        EXTRACT 
            [vehicle_id] int,
            [entry_id] long,
            [event_date] DateTime,
            [latitude] float,
            [longitude] float,
            [speed] int,
            [direction] string,
            [trip_id] int?,
            [filename] string
        FROM 
            @InputPath
        USING 
            Extractors.Csv();
 
    @VAvgSpeed =
        SELECT DISTINCT 
            [vehicle_id],
            AVG([speed]) AS AverageSpeed
        FROM 
            @DATA
        WHERE
            [event_date].ToString("yyyyMMdd") == @OutputDate
        GROUP BY
            [vehicle_id];
 
    OUTPUT @VAvgSpeed
    TO @OutputPath
    USING Outputters.Csv(quoting : true, outputHeader : true);
 
END;

At this point, we could just execute the stored procedure with each required date, manually crafted from the text file. Like this:

[dbo].[usp_OutputDailyAvgSpeed]("20140914");
[dbo].[usp_OutputDailyAvgSpeed]("20140915");
[dbo].[usp_OutputDailyAvgSpeed]("20140916");
[dbo].[usp_OutputDailyAvgSpeed]("20140917");

Fine for small amounts of data, but we can do better for larger datasets.

Enter PowerShell and some looping.

The PowerShell

As with all things Microsoft, PowerShell is our friend and the supporting cmdlets for the Azure Data Lake services are no exception. I recommend these links if you haven’t yet written some PowerShell to control ADL Analytics jobs or upload files to ADL Storage.

Moving on. How can PowerShell help us script our data output requirements? Well, here’s the answer. In my PowerShell script below I’ve done the following:

  1. Authenticate against my Azure subscription (optionally create yourself a PSCredential to do this).
  2. Submit the first U-SQL query as a file to return the distinct list of dates.
  3. Wait for the ADL Analytics job to complete.
  4. Download the output text file from ADL storage.
  5. Read the contents of the text file.
  6. Iterate over each date listed in the text file.
  7. Submit a U-SQL job calling the stored procedure for each date in the list.
#Params...
$WhereAmI = $MyInvocation.MyCommand.Path.Replace($MyInvocation.MyCommand.Name,"")
 
$DLAnalyticsName = "myfirstdatalakeanalysis" 
$DLAnalyticsDoP = 10
$DLStoreName = "myfirstdatalakestore01"
 
 
#Create Azure Connection
Login-AzureRmAccount | Out-Null
 
$USQLFile = $WhereAmI + "RecursiveOutputPrep.usql"
$PrepOutput = $WhereAmI + "AmbulanceDataDateList.txt"
 
#Submit job
$job = Submit-AzureRmDataLakeAnalyticsJob `
    -Name "GetDateList" `
    -AccountName $DLAnalyticsName `
    -ScriptPath $USQLFile `
    -DegreeOfParallelism $DLAnalyticsDoP
 
Write-Host "Submitted USQL prep job."
 
#Wait for job to complete
Wait-AdlJob -Account $DLAnalyticsName -JobId $job.JobId | Out-Null
 
Write-Host "Downloading USQL output file."
 
#Download date list
Export-AzureRmDataLakeStoreItem `
    -AccountName $DLStoreName `
    -Path "/output/AmbulanceDataDateList.txt" `
    -Destination $PrepOutput | Out-Null
 
Write-Host "Downloaded USQL output file."
 
#Read dates
$Dates = Get-Content $PrepOutput
 
Write-Host "Read date list."
 
#Loop over dates with proc call for each
ForEach ($Date in $Dates)
    {
    $USQLProcCall = '[dbo].[usp_OutputDailyAvgSpeed]("' + $Date + '");'
    $JobName = 'Output daily avg dataset for ' + $Date
 
    Write-Host $USQLProcCall
 
    $job = Submit-AzureRmDataLakeAnalyticsJob `
        -Name $JobName `
        -AccountName $DLAnalyticsName `
        -Script $USQLProcCall `
        -DegreeOfParallelism $DLAnalyticsDoP
 
    Write-Host "Job submitted for " $Date
    }
 
Write-Host "Script complete. USQL jobs running."

At this point I think it’s worth reminding you of my caveats above 🙂

I would like to point out the flexibility of the PowerShell cmdlet Submit-AzureRmDataLakeAnalyticsJob, allowing us to pass a U-SQL file (step 2) or build up a U-SQL string dynamically within the PowerShell script and pass that as the execution code (step 7). Very handy. Switches: -ScriptPath or -Script respectively.

If all goes well you should have jobs being prepared and shortly after running to produce the daily output files.

I used 10 AUs for my jobs because I wanted to burn up some old Azure credits, but you can change this in the PowerShell variable $DLAnalyticsDoP.

Conclusion

It’s possible to achieve looping behaviour with U-SQL when we want to produce multiple output files, but only when we abstract the iterative behaviour away to our friend PowerShell.

Comments welcome on this approach.

Many thanks for reading.

 

PS. To make life a little easier, I’ve stuck all of the above code and sample data into a GitHub repository to save you copying and pasting things from the code windows above.

https://github.com/mrpaulandrew/RecursiveU-SQLWithPowerShell

 

 


Calling U-SQL Stored Procedures with C# Code Behind

So friends, some more lessons learnt when developing with U-SQL and Azure Data Lake. I’ll try and keep this short.

Problem

You have a U-SQL stored procedure written and working fine within your Azure Data Lake Analytics service. But you need to add some more business logic, or something requiring a little C# magic. This is the main thing I love about U-SQL, having that C# code behind file where I can extend my normal SQL behaviour. So, being a happy little developer you write your class and method to support the U-SQL above and you recreate your stored procedure. Great!

Next, you try to run that stored procedure…

[ExampleDatabase].[dbo].[SimpleProc]();

But are hit with an error, similar to this:

E_CSC_USER_INVALIDCSHARP: C# error CS0103: The name ‘SomeNameSpaceForCodeBehind’ does not exist in the current context.


Why?

Submitting U-SQL queries containing C# code behind methods works fine normally. But once you wrap it up as a stored procedure within the ADL Analytics database the compiled C# is lost. Almost as if the U-SQL file/procedure no longer has its lovely code behind file at all!

Just to be explicit with the issue, here is an example stored procedure that I’ve modified from the Visual Studio U-SQL Sample Application project. Note my GetHelloWorld method that I’ve added just for demonstration purposes.

DROP PROCEDURE IF EXISTS [dbo].[SimpleProc];
 
CREATE PROCEDURE [dbo].[SimpleProc]()
AS
BEGIN
 
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
 
    @WithCodeBehind =
        SELECT 
            *,
            SomeNameSpaceForCodeBehind.MyCodeBehind.GetHelloWorld() AS SomeText
        FROM @searchlog;
 
    OUTPUT @WithCodeBehind
    TO "/output/SearchLogResult1.csv"
    USING Outputters.Csv();
 
END;

This U-SQL file then has the following C#, with my totally original naming conventions. No trolls please, this is not the point of this post 🙂

namespace SomeNameSpaceForCodeBehind
{
    public class MyCodeBehind
    {
        static public string GetHelloWorld()
        {
            string text = "HelloWorld";
            return text;
        }
    }
}

So, this is what doesn’t work. Problem hopefully clearly defined.

Solution

To work around this problem, instead of using a C# code behind file for the procedure we need to move the class into its own assembly. This requires a little more effort and plumbing, but does solve the problem. Plus, this approach is probably more familiar to people that have worked with CLR functions in SQL Server that they want to use within a stored procedure.

This is what we need to do.

  • Add a C# class library to your Visual Studio solution and move the U-SQL code behind into a library namespace.

  • Build the library and use the DLL to create an assembly within the ADL Analytics database. The DLL can live in your ADL store root, be inlined, or be created from Azure Blob Storage. I have another post on that here if you’re interested.
CREATE ASSEMBLY IF NOT EXISTS [HelloWorld] FROM "assembly/ClassLibrary1.dll";
  • Finally, modify your stored procedure to use the assembly instead of the code behind namespace. The new stored procedure should look like this.
DROP PROCEDURE IF EXISTS [dbo].[SimpleProc];
 
CREATE PROCEDURE [dbo].[SimpleProc]()
AS
BEGIN
 
    //Compiled library:
    REFERENCE ASSEMBLY [HelloWorld];
 
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
 
    @WithCodeBehind =
        SELECT *,
               //Changed to use the assembly:
               HelloWorld.ClassLibrary1.GetHelloWorld() AS SomeText
        FROM @searchlog;
 
    OUTPUT @WithCodeBehind
    TO "/output/SearchLogResult1.csv"
    USING Outputters.Csv();
 
END;

This new procedure executes without error and gets around the problem above.

I hope this helps and allows you to convert those complex U-SQL scripts to procedures, while retaining any valuable code behind functionality.

Many thanks for reading

Storing U-SQL Assemblies in Azure Blob Storage

I’m hoping the title of this post is fairly self-explanatory. You’re here because, like me, you found that the MSDN language reference page for creating U-SQL assemblies states that it’s possible to store the DLLs in Azure Blob Storage. But it doesn’t actually tell you how. Well, please continue my friends and I’ll show you how.

The offending article: https://msdn.microsoft.com/en-us/library/azure/mt763293.aspx

The offending text snippet:

“Assembly_Source
Specifies the assembly DLL either in form of a binary literal or as a string literal or static string expression/string variable. The binary literal represents the actual .NET assembly DLL, while the string values represent a URI or file path to a .NET assembly DLL file in either an accessible Azure Data Lake Storage or Windows Azure Blob Storage. If the provided source is a valid .NET assembly, the assembly will be copied and registered, otherwise an error is raised.”

Before going any further, this post isn’t a dig at the usual lack of Microsoft documentation. Mainly because when I posted this problem as a question on Stack Overflow the missing information was provided from the horse’s mouth, Mr Michael Rys (@MikeDoesBigData). Therefore, all is forgiven and I’m more than happy to write this post on Microsoft’s behalf and for my fellow developers. #SQLFamily

http://stackoverflow.com/questions/40842170/create-usql-assembly-from-dll-in-azure-blob-storage

Thanks again Mike. Moving on…

Assumptions

Within your Azure subscription you have the following services already deployed and running.

  • Azure Data Lake Analytics (ADLa)
  • Azure Data Lake Store (ADLs)
  • Azure Storage, with a suitable blob container.

You are also comfortable with referencing assemblies in your U-SQL scripts and so far have done so by inlining the compiled assembly in the U-SQL file, or have stored the DLL in ADLs with a simple file path reference to the ADLs root directory.
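For reference, here’s a minimal sketch of that ADLs file path flavour; the assembly name and path are just placeholders, not names from this project.

//Register a DLL already uploaded to the ADLs root under /assemblies/ (hypothetical names):
CREATE ASSEMBLY IF NOT EXISTS [MyHelpers] FROM "/assemblies/MyHelpers.dll";

//Then, in the consuming script:
REFERENCE ASSEMBLY [MyHelpers];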

Granting Access

The most important thing you’ll need to do to get this working, as Mike mentions in the SO answer, is allow your ADLa service to access the blob storage account. This only requires a few clicks in the Azure portal.

From the ADLa blade choose Data Sources and click Add Data Source.

Populate the preceding blade drop down menus with your preferred choices and click Add. You should then have the storage account listed as an ADLa data source, as below.

Note: the Azure Storage account doesn’t need to be in the same data centre as the ADLa service, unlike ADLa and ADLs.

Create Assembly

Next the U-SQL.

To reference a DLL in the blob storage account container we need to create the assembly using the wasb URL. Like this:

wasb://YourBlobContainerName@YourBlobStorageAccountName.blob.core.windows.net/YourAssembly.dll

Complete CREATE ASSEMBLY syntax.

CREATE ASSEMBLY IF NOT EXISTS [YourSchema].[PurpleFrog.Pauls.DataLakeHelperFunctions]
FROM "wasb://AllSupportingFiles@MiscBlobsAccount.blob.core.windows.net/PurpleFrog.Pauls.DataLakeHelperFunctions.dll";

Why Do This

Hopefully pre-empting some comments on this post: given that we can inline the assembly and store it in ADLs, why would you want to put the DLLs in a separate storage account?

Well, this is really just for operational convenience. In a recent project I was working on we had created a lot of custom code, not just for Azure Data Lake, but also Azure Data Factory. We therefore used a blob storage account as a support bucket for all compiled code and parent object files. This gave us a centralised place to deploy to regardless of what service was consuming the libraries. Again, just for convenience. All DLLs in one place for all services.

I hope you found this post helpful.

Many thanks for reading.


Creating a U-SQL Date Dimension & Numbers Table in Azure Data Lake

Now we all know what a date dimension is and there are plenty of really great examples out there for creating them in various languages. Well, here’s my U-SQL version creating the output from scratch using a numbers table. Remember that U-SQL needs to be handled slightly differently because we don’t have any iterative functionality available. Plus its ability to massively parallelise jobs means we can’t write something that relies on procedural code.

This is version 1…

//Enter start and end dates for the date dimension:
DECLARE @StartDate DateTime = new DateTime(2017,1,1);
DECLARE @EndDate DateTime = new DateTime(2018,12,31);
 
//Create numbers table
@Numbers0 = SELECT * FROM (VALUES(0)) AS Row (Number);
@Numbers1 = SELECT [Number] FROM @Numbers0 
    UNION ALL SELECT [Number]+1 AS Number FROM @Numbers0;
@Numbers2 = SELECT [Number] FROM @Numbers1 
    UNION ALL SELECT [Number]+2 AS Number FROM @Numbers1;
@Numbers4 = SELECT [Number] FROM @Numbers2 
    UNION ALL SELECT [Number]+4 AS Number FROM @Numbers2;
@Numbers8 = SELECT [Number] FROM @Numbers4 
    UNION ALL SELECT [Number]+8 AS Number FROM @Numbers4;
@Numbers16 = SELECT [Number] FROM @Numbers8 
    UNION ALL SELECT [Number]+16 AS Number FROM @Numbers8;
@Numbers32 = SELECT [Number] FROM @Numbers16 
    UNION ALL SELECT [Number]+32 AS Number FROM @Numbers16;
@Numbers64 = SELECT [Number] FROM @Numbers32 
    UNION ALL SELECT [Number]+64 AS Number FROM @Numbers32;
//Double it again if you want it bigger...
 
//Create date dimension
@DateDimension = 
SELECT 
    int.Parse([Date].ToString("yyyyMMdd")) AS DateKey,
    [Date],
    [Date].ToString("dd/MM/yyyy") AS DateString,
    [Date].Day AS Day,
    [Date].Month AS Month,
    Math.Floor(((decimal)[Date].Month + 2) / 3) AS Quarter,
    [Date].Year AS Year,
    Convert.ToInt32([Date].DayOfWeek) + 1 AS DayOfWeekNo,
    [Date].ToString("dddd") AS DayName,
    [Date].ToString("MMMM") AS MonthName,
    [Date].Month >=4 ? [Date].ToString("yyyy")+"/"+([Date].AddYears(+1)).ToString("yy") 
        : ([Date].Year - 1).ToString() + "/" + [Date].ToString("yy") AS FinancialYear,
    DateTimeFormatInfo.CurrentInfo.Calendar.GetWeekOfYear(
        [Date], CalendarWeekRule.FirstDay, System.DayOfWeek.Sunday) AS WeekNoOfYear
FROM
    (
    SELECT 
        @StartDate.AddDays(Convert.ToDouble([RowNumber]) -1) AS Date
    FROM 
        (
        SELECT
            ROW_NUMBER() OVER (ORDER BY n1.[Number]) AS RowNumber
        FROM 
            @Numbers64 AS n1
            CROSS JOIN @Numbers64 AS n2 //make it big!
        ) AS x
    ) AS y
WHERE
    [Date] <= @EndDate; //cheat to cut off results
 
 
//Output files
OUTPUT @DateDimension
TO "/Stuff/DateDimension.csv"
ORDER BY [Date] ASC
USING Outputters.Csv(quoting : true, outputHeader : true);
 
//Get a numbers table as a bonus :-)
OUTPUT @Numbers64
TO "/Stuff/Numbers.csv"
ORDER BY [Number] ASC
USING Outputters.Csv(quoting : true, outputHeader : true);

In version 2 I may use a replicated string that EXPLODEs from an array rather than using a numbers table. But that’s for another time. I included the numbers table as an output in this one as a little bonus 🙂
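For the curious, here’s a rough, untested sketch of what that version 2 idea might look like, generating the numbers with CROSS APPLY EXPLODE over a C# generated array instead of the doubling UNION ALLs above (the 730 row count is just an example):

//Sketch only: explode a generated array into a numbers rowset.
@Numbers =
    SELECT r.Number
    FROM (VALUES(1)) AS Row (Dummy)
    CROSS APPLY EXPLODE(new SQL.ARRAY<int>(Enumerable.Range(0, 730))) AS r(Number);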

Hope this helps you out while swimming in the Azure Data Lake.

Many thanks for reading.


Passing Parameters to U-SQL from Azure Data Factory

Let’s try and keep this post short and sweet. Diving right in imagine a scenario where we have an Azure Data Factory (ADF) pipeline that includes activities to perform U-SQL jobs in Azure Data Lake (ADL) Analytics. We want to control the U-SQL by passing the ADF time slice value to the script, hopefully a fairly common use case. This isn’t yet that intuitive when constructing the ADF JSON activity so I hope this post will save you some debugging time.

For my example I’ve created a stored procedure in my ADL Analytics database that accepts a parameter @TimeSliceStart as a string value in the format yyyyMMdd.

[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);
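For context, the procedure wrapper might look something like the sketch below. The body is omitted because it isn’t important for this post; the only part that matters for ADF is the string parameter.

USE DATABASE [ExampleDatabase];

CREATE PROCEDURE IF NOT EXISTS [dbo].[usp_DoSomeStuff]
    (
    @TimeSliceStart string
    )
AS
BEGIN
    //Body omitted: extract, transform and output, filtered on @TimeSliceStart.
END;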

This doesn’t have to be a stored procedure. ADF is also happy if you give it U-SQL files or even just inline the entire script. Regardless, the ADF parameter handling is the same.

In my ADF JSON activity I then have the following:

{
    "name": "DataLakeJob1",
    "properties": {
        "description": "Run USQL with timeslice param",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "script": "[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);",
                    "degreeOfParallelism": 5,
                    "priority": 1,
                    "parameters": {
                        "TimeSliceStart": "$$Text.Format('{0:yyyyMMdd}', Time.AddMinutes(SliceStart, 0))"
                    }
                }
                // etc ...
            }
        ]
        // etc ...
    }
}

Notice in the extended properties we have a parameters attribute that can include children of the actual variables we want to pass to the U-SQL script.

Here are the important things to understand about this ADF parameters attribute.

  • The name of the parameter must match the name of the variable expected by U-SQL exactly.
  • As you would expect the data types of the expected variable and JSON parameter must match.
  • It is perfectly acceptable to have multiple parameters in the ADF JSON, written in any order.

So how does this work?…

What ADF does when calling ADL is take the parameters listed in the JSON and write out a bunch of U-SQL ‘DECLARE @Variable’ lines. These then get prepended to the top of the actual U-SQL script before giving it to ADL as a job to run. You can see this if you go into the ADL Analytics blades in the Azure portal, select the job created by the ADF activity then choose Duplicate Script. This reveals the actual U-SQL used in the job.
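To illustrate, the duplicated script for the activity above would look something like this; the date value is just an example of what the $$Text.Format expression produces for a given slice.

DECLARE @TimeSliceStart string = "20170101"; //Generated by ADF from the activity parameters
[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);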


Just knowing what ADF does when converting the JSON parameters to U-SQL declared variables is the main takeaway here.

That’s it! I promised short and sweet.

Many thanks for reading.


Writing a U-SQL Merge Statement

Unlike T-SQL, U-SQL does not currently support MERGE statements. Our friend that we have come to know and love since its introduction in SQL Server 2008. Not only that, but U-SQL also doesn’t currently support UPDATE statements either… I know… Open mouth emoji required! This immediately leads to the problem of change detection in our data and how, for example, we should handle the ingestion of a daily rolling 28-day TSV extract, requiring a complete year to date output. Well in this post we will solve that very problem.

Now before we move on it’s worth pointing out that U-SQL is now our/my language of choice for working with big data in Azure using the Data Lake Analytics service. It’s not yet the way of things for our on premises SQL Server databases, so relax. T-SQL or R are still our out-of-the-box tools there (SQL Server 2016). Also if you want to take a step back from this slightly deeper U-SQL topic and find out what U-SQL is first, I can recommend my purple amphibious colleague’s blog, link below.

https://www.purplefrogsystems.com/blog/2016/02/what-is-u-sql/

Assuming you’re comfortable with the U-SQL basics, let’s move on. For the below examples I’m working with Azure Data Lake (ADL) Analytics, deployed in my Azure MSDN subscription, although we can do everything here in the local Visual Studio emulator without the cloud service (very cool and handy for development). I also have the Visual Studio Data Lake tools for the service available and installed. Specifically for this topic I have created a ‘U-SQL Sample Application’ project to get us started. This is simply for ease of explanation and so you can get most of the setup code for what I’m doing here without any great difficulty. Visual Studio Data Lake tools download link below if needed.

https://www.microsoft.com/en-us/download/details.aspx?id=49504


Once we have this solution available, including its Ambulance and Search Log samples, please find where your Visual Studio Cloud Explorer panel (Ctrl + \, Ctrl + X) is hiding, as we’ll use this to access the local Data Lake Analytics database on your machine.

Database Objects

To get things rolling, open the U-SQL file from the sample project called ‘SearchLog-4-CreatingTable’ and execute (AKA ‘Submit’) this to run locally against your ADL Analytics instance. This gives us a database and target table to work with for the merge operation. It also inserts some tab-separated sample data into the table. If we don’t insert this initial dataset you’ll find joining to an empty table will prove troublesome.
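If you don’t have the sample project to hand, the script does roughly the following. This is only an approximation; the column names match the sample, but the index and distribution details may differ slightly.

CREATE DATABASE IF NOT EXISTS SearchLogDemo;
USE DATABASE SearchLogDemo;

CREATE TABLE IF NOT EXISTS dbo.SearchLog
(
    UserId int,
    Start DateTime,
    Region string,
    Query string,
    Duration int?,
    Urls string,
    ClickedUrls string,
    INDEX idx_SearchLog CLUSTERED (UserId ASC) DISTRIBUTED BY HASH (UserId)
);

//The sample then extracts SearchLog.tsv and INSERTs it into this table.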

Now U-SQL is all about extracting and outputting at scale. There isn’t any syntax sugar to merge datasets. But do we really need the sugar? No. So, we are going to use a database table as our archive or holding area to ensure we get the desired ‘upsert’ behaviour. Then write our MERGE statement long hand using a series of conventional joins. Not syntax sugar. Just good old fashioned joining of datasets to get the old, the new and the changed.

Recap of the scenario; we have a daily rolling 28-day input, requiring a full year to date output.

Merging Data

Next open the U-SQL file from the sample project called ‘SearchLog-1-First_U-SQL_Script’. This is a reasonable template to adapt as it contains the EXTRACT and OUTPUT code blocks already.

For the MERGE we next need a set of three SELECT statements joining both the EXTRACT (new/changed data) and table (old data) together. These are as follows, in mostly English type syntax first 🙂

  • For the UPDATE, we’ll do an INNER JOIN. Table to file. Taking fields from the EXTRACT only.
  • For the INSERT, we’ll do a LEFT OUTER JOIN. File to table. Taking fields from the EXTRACT where NULL in the table.
  • To retain old data, we’ll do a RIGHT OUTER JOIN. File to table. Taking fields from the table where NULL in the file.

Each of the three SELECT statements can then have UNION ALL conditions between them to form a complete dataset including any changed values, new values and old values loaded by a previous file. This is the code you’ll want to add for the example in your open file between the extract and output code blocks. Please don’t just copy and paste without understanding it.

@AllData =
    --update current
    SELECT e1.[UserId],
           e1.[Start],
           e1.[Region],
           e1.[Query],
           e1.[Duration],
           e1.[Urls],
           e1.[ClickedUrls]
    FROM [SearchLogDemo].[dbo].[SearchLog] AS t1
         INNER JOIN
             @searchlog AS e1
         ON t1.[UserId] == e1.[UserId]
 
    UNION ALL
 
    --insert new
    SELECT e2.[UserId],
           e2.[Start],
           e2.[Region],
           e2.[Query],
           e2.[Duration],
           e2.[Urls],
           e2.[ClickedUrls]
    FROM @searchlog AS e2
         LEFT OUTER JOIN
             [SearchLogDemo].[dbo].[SearchLog] AS t2
         ON t2.[UserId] == e2.[UserId]
    WHERE
    t2.[UserId] IS NULL
 
    UNION ALL
 
    --keep existing
    SELECT t3.[UserId],
           t3.[Start],
           t3.[Region],
           t3.[Query],
           t3.[Duration],
           t3.[Urls],
           t3.[ClickedUrls]
    FROM @searchlog AS e3
         RIGHT OUTER JOIN
             [SearchLogDemo].[dbo].[SearchLog] AS t3
         ON t3.[UserId] == e3.[UserId]
    WHERE
    e3.[UserId] IS NULL;

This union of data can then OUTPUT to our usable destination doing what U-SQL does well before resetting our ADL Analytics table for the next load. By reset, I mean TRUNCATE the table and INSERT everything from @AllData back into it. This preserves our history/our old data and allows the MERGE behaviour to work again and again using only SELECT statements.

Replacing the OUTPUT variable from @searchlog, you’ll then want to add the following code below the three SELECT statements.

OUTPUT @AllData
TO "/output/SearchLogAllData.csv"
USING Outputters.Csv();
 
TRUNCATE TABLE [SearchLogDemo].[dbo].[SearchLog];
 
INSERT INTO [SearchLogDemo].[dbo].[SearchLog]
(
    [UserId],
    [Start],
    [Region],
    [Query],
    [Duration],
    [Urls],
    [ClickedUrls]
)
SELECT [UserId],
       [Start],
       [Region],
       [Query],
       [Duration],
       [Urls],
       [ClickedUrls]
FROM @AllData;

If all goes well you can edit the ‘SearchLog.tsv’ file, removing and changing data, and keep rerunning the U-SQL script performing the MERGE behaviour. Please test away. Don’t just believe me that it works. As a bonus you get a pretty job diagram in Visual Studio too.


The only caveat here is that we can’t deal with deletion detection from the source file… unless we do something a little more complex for the current loading period. Let’s save that for a later blog post.

A couple of follow-up general tips.

  • Have a USE DATABASE condition at the top of your scripts to ensure you hit the correct database, just like T-SQL (see the snippet after this list).
  • If you’re struggling for fields to join on because you don’t have a primary key, you could use UNION instead of UNION ALL. But this of course takes more effort to work out the distinct values. Just like T-SQL.
  • Be careful with C# data types and case sensitivity. U-SQL is not as casual as T-SQL with such things.
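For the first tip, that’s simply a line like this at the top of the script, using the database name from the example above.

USE DATABASE [SearchLogDemo];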

That’s it. U-SQL merge behaviour achieved. I guess the bigger lesson here for the many T-SQL people out there is: don’t forget the basics, it’s still a structured query language. Syntax sugar is sweet, but not essential.

Hope this was helpful.

Many thanks for reading


Paul’s Frog Blog

Paul is a Microsoft Data Platform MVP with 10+ years’ experience working with the complete on premises SQL Server stack in a variety of roles and industries. Now, as the Business Intelligence Consultant at Purple Frog Systems, he has turned his keyboard to big data solutions in the Microsoft cloud, specialising in Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics, Event Hubs and IoT. Paul is also a STEM Ambassador for the networking education in schools’ programme, PASS chapter leader for the Microsoft Data Platform Group – Birmingham, and a SQLBits, SQL Relay and SQL Saturday speaker and helper. He is currently the Stack Overflow top user for Azure Data Factory, as well as a very active member of the technical community.
Thanks for visiting.
@mrpaulandrew