
Cognitive Services with U-SQL (Reference Guide)

This post is a reference guide to support an event talk or webinar. The content is intended to assist the audience only. Thank you.

Abstract

Microsoft’s Cognitive Services are basically the best thing since sliced bread, especially for anybody working with data. Artificial intelligence has just been packaged and made available for the masses to download. In this short talk, I’ll take you on a whirlwind tour of how to use these massively powerful libraries directly in Azure Data Lake with that offspring of T-SQL and C# … U-SQL. How do you get hold of the DLLs and how can you wire them up for yourself?… Everything will be revealed, as well as the chance to see what the machines make of the audience!

Links

Helpful Bits

Why U-SQL?

  • U for unified. Unifying T-SQL and C#.
  • U is the next letter after T. T-SQL > U-SQL.
  • U for U-Boat, because Mike Rys dives into his Data Lake with a U-Boat 🙂

Installing the U-SQL samples and extension files in your Data Lake Storage.
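For context, the installer copies the extension files into your Data Lake Store and registers the cognitive assemblies in a U-SQL database so scripts can reference them. A rough sketch of what that registration looks like (the database name matches the script below; the DLL path is illustrative, an assumption rather than taken from the install script):

//Sketch only: register one of the vision assemblies after the files are copied.
//The FROM path is an assumed location; check where the installer put the DLLs.
CREATE DATABASE IF NOT EXISTS CognitiveServices;
USE DATABASE CognitiveServices;
CREATE ASSEMBLY IF NOT EXISTS [ImageCommon] FROM @"/usqlext/assembly/cognition/vision/common/ImageIO.dll";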

The executed code.

USE DATABASE [CognitiveServices];

REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
REFERENCE ASSEMBLY ImageOcr;

//Extract the images, then count and tag the objects on each one
@imgs =
    EXTRACT
        [FileName] string,
        [ImgData] byte[]
    FROM
        @"/Images/{FileName}.jpg"
    USING
        new Cognition.Vision.ImageExtractor();

@imgTags =
    PROCESS
        @imgs
    PRODUCE
        [FileName],
        [NumObjects] int,
        [Tags] string
    READONLY
        [FileName]
    USING
        new Cognition.Vision.ImageTagger();

OUTPUT @imgTags
TO "/output/ImageTags.csv"
USING Outputters.Csv(quoting : true, outputHeader : true);
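The script only exercises the tagger, but the other referenced assemblies follow the same PROCESS pattern. As a hedged sketch, emotion scoring with the ImageEmotion assembly would look something like the below (the processor name and output schema are my assumptions from the sample pattern, so check the installed sample scripts for the exact column list):

@emotions =
    PROCESS
        @imgs
    PRODUCE
        [FileName],
        [NumFaces] int,
        [Emotion] string
    READONLY
        [FileName]
    USING
        new Cognition.Vision.EmotionAnalyzer(); //Assumed processor name

OUTPUT @emotions
TO "/output/ImageEmotions.csv"
USING Outputters.Csv(quoting : true, outputHeader : true);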

 

Recursive U-SQL With PowerShell (U-SQL Looping)

In its natural form, U-SQL does not support recursive operations, and for good reason. This is a big data, scale-out, declarative language where the inclusion of procedural, iterative code would be very unnatural. That said, if you must pervert things, PowerShell can assist with the looping, and dare I say, the possibility of dynamic U-SQL.

A couple of caveats…

  • From the outset, I accept that this abstraction with PowerShell to achieve an iterative process in U-SQL is a bit of a hack, and very inefficient, certainly in the below example.
  • The scenario here is not perfect and was created as a simple situation for the purposes of explanation only. I know the same data results could be achieved just by extending the aggregate grouping!

Hopefully that sets the scene. As I’m writing, I’m wondering if this blog post will be hated by the purists out there, or loved by the abstraction hackers. I think I’m both 🙂

Scenario

As with most of my U-SQL blog posts, I’m going to start with the Visual Studio project available as part of the Data Lake Tools extension, called ‘U-SQL Sample Application’. This gives us all the basic start-up code and sample data to get going.

Input: within the solution (Samples > Data > Ambulance Data) we have some CSV files for vehicles. These are separated into 16 source datasets covering 4 different vehicle IDs across 4 days.

Output: let’s say we have a requirement to find out the average speed of each vehicle per day. Easy enough with a U-SQL wildcard on the extractor. But we also want to output a single file for each day of data. Not so easy, unless we write one query for each day of data. Fine with samples covering only 4 days; not so fine with 2 years of records split by vehicle.

Scenario set, let’s look at how we might do this.

The U-SQL

To produce the required daily outputs, I’m going to use a U-SQL query in isolation to return a distinct list of dates across the 16 input datasets, plus a parameterised U-SQL stored procedure to do the aggregation and output a single day of data.

First, getting the dates. The query below simply returns a text file containing a distinct list of all the dates in our source data.

DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";

@DATA =
    EXTRACT
        [vehicle_id] int,
        [entry_id] long,
        [event_date] DateTime,
        [latitude] float,
        [longitude] float,
        [speed] int,
        [direction] string,
        [trip_id] int?,
        [filename] string
    FROM
        @InputPath
    USING
        Extractors.Csv();

@DateList =
    SELECT DISTINCT
        [event_date].ToString("yyyyMMdd") AS EventDateList
    FROM
        @DATA;

OUTPUT @DateList
TO "/output/AmbulanceDataDateList.txt"
USING Outputters.Csv(quoting : false, outputHeader : false);
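With quoting and headers switched off, the output is just one date per line. For the four-day sample data it would contain something like this:

20140914
20140915
20140916
20140917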

Next, the below stored procedure. This uses the same input files, but does the required aggregation and outputs a daily file matching the parameter passed, giving us a single output per day.

CREATE PROCEDURE IF NOT EXISTS [dbo].[usp_OutputDailyAvgSpeed]
    (
    @OutputDate string
    )
AS
BEGIN

    //DECLARE @OutputDate string = "20140914"; //For dev
    DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";
    DECLARE @OutputPath string = "/output/DailyRecords/VehicleAvgSpeed_" + @OutputDate + ".csv";

    @DATA =
        EXTRACT
            [vehicle_id] int,
            [entry_id] long,
            [event_date] DateTime,
            [latitude] float,
            [longitude] float,
            [speed] int,
            [direction] string,
            [trip_id] int?,
            [filename] string
        FROM
            @InputPath
        USING
            Extractors.Csv();

    @VAvgSpeed =
        SELECT DISTINCT
            [vehicle_id],
            AVG([speed]) AS AverageSpeed
        FROM
            @DATA
        WHERE
            [event_date].ToString("yyyyMMdd") == @OutputDate
        GROUP BY
            [vehicle_id];

    OUTPUT @VAvgSpeed
    TO @OutputPath
    USING Outputters.Csv(quoting : true, outputHeader : true);

END;

At this point, having submitted the above script once to create the stored procedure, we could just execute it for each required date, manually crafted from the text file. Like this:

[dbo].[usp_OutputDailyAvgSpeed]("20140914");
[dbo].[usp_OutputDailyAvgSpeed]("20140915");
[dbo].[usp_OutputDailyAvgSpeed]("20140916");
[dbo].[usp_OutputDailyAvgSpeed]("20140917");

Fine for small amounts of data, but we can do better for larger datasets.

Enter PowerShell and some looping.

The PowerShell

As with all things Microsoft, PowerShell is our friend and the supporting cmdlets for the Azure Data Lake services are no exception. I recommend these links if you haven’t yet written any PowerShell to control ADL Analytics jobs or upload files to ADL Storage.

Moving on. How can PowerShell help us script our data output requirements? Well, here’s the answer: in my PowerShell script below I’ve done the following.

  1. Authenticate against my Azure subscription (optionally create yourself a PSCredential to do this; see the sketch after this list).
  2. Submit the first U-SQL query as a file to return the distinct list of dates.
  3. Wait for the ADL Analytics job to complete.
  4. Download the output text file from ADL storage.
  5. Read the contents of the text file.
  6. Iterate over each date listed in the text file.
  7. Submit a U-SQL job for each stored procedure with the date passed from the list.
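On step 1, if you’d rather avoid the interactive login prompt, here’s a minimal sketch of the PSCredential option (the username and password are placeholders; note this style of login only works for work/school accounts, not personal Microsoft accounts):

#Sketch only: non-interactive sign-in with placeholder credentials
$SecurePassword = ConvertTo-SecureString "myPassword" -AsPlainText -Force
$AzureCredential = New-Object System.Management.Automation.PSCredential("user@mytenant.onmicrosoft.com", $SecurePassword)
Login-AzureRmAccount -Credential $AzureCredential | Out-Null

With that noted, here’s the full script.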
#Params...
$WhereAmI = $MyInvocation.MyCommand.Path.Replace($MyInvocation.MyCommand.Name,"")

$DLAnalyticsName = "myfirstdatalakeanalysis"
$DLAnalyticsDoP = 10
$DLStoreName = "myfirstdatalakestore01"


#Create Azure Connection
Login-AzureRmAccount | Out-Null

$USQLFile = $WhereAmI + "RecursiveOutputPrep.usql"
$PrepOutput = $WhereAmI + "AmbulanceDataDateList.txt"

#Submit job
$job = Submit-AzureRmDataLakeAnalyticsJob `
    -Name "GetDateList" `
    -AccountName $DLAnalyticsName `
    -ScriptPath $USQLFile `
    -DegreeOfParallelism $DLAnalyticsDoP

Write-Host "Submitted USQL prep job."

#Wait for job to complete
Wait-AdlJob -Account $DLAnalyticsName -JobId $job.JobId | Out-Null

Write-Host "Downloading USQL output file."

#Download date list (the path matches the U-SQL OUTPUT above)
Export-AzureRmDataLakeStoreItem `
    -AccountName $DLStoreName `
    -Path "/output/AmbulanceDataDateList.txt" `
    -Destination $PrepOutput | Out-Null

Write-Host "Downloaded USQL output file."

#Read dates
$Dates = Get-Content $PrepOutput

Write-Host "Read date list."

#Loop over dates with a proc call for each
ForEach ($Date in $Dates)
    {
    $USQLProcCall = '[dbo].[usp_OutputDailyAvgSpeed]("' + $Date + '");'
    $JobName = 'Output daily avg dataset for ' + $Date

    Write-Host $USQLProcCall

    $job = Submit-AzureRmDataLakeAnalyticsJob `
        -Name $JobName `
        -AccountName $DLAnalyticsName `
        -Script $USQLProcCall `
        -DegreeOfParallelism $DLAnalyticsDoP

    Write-Host "Job submitted for " $Date
    }

Write-Host "Script complete. USQL jobs running."

At this point I think it’s worth reminding you of my caveats above 🙂

I would like to point out the flexibility of the PowerShell cmdlet Submit-AzureRmDataLakeAnalyticsJob, which allows us to pass a U-SQL file (step 2) or build up a U-SQL string dynamically within the PowerShell script and pass that as the execution code (step 7). Very handy. The switches are -ScriptPath and -Script respectively.
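To illustrate, a quick sketch of the two switches side by side (the account name, job names and file path below are placeholders, not my actual setup):

#Option 1: submit a .usql file from disk with -ScriptPath
Submit-AzureRmDataLakeAnalyticsJob `
    -Name "FromFile" `
    -AccountName "myfirstdatalakeanalysis" `
    -ScriptPath "C:\Scripts\RecursiveOutputPrep.usql" `
    -DegreeOfParallelism 1

#Option 2: build the U-SQL as a string at runtime and submit it with -Script
$USQLString = '[dbo].[usp_OutputDailyAvgSpeed]("20140914");'
Submit-AzureRmDataLakeAnalyticsJob `
    -Name "FromString" `
    -AccountName "myfirstdatalakeanalysis" `
    -Script $USQLString `
    -DegreeOfParallelism 1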

If all goes well you should have jobs being prepared and, shortly after, running to produce the daily output files.

I used 10 AUs for my jobs because I wanted to burn up some old Azure credits, but you can change this in the PowerShell variable $DLAnalyticsDoP.

Conclusion

It’s possible to achieve looping behaviour with U-SQL when we want to produce multiple output files, but only when we abstract the iterative behaviour away to our friend PowerShell.

Comments welcome on this approach.

Many thanks for reading.

 

Ps. To make life a little easier, I’ve stuck all of the above code and sample data into a GitHub repository to save you copying and pasting things from the code windows above.

https://github.com/mrpaulandrew/RecursiveU-SQLWithPowerShell

 

 


Paul’s Frog Blog

Paul is a Microsoft Data Platform MVP with 10+ years’ experience working with the complete on-premises SQL Server stack in a variety of roles and industries. Now, as a Business Intelligence Consultant at Purple Frog Systems, he has turned his keyboard to big data solutions in the Microsoft cloud, specialising in Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics, Event Hubs and IoT. Paul is also a STEM Ambassador for the networking education in schools’ programme, PASS chapter leader for the Microsoft Data Platform Group – Birmingham, and a SQL Bits, SQL Relay and SQL Saturday speaker and helper. He is currently the Stack Overflow top user for Azure Data Factory, as well as a very active member of the technical community.
Thanks for visiting.
@mrpaulandrew