In its natural form U-SQL does not support recursive operations and for good reason. This is a big data, scale out, declarative language where the inclusion of procedural, iterative code would be very unnatural. That said, if you must pervert things PowerShell can assist with the looping and dare I say the possibility for dynamic U-SQL.

A couple of caveats…

  • From the outset, I accept this abstraction with PowerShell to achieve some iterative process in U-SQL is a bit of a hack and very inefficient, certainly in the below example.
  • The scenario here is not perfect and created using a simple situation for the purposes of explanation only. I know the same data results could be achieved just by extending the aggregate grouping!

Hopefully that sets the scene. As I’m writing, I’m wondering if this blog post will be hated by the purists out there. Or loved by the abstraction hackers. I think I’m both 🙂

Scenario

As with most of my U-SQL blog posts I’m going to start with the Visual Studio project available as part of the data lake tools extension called ‘U-SQL Sample Application’. This gives us all the basic start up code and sample data to get going.

Input: within the solution (Samples > Data > Ambulance Data) we have some CSV files for vehicles. These are separated into 16 sources datasets covering 4 different vehicle ID’s across 4 days.

Output: let’s say we have a requirement to find out the average speed of each vehicle per day. Easily enough with a U-SQL wildcard on the extractor. But we also want to output a single file for each day of data. Not so easy, unless we write 1 query for each day of data. Fine with samples only covering 4 days, not so fine with 2 years of records split by vehicle.

Scenario set, lets look at how we might do this.

The U-SQL

To produce the required daily outputs I’m going to use a U-SQL query in isolation to return a distinct list of dates across the 16 input datasets, plus a parameterised U-SQL stored procedure to do the aggregation and output a single day of data.

First getting the dates. The below simply returns a text file containing a distinct list of all the dates in our source data.

DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";

@data =
    EXTRACT 
        [vehicle_id] int,
        [entry_id] long,
        [event_date] DateTime,
        [latitude] float,
        [longitude] float,
        [speed] int,
        [direction] string,
        [trip_id] int?,
        [filename] string
    FROM 
        @InputPath
    USING 
        Extractors.Csv();

@DateList =
    SELECT DISTINCT 
        [event_date].ToString("yyyyMMdd") AS EventDateList
    FROM 
        @data;

OUTPUT @DateList
TO "/output/AmbulanceDataDateList.txt"
USING Outputters.Csv(quoting : false, outputHeader : false);

Next, the below stored procedure. This uses the same input files, but does the required aggregation and outputs a daily file matching the parameter passed giving us a sinlge output.

CREATE PROCEDURE IF NOT EXISTS [dbo].[usp_OutputDailyAvgSpeed]
    (
    @OutputDate string
    )
AS
BEGIN
    
    //DECLARE @OutputDate string = "20140914"; //for dev
    DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";
    DECLARE @OutputPath string = "/output/DailyRecords/VehicleAvgSpeed_" + @OutputDate + ".csv";

    @data =
        EXTRACT 
            [vehicle_id] int,
            [entry_id] long,
            [event_date] DateTime,
            [latitude] float,
            [longitude] float,
            [speed] int,
            [direction] string,
            [trip_id] int?,
            [filename] string
        FROM 
            @InputPath
        USING 
            Extractors.Csv();

    @VAvgSpeed =
        SELECT DISTINCT 
            [vehicle_id],
            AVG([speed]) AS AverageSpeed
        FROM 
            @data
        WHERE
            [event_date].ToString("yyyyMMdd") == @OutputDate
        GROUP BY
            [vehicle_id];

    OUTPUT @VAvgSpeed
    TO @OutputPath
    USING Outputters.Csv(quoting : true, outputHeader : true);
    
END;

At this point, we could just execute the stored procedures with each required date, manually crafted from the text file. Like this:

[dbo].[usp_OutputDailyAvgSpeed]("20140914");
[dbo].[usp_OutputDailyAvgSpeed]("20140915");
[dbo].[usp_OutputDailyAvgSpeed]("20140916");
[dbo].[usp_OutputDailyAvgSpeed]("20140917");

Fine, for small amounts of data, but we can do better for larger datasets.

Enter PowerShell and some looping.

The PowerShell

As with all things Microsoft PowerShell is our friend and the supporting cmdlets for the Azure Data Lake services are no exception. I recommend these links if you haven’t yet written some PowerShell to control ADL Analytics jobs or upload files to ADL Storage.

Moving on. How can PowerShell help us script our data output requirements? Well, here’s the answer, in my PowerShell script below I’ve done the following.

  1. Authenticate against my Azure subscription (optionally create yourself a PSCredential to do this).
  2. Submit the first U-SQL query as a file to return the distinct list of dates.
  3. Wait for the ADL Analytics job to complete.
  4. Download the output text file from ADL storage.
  5. Read the contents of the text file.
  6. Iterate over each dates listed in the text file.
  7. Submit a U-SQL job for each stored procedure with the date passed from the list.
#Params...
$WhereAmI = $MyInvocation.MyCommand.Path.Replace($MyInvocation.MyCommand.Name,"")

$DLAnalyticsName = "myfirstdatalakeanalysis" 
$DLAnalyticsDoP = 10
$DLStoreName = "myfirstdatalakestore01"


#Create Azure Connection
Login-AzureRmAccount | Out-Null

$USQLFile = $WhereAmI + "RecursiveOutputPrep.usql"
$PrepOutput = $WhereAmI + "AmbulanceDataDateList.txt"

#Summit Job
$job = Submit-AzureRmDataLakeAnalyticsJob `
    -Name "GetDateList" `
    -AccountName $DLAnalyticsName `
    –ScriptPath $USQLFile `
    -DegreeOfParallelism $DLAnalyticsDoP

Write-Host "Submitted USQL prep job."

#Wait for job to complete
Wait-AdlJob -Account $DLAnalyticsName -JobId $job.JobId | Out-Null

Write-Host "Downloading USQL output file."

#Download date list
Export-AzureRmDataLakeStoreItem `
    -AccountName $DLStoreName `
    -Path $myrootdir\output\AmbulanceDataDateList.csv `
    -Destination $PrepOutput | Out-Null

Write-Host "Downloaded USQL output file."

#Read dates
$Dates = Get-Content $PrepOutput

Write-Host "Read date list."

#Loop over dates with proc call for each
ForEach ($Date in $Dates)
    {
    $USQLProcCall = '[dbo].[usp_OutputDailyAvgSpeed]("' + $Date + '");'
    $JobName = 'Output daily avg dataset for ' + $Date

    Write-Host $USQLProcCall

    $job = Submit-AzureRmDataLakeAnalyticsJob `
        -Name $JobName `
        -AccountName $DLAnalyticsName `
        –Script $USQLProcCall `
        -DegreeOfParallelism $DLAnalyticsDoP

    Write-Host "Job submitted for " $Date
    }
	
Write-Host "Script complete. USQL jobs running."

At this point I think its worth reminding you of my caveats above 🙂

I would like to point out the flexibility in the PowerShell cmdlet Submit-AzureRmDataLakeAnalyticsJob. Allowing us to pass a U-SQL file (step 2) or build up a U-SQL string dynamically within the PowerShell script and pass that as the execution code (step 7), very handy. Switches: -Script or -ScriptPath.

If all goes well you should have jobs being prepared and shortly after running to produce the daily output files.

I used 10 AU’s for my jobs because I wanted to burn up some old Azure credits, but you can change this in the PowerShell variable $DLAnalyticsDoP.

Conclusion

It’s possible to archive looping behaviour with U-SQL when we want to produce multiple output files, but only when we abstract the iterative behaviour away to our friend PowerShell.

Comments welcome on this approach.

Many thanks for reading.

 

Ps. To make life a little easier. I’ve stuck all of the above code and sample data into a GitHub repostiory to save you copy and pasting things from the code windows above.

https://github.com/mrpaulandrew/RecursiveU-SQLWithPowerShell

 

 


Tags: , , ,