In its natural form, U-SQL does not support recursive operations, and for good reason. This is a big data, scale-out, declarative language where the inclusion of procedural, iterative code would be very unnatural. That said, if you must pervert things, PowerShell can assist with the looping and, dare I say, the possibility of dynamic U-SQL.
A couple of caveats…
- From the outset, I accept that this abstraction with PowerShell to achieve an iterative process in U-SQL is a bit of a hack and very inefficient, certainly in the example below.
- The scenario here is not perfect and was created as a simple situation for the purposes of explanation only. I know the same data results could be achieved just by extending the aggregate grouping!
Hopefully that sets the scene. As I’m writing, I’m wondering if this blog post will be hated by the purists out there or loved by the abstraction hackers. I think I’m both 🙂
Scenario
As with most of my U-SQL blog posts, I’m going to start with the Visual Studio project available as part of the Data Lake Tools extension called ‘U-SQL Sample Application’. This gives us all the basic start-up code and sample data to get going.
Input: within the solution (Samples > Data > Ambulance Data) we have some CSV files for vehicles. These are separated into 16 source datasets covering 4 different vehicle IDs across 4 days.
Output: let’s say we have a requirement to find out the average speed of each vehicle per day. Easy enough with a U-SQL wildcard on the extractor. But we also want to output a single file for each day of data. Not so easy, unless we write one query for each day of data. Fine with samples covering only 4 days; not so fine with 2 years of records split by vehicle.
Scenario set, let’s look at how we might do this.
The U-SQL
To produce the required daily outputs I’m going to use a U-SQL query in isolation to return a distinct list of dates across the 16 input datasets, plus a parameterised U-SQL stored procedure to do the aggregation and output a single day of data.
First, getting the dates. The query below simply returns a text file containing a distinct list of all the dates in our source data.
DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";

@data =
    EXTRACT
        [vehicle_id] int,
        [entry_id] long,
        [event_date] DateTime,
        [latitude] float,
        [longitude] float,
        [speed] int,
        [direction] string,
        [trip_id] int?,
        [filename] string
    FROM
        @InputPath
    USING
        Extractors.Csv();

@DateList =
    SELECT DISTINCT
        [event_date].ToString("yyyyMMdd") AS EventDateList
    FROM
        @data;

OUTPUT @DateList
TO "/output/AmbulanceDataDateList.txt"
USING Outputters.Csv(quoting : false, outputHeader : false);
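For reference, with the sample data the resulting AmbulanceDataDateList.txt should contain one date per row, something like the below (these are the 4 days covered by the sample files).

20140914
20140915
20140916
20140917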
Next, the stored procedure below. This uses the same input files, but does the required aggregation and outputs a daily file matching the parameter passed, giving us a single output.
CREATE PROCEDURE IF NOT EXISTS [dbo].[usp_OutputDailyAvgSpeed]
(
    @OutputDate string
)
AS
BEGIN

    //DECLARE @OutputDate string = "20140914"; //for dev
    DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";
    DECLARE @OutputPath string = "/output/DailyRecords/VehicleAvgSpeed_" + @OutputDate + ".csv";

    @data =
        EXTRACT
            [vehicle_id] int,
            [entry_id] long,
            [event_date] DateTime,
            [latitude] float,
            [longitude] float,
            [speed] int,
            [direction] string,
            [trip_id] int?,
            [filename] string
        FROM
            @InputPath
        USING
            Extractors.Csv();

    @VAvgSpeed =
        SELECT DISTINCT
            [vehicle_id],
            AVG([speed]) AS AverageSpeed
        FROM
            @data
        WHERE
            [event_date].ToString("yyyyMMdd") == @OutputDate
        GROUP BY
            [vehicle_id];

    OUTPUT @VAvgSpeed
    TO @OutputPath
    USING Outputters.Csv(quoting : true, outputHeader : true);

END;
At this point, we could just execute the stored procedure for each required date, manually crafted from the text file, like this:
[dbo].[usp_OutputDailyAvgSpeed]("20140914");
[dbo].[usp_OutputDailyAvgSpeed]("20140915");
[dbo].[usp_OutputDailyAvgSpeed]("20140916");
[dbo].[usp_OutputDailyAvgSpeed]("20140917");
Fine for small amounts of data, but we can do better for larger datasets.
Enter PowerShell and some looping.
The PowerShell
As with all things Microsoft, PowerShell is our friend, and the supporting cmdlets for the Azure Data Lake services are no exception. I recommend these links if you haven’t yet written any PowerShell to control ADL Analytics jobs or upload files to ADL Storage.
- Get started with Azure Data Lake Store using Azure PowerShell
- Get started with Azure Data Lake Analytics using Azure PowerShell
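If you’re starting completely fresh, the prerequisites boil down to logging in and getting the sample data into ADL Storage. Below is a minimal sketch of those steps, assuming the AzureRM modules are installed; the account name and file paths are placeholders, and the sample file name simply follows the Ambulance Data naming seen in the solution.

#Log in to Azure (you'll be prompted for credentials)
Login-AzureRmAccount

#Upload one of the sample CSV files to ADL Storage (placeholder paths)
Import-AzureRmDataLakeStoreItem `
    -AccountName "myfirstdatalakestore01" `
    -Path "C:\SampleData\vehicle1_09142014.csv" `
    -Destination "/Samples/Data/AmbulanceData/vehicle1_09142014.csv"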
Moving on. How can PowerShell help us script our data output requirements? Well, here’s the answer: in my PowerShell script below I’ve done the following.
- Authenticate against my Azure subscription (optionally create yourself a PSCredential to do this).
- Submit the first U-SQL query as a file to return the distinct list of dates.
- Wait for the ADL Analytics job to complete.
- Download the output text file from ADL storage.
- Read the contents of the text file.
- Iterate over each date listed in the text file.
- Submit a U-SQL job for each stored procedure with the date passed from the list.
#Params...
$WhereAmI = $MyInvocation.MyCommand.Path.Replace($MyInvocation.MyCommand.Name,"")
$DLAnalyticsName = "myfirstdatalakeanalysis"
$DLAnalyticsDoP = 10
$DLStoreName = "myfirstdatalakestore01"
#Create Azure Connection
Login-AzureRmAccount | Out-Null
$USQLFile = $WhereAmI + "RecursiveOutputPrep.usql"
$PrepOutput = $WhereAmI + "AmbulanceDataDateList.txt"
#Submit Job
$job = Submit-AzureRmDataLakeAnalyticsJob `
    -Name "GetDateList" `
    -AccountName $DLAnalyticsName `
    -ScriptPath $USQLFile `
    -DegreeOfParallelism $DLAnalyticsDoP
Write-Host "Submitted USQL prep job."
#Wait for job to complete
Wait-AdlJob -Account $DLAnalyticsName -JobId $job.JobId | Out-Null
Write-Host "Downloading USQL output file."
#Download date list
Export-AzureRmDataLakeStoreItem `
    -AccountName $DLStoreName `
    -Path "/output/AmbulanceDataDateList.txt" `
    -Destination $PrepOutput | Out-Null
Write-Host "Downloaded USQL output file."
#Read dates
$Dates = Get-Content $PrepOutput
Write-Host "Read date list."
#Loop over dates with proc call for each
ForEach ($Date in $Dates)
{
    $USQLProcCall = '[dbo].[usp_OutputDailyAvgSpeed]("' + $Date + '");'
    $JobName = 'Output daily avg dataset for ' + $Date
    Write-Host $USQLProcCall

    $job = Submit-AzureRmDataLakeAnalyticsJob `
        -Name $JobName `
        -AccountName $DLAnalyticsName `
        -Script $USQLProcCall `
        -DegreeOfParallelism $DLAnalyticsDoP

    Write-Host "Job submitted for " $Date
}
Write-Host "Script complete. USQL jobs running."
At this point I think it’s worth reminding you of my caveats above 🙂
I would like to point out the flexibility of the PowerShell cmdlet Submit-AzureRmDataLakeAnalyticsJob, allowing us to pass a U-SQL file (step 2) or build up a U-SQL string dynamically within the PowerShell script and pass that as the execution code (step 7). Very handy. The switches are -ScriptPath and -Script respectively.
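To illustrate, the same stored procedure call could be submitted either way. A quick sketch, reusing the variables from the script above and a hypothetical local .usql file path:

#Option 1: U-SQL passed inline as a string
Submit-AzureRmDataLakeAnalyticsJob `
    -Name "AdHocProcCall" `
    -AccountName $DLAnalyticsName `
    -Script '[dbo].[usp_OutputDailyAvgSpeed]("20140914");' `
    -DegreeOfParallelism $DLAnalyticsDoP

#Option 2: the same statement saved to a file first
Submit-AzureRmDataLakeAnalyticsJob `
    -Name "AdHocProcCall" `
    -AccountName $DLAnalyticsName `
    -ScriptPath "C:\USQL\AdHocProcCall.usql" `
    -DegreeOfParallelism $DLAnalyticsDoP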
If all goes well you should have jobs being prepared and shortly after running to produce the daily output files.
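If you want to check on them without going to the portal, the same module can pull back the job list. A sketch, assuming the account variable from the script above:

#List recent jobs for the account and their current state
Get-AzureRmDataLakeAnalyticsJob -Account $DLAnalyticsName |
    Select-Object Name, State, Result |
    Format-Table -AutoSize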
I used 10 AUs for my jobs because I wanted to burn up some old Azure credits, but you can change this via the PowerShell variable $DLAnalyticsDoP.
Conclusion
It’s possible to achieve looping behaviour with U-SQL when we want to produce multiple output files, but only when we abstract the iterative behaviour away to our friend PowerShell.
Comments welcome on this approach.
Many thanks for reading.
PS. To make life a little easier, I’ve stuck all of the above code and sample data into a GitHub repository to save you copying and pasting things from the code windows above.
https://github.com/mrpaulandrew/RecursiveU-SQLWithPowerShell
Thanks for sharing such a nice article!
Could this be possible using Azure Data Factory, not PowerShell? I have a list of employee IDs and I want to iterate over them, but the parameter (employee ID) should be passed from Azure Data Factory, not PowerShell. Does that make sense?
Thanks
Nutan Patel
Hi Nutan, in ADFv1, no. However, in ADFv2 we now have a ‘ForEach’ activity type available, so this is probably possible using that with some expressions and metadata. Cheers, Paul
Hi Paul,
Your blog is so helpful for a novice Azure developer.
I have a question: can I traverse through a directory and read all folder and file names, instead of extracting the complete data?
Please help.
thanks
Sanju
Hi Sanjay, thanks for the comment. My first thought for achieving that would be to use the ADFv2 Lookup activity. With U-SQL a virtual column would also do it, but at the cost of reading the data too. Cheers, Paul
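For anyone wanting the PowerShell route for the same thing, the ADLS cmdlets can list names without reading any data. A sketch, assuming the store variable from the script above:

#List folder and file names under a path, without extracting any data
Get-AzureRmDataLakeStoreChildItem `
    -AccountName $DLStoreName `
    -Path "/Samples/Data/AmbulanceData" |
    Select-Object Name, Type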