
Recursive U-SQL With PowerShell (U-SQL Looping)

In its natural form U-SQL does not support recursive operations, and for good reason. This is a big data, scale-out, declarative language where the inclusion of procedural, iterative code would be very unnatural. That said, if you must pervert things, PowerShell can assist with the looping and, dare I say, the possibility of dynamic U-SQL.

A couple of caveats…

  • From the outset, I accept that this abstraction with PowerShell to achieve an iterative process in U-SQL is a bit of a hack and very inefficient, certainly in the example below.
  • The scenario here is not perfect and was created from a simple situation for the purposes of explanation only. I know the same data results could be achieved just by extending the aggregate grouping!

Hopefully that sets the scene. As I’m writing, I’m wondering if this blog post will be hated by the purists out there. Or loved by the abstraction hackers. I think I’m both 🙂

Scenario

As with most of my U-SQL blog posts I’m going to start with the Visual Studio project available as part of the Data Lake Tools extension, called ‘U-SQL Sample Application’. This gives us all the basic start-up code and sample data to get going.

Input: within the solution (Samples > Data > Ambulance Data) we have some CSV files for vehicles. These are split into 16 source datasets covering 4 different vehicle IDs across 4 days.

Output: let’s say we have a requirement to find out the average speed of each vehicle per day. Easy enough with a U-SQL wildcard on the extractor. But we also want to output a single file for each day of data. Not so easy, unless we write one query per day of data. Fine with samples covering only 4 days; not so fine with 2 years of records split by vehicle.

Scenario set, let’s look at how we might do this.

The U-SQL

To produce the required daily outputs I’m going to use a U-SQL query in isolation to return a distinct list of dates across the 16 input datasets, plus a parameterised U-SQL stored procedure to do the aggregation and output a single day of data.

First, getting the dates. The below simply returns a text file containing a distinct list of all the dates in our source data.

DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";

@DATA =
    EXTRACT 
        [vehicle_id] int,
        [entry_id] long,
        [event_date] DateTime,
        [latitude] float,
        [longitude] float,
        [speed] int,
        [direction] string,
        [trip_id] int?,
        [filename] string
    FROM 
        @InputPath
    USING 
        Extractors.Csv();

@DateList =
    SELECT DISTINCT 
        [event_date].ToString("yyyyMMdd") AS EventDateList
    FROM 
        @DATA;

OUTPUT @DateList
TO "/output/AmbulanceDataDateList.txt"
USING Outputters.Csv(quoting : false, outputHeader : false);

Next, the below stored procedure. This uses the same input files, but does the required aggregation and outputs a daily file matching the date parameter passed, giving us a single output per day.

CREATE PROCEDURE IF NOT EXISTS [dbo].[usp_OutputDailyAvgSpeed]
    (
    @OutputDate string
    )
AS
BEGIN

    //DECLARE @OutputDate string = "20140914"; //FOR dev
    DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";
    DECLARE @OutputPath string = "/output/DailyRecords/VehicleAvgSpeed_" + @OutputDate + ".csv";

    @DATA =
        EXTRACT 
            [vehicle_id] int,
            [entry_id] long,
            [event_date] DateTime,
            [latitude] float,
            [longitude] float,
            [speed] int,
            [direction] string,
            [trip_id] int?,
            [filename] string
        FROM 
            @InputPath
        USING 
            Extractors.Csv();

    @VAvgSpeed =
        SELECT DISTINCT 
            [vehicle_id],
            AVG([speed]) AS AverageSpeed
        FROM 
            @DATA
        WHERE
            [event_date].ToString("yyyyMMdd") == @OutputDate
        GROUP BY
            [vehicle_id];

    OUTPUT @VAvgSpeed
    TO @OutputPath
    USING Outputters.Csv(quoting : true, outputHeader : true);

END;

At this point, we could just execute the stored procedure with each required date, manually crafted from the text file, like this:

[dbo].[usp_OutputDailyAvgSpeed]("20140914");
[dbo].[usp_OutputDailyAvgSpeed]("20140915");
[dbo].[usp_OutputDailyAvgSpeed]("20140916");
[dbo].[usp_OutputDailyAvgSpeed]("20140917");

Fine for small amounts of data, but we can do better for larger datasets.

Enter PowerShell and some looping.

The PowerShell

As with all things Microsoft, PowerShell is our friend, and the supporting cmdlets for the Azure Data Lake services are no exception. I recommend these links if you haven’t yet written any PowerShell to control ADL Analytics jobs or upload files to ADL Storage.

Moving on. How can PowerShell help us script our data output requirements? Well, here’s the answer. In my PowerShell script below I’ve done the following:

  1. Authenticate against my Azure subscription (optionally create yourself a PSCredential to do this; see the snippet after this list).
  2. Submit the first U-SQL query as a file to return the distinct list of dates.
  3. Wait for the ADL Analytics job to complete.
  4. Download the output text file from ADL storage.
  5. Read the contents of the text file.
  6. Iterate over each date listed in the text file.
  7. Submit a U-SQL job for each stored procedure with the date passed from the list.
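
For step 1, if you want to avoid the interactive login prompt, a PSCredential can be passed to the login cmdlet. A minimal sketch, with a hypothetical account name; note this approach only works for accounts without multi-factor authentication:

#Optional: authenticate with a PSCredential instead of the interactive prompt
$AzureUser = "user@yourtenant.onmicrosoft.com" #hypothetical account name
$AzurePass = ConvertTo-SecureString "YourPasswordHere" -AsPlainText -Force #plain text for brevity only
$AzureCred = New-Object System.Management.Automation.PSCredential($AzureUser, $AzurePass)
Login-AzureRmAccount -Credential $AzureCred | Out-Null

Here’s the full script.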
#Params...
$WhereAmI = $MyInvocation.MyCommand.Path.Replace($MyInvocation.MyCommand.Name,"")
 
$DLAnalyticsName = "myfirstdatalakeanalysis" 
$DLAnalyticsDoP = 10
$DLStoreName = "myfirstdatalakestore01"
 
 
#Create Azure Connection
Login-AzureRmAccount | Out-Null
 
$USQLFile = $WhereAmI + "RecursiveOutputPrep.usql"
$PrepOutput = $WhereAmI + "AmbulanceDataDateList.txt"
 
#Submit job
$job = Submit-AzureRmDataLakeAnalyticsJob `
    -Name "GetDateList" `
    -AccountName $DLAnalyticsName `
    -ScriptPath $USQLFile `
    -DegreeOfParallelism $DLAnalyticsDoP
 
Write-Host "Submitted USQL prep job."
 
#Wait for job to complete
Wait-AdlJob -Account $DLAnalyticsName -JobId $job.JobId | Out-Null
 
Write-Host "Downloading USQL output file."
 
#Download date list
Export-AzureRmDataLakeStoreItem `
    -AccountName $DLStoreName `
    -Path "/output/AmbulanceDataDateList.txt" `
    -Destination $PrepOutput | Out-Null
 
Write-Host "Downloaded USQL output file."
 
#Read dates
$Dates = Get-Content $PrepOutput
 
Write-Host "Read date list."
 
#Loop over dates with proc call for each
ForEach ($Date in $Dates)
    {
    $USQLProcCall = '[dbo].[usp_OutputDailyAvgSpeed]("' + $Date + '");'
    $JobName = 'Output daily avg dataset for ' + $Date
 
    Write-Host $USQLProcCall
 
    $job = Submit-AzureRmDataLakeAnalyticsJob `
        -Name $JobName `
        -AccountName $DLAnalyticsName `
        -Script $USQLProcCall `
        -DegreeOfParallelism $DLAnalyticsDoP
 
    Write-Host "Job submitted for " $Date
    }
 
Write-Host "Script complete. USQL jobs running."

At this point I think it’s worth reminding you of my caveats above 🙂

I would like to point out the flexibility of the PowerShell cmdlet Submit-AzureRmDataLakeAnalyticsJob, which allows us to pass a U-SQL file (step 2) or build up a U-SQL string dynamically within the PowerShell script and pass that as the execution code (step 7). Very handy. The switches are -ScriptPath and -Script respectively.
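
To illustrate the two switches side by side (the file path here is hypothetical):

#From a U-SQL file on disk...
Submit-AzureRmDataLakeAnalyticsJob `
    -Name "FromFile" `
    -AccountName $DLAnalyticsName `
    -ScriptPath "C:\USQL\RecursiveOutputPrep.usql" `
    -DegreeOfParallelism $DLAnalyticsDoP

#...or from a U-SQL string built at runtime
$USQL = '[dbo].[usp_OutputDailyAvgSpeed]("20140914");'
Submit-AzureRmDataLakeAnalyticsJob `
    -Name "FromString" `
    -AccountName $DLAnalyticsName `
    -Script $USQL `
    -DegreeOfParallelism $DLAnalyticsDoP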

If all goes well you should have jobs being prepared and, shortly after, running to produce the daily output files.
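
If you want to check on them from the same session rather than the portal, a quick sketch using the matching Get cmdlet:

#List the daily output jobs and their current state
Get-AzureRmDataLakeAnalyticsJob -Account $DLAnalyticsName |
    Where-Object { $_.Name -like 'Output daily avg dataset*' } |
    Select-Object Name, State, Result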

I used 10 AUs for my jobs because I wanted to burn up some old Azure credits, but you can change this in the PowerShell variable $DLAnalyticsDoP.

Conclusion

It’s possible to achieve looping behaviour with U-SQL when we want to produce multiple output files, but only when we abstract the iterative behaviour away to our friend PowerShell.

Comments welcome on this approach.

Many thanks for reading.

 

P.S. To make life a little easier, I’ve stuck all of the above code and sample data into a GitHub repository to save you copying and pasting things from the code windows above.

https://github.com/mrpaulandrew/RecursiveU-SQLWithPowerShell

 

 


Using Hyper-V and PowerShell to Create the Perfect Developer Workstation

So the second challenge I faced after unboxing and plugging in my new workstation at Purple Frog Systems was software. Having customers using all versions of SQL Server from 2005 onwards, I didn’t want to tie myself to a particular version locally, and I also didn’t want the hassle of running lots of different SQL Server services all on the same host operating system. Especially if I wanted to use Windows 10 as my host, which as we know has compatibility issues with SQL Server 2005 and our old friend BIDS.

[Image: Turn Windows Features on or off – Hyper-V]

Enter Microsoft Hyper-V

In Windows 8 onwards, Hyper-V is available out of the box as a feature which can simply be switched on from Control Panel > Programs and Features > Turn Windows Features On or Off. Apologies in advance if the home editions don’t support this.

Note: be sure to select the Hyper-V Module for Windows PowerShell.
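
If you’d rather script the feature install than click through Control Panel, something like this from an elevated PowerShell session should do it (a sketch; Microsoft-Hyper-V-All includes the management tools and the PowerShell module):

#Run elevated on the Windows host, then reboot when convenient
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All -NoRestart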

With my Windows 10 host now running Hyper-V management services, I set about creating a bunch of virtual machines to serve up every version of SQL Server required.

Whenever you start the Hyper-V manager, always do so as an administrator. The user interface throws some odd errors without the elevated permissions, which could lead you down a rabbit hole of Google searches. Trust me!

Creating a Virtual Switch

For those of you new to Hyper-V, creating a virtual switch for your guest virtual machines is one of the most important things to get right and sort out first. It’s very simple to do, but without it your network admins and DHCP police might start questioning what on earth your physical network connection is doing. Again, trust me on this!

[Image: Virtual Switch Manager – creating an external switch]

From your Hyper-V Manager toolbar go to Action > Virtual Switch Manager. In the dialogue window that appears, choose the switch type of External in the right hand panel and click Create Virtual Switch. Give the switch a name and select the External Network that the new switch will basically impersonate. This will be a short list if you only have one physical network connection.
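
If you prefer to script the switch as well, a sketch with the Hyper-V PowerShell module (the adapter name is whatever Get-NetAdapter reports on your host):

#Find your physical adapter name first
Get-NetAdapter

#Create the external switch bound to that adapter, keeping host connectivity
New-VMSwitch -Name "ExternalSwitch" -NetAdapterName "Ethernet" -AllowManagementOS $true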

 

With an external virtual switch in place, any guest machine set up to use it as a network resource will appear on your network in the same way your physical host does: its own MAC address, IP address, hostname, etc. Also be mindful that Hyper-V will take over how your network connections appear in your host operating system, and a CMD ipconfig /all will start to look a little more complex, but this is nothing to worry about and perfectly normal.

Creating the Guest Virtual Machines

To offer some repeatability when creating my guest virtual machines I broke out the PowerShell, hence including the module at install time. With a few parameterised cmdlets I was able to set up several empty virtual machines ready for my guest operating systems. Example below.

#Create Hyper-V machine
New-VM `
    -Name $VMName `
    -Path $VMLocation `
    -MemoryStartupBytes $RAM `
    -NewVHDPath $VHDOSFile `
    -NewVHDSizeBytes $VHDSizes `
    -SwitchName $NetworkSwitch `
    -Generation $Generation

#Set number of CPUs
Set-VMProcessor `
    -VMName $VMName `
    -Count $CPUs

#Create data disk
New-VHD `
    -Path $VHDDataFile -Dynamic `
    -SizeBytes $VHDSizes

Download the full script I used here.

Note: I’m not a PowerShell expert, just a big fan of its uses.

You’ll see in the full script that I give my virtual machines 2x hard drive files each, on different controllers (IDE and SCSI), which in turn are created on different host hard drives, mainly because my host has 3x physical disks. My original intention was to have all operating systems on mechanical disks and all data on solid state disks, so for guest SQL Server installs the virtual machine data disk would be used to house databases attached to the instance. In practice this didn’t really pay off, but then this is just another advantage of using Hyper-V: the underlying resources can be moved around as required.
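
Note the snippet above creates the data disk file but doesn’t attach it. A one-line sketch for that step, reusing the same variables:

#Attach the data disk file to the guest's SCSI controller
Add-VMHardDiskDrive -VMName $VMName -ControllerType SCSI -Path $VHDDataFile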

If you have time I would recommend having your ISOs on a separate physical disk to the guest hard drive files. This greatly speeds up installation. You could even run the guest OS hard drive files on your solid state drive just for installation, then move them to the mechanical disks afterwards.

The Guests

Assuming you now have a bunch of virtual machine shells in place, I’d recommend the following pairings of operating system and SQL Server version, mainly to avoid compatibility issues.

  • SQL Server 2005 – Windows Server 2003
  • SQL Server 2008 – Windows Server 2008
  • SQL Server 2012 – Windows Server 2012
  • SQL Server 2014 – Windows Server 2012
  • SQL Server 2016 – Windows Server 2016 (assuming the OS is released in time)

I’m not going to talk about installing Windows and SQL Server on each guest. I’m assuming you’re familiar with that Next, Next, Next process (if you want the defaults!). However, I would say that for SQL Server 2012 onwards I did the Analysis Services install twice, to give me both Tabular and Multidimensional services in my guest virtual machine.

The Goal

If you’ve worked through this post and set up your host in the same way I have, you’ll now be able to enjoy the advantages and flexibility of running all versions of SQL Server at the same time, with all SQL services in guest virtual machines on your developer workstation.

[Image: Hyper-V Manager – virtual machine list]

My virtual machines are also set up with dynamic RAM (applied via Set-VMMemory in the sketch after this list):

  • Initial value: 2048 MB
  • Min value: 512 MB
  • Max value: 8192 MB
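
Applied with PowerShell, those values (plus the memory weight discussed next) look something like this sketch:

#Configure dynamic memory for a guest
Set-VMMemory `
    -VMName $VMName `
    -DynamicMemoryEnabled $true `
    -StartupBytes 2048MB `
    -MinimumBytes 512MB `
    -MaximumBytes 8192MB `
    -Priority 50 #0-100; this is the memory weight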

The Memory Weight slider is great if you want to keep all your guest virtual machines running at the same time like I do. If doing some development work on SQL Server 2005, I simply increase that guest’s RAM priority, which dynamically adjusts the less used virtual machines so the host doesn’t get overwhelmed. Plus you’ll only be developing on one version of SQL Server at once, right?!

[Image: Hyper-V dynamic memory – Memory Weight slider]

The last thing you’ll probably want to do is install some SQL Server client tools on your Windows 10 host. I went for SQL Server 2012 plus Visual Studio 2012, then SQL Server 2014 plus Visual Studio 2015, installed in that order.

Final Words

Please be aware that the above was only possible with a Microsoft MSDN subscription, which provided licensing for the guest operating systems and older versions of SQL Server.

During this setup I also had the advantage of being a domain admin, which meant creating things like service accounts for all my virtual machines was easy; adding my virtual machines to the domain and accessing SQL Server services from my host using AD authentication were made easy too. If you’re not a DA, SQL authentication and local workgroups are OK, but have their limits for SSAS.

[Image: Remote Desktop Connection Manager – adding a machine with a VM ID]

You may want to try using Remote Desktop Connection Manager to access your guest operating systems. In the latest version a VM ID can be included, giving console style access without needing to connect to each guest from the Hyper-V manager.

VM IDs can be found with the following bit of PowerShell.

Get-VMMemory -VMName "PF*" | Select VMName, ID | Sort-Object VMName
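
Alternatively (a small sketch), Get-VM exposes the same identifier:

Get-VM -Name "PF*" | Select-Object Name, VMId | Sort-Object Name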

Very last thing: I mentioned my host machine has 3x physical hard drives, the third of which is a huge 4 TB block. To keep my guest operating systems fairly customer independent, you’ll have seen I only gave the standard virtual hard drives 100 GB of space each in the PowerShell. What I then do is give a guest an additional, much larger virtual disk which resides on the host’s big data volume. Work is done there. Then, post project completion, this customer-specific virtual data disk can just be dropped/archived/moved away and the guest machine is ready for the next work item. It’s of course a management overhead, but it helps keep things clean.
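
A sketch of that workflow, with hypothetical machine and path names:

#Create a large customer-specific data disk on the big 4 TB volume
New-VHD -Path "E:\VHDs\PFVM01_CustomerX_Data.vhdx" -Dynamic -SizeBytes 1TB

#Attach it to the guest for the duration of the project
Add-VMHardDiskDrive -VMName "PFVM01" -ControllerType SCSI -Path "E:\VHDs\PFVM01_CustomerX_Data.vhdx"

#Post project: detach (adjust controller number/location to suit) and archive the file
Remove-VMHardDiskDrive -VMName "PFVM01" -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 1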

Many thanks for reading.

