Monthly Archives: December 2016

Writing a U-SQL Merge Statement

Unlike T-SQL, U-SQL does not currently support MERGE statements. Our friend that we have come to know and love since its introduction in SQL Server 2008. Not only that, but U-SQL also doesn’t currently support UPDATE statements either… I know… Open mouth emoji required! This immediately leads to the problem of change detection in our data and how, for example, we should handle the ingestion of a daily rolling 28-day TSV extract, requiring a complete year to date output. Well in this post we will solve that very problem.

Now before we move on it’s worth pointing out that U-SQL is now our/my language of choice for working with big data in Azure using the Data Lake Analytics service. It’s not yet the way of things for our on premises SQL Server databases, so relax. T-SQL or R are still our out-of-the-box tools there (SQL Server 2016). Also if you want to take a step back from this slightly deeper U-SQL topic and find out What is U-SQL first, I can recommend my purple amphibious colleagues blog, link below.

https://www.purplefrogsystems.com/blog/2016/02/what-is-u-sql/

Assuming you’re comfortable with the U-SQL basics let’s move on. For the below examples I’m working with Azure Data Lake (ADL) Analytics, deployed in my Azure MSDN subscription. Although we can do everything here in the local Visual Studio emulator, without the cloud service (very cool and handy for development). I also have the Visual Studio Data Lake tools for the service available and installed. Specifically for this topic I have created a ‘U-SQL Sample Application’ project to get us started. This is simply for ease of explanation and so you can get most of the setup code for what I’m doing here without any great difficulty. Visual Studio Data Lake tools download link below if needed.

https://www.microsoft.com/en-us/download/details.aspx?id=49504

newdatalakevssampleapp

Once we have this solution available including its Ambulance and Search Log samples please find where your Visual Studio Cloud Explorer panel (Ctrl + \, Ctrl + X) is hiding as we’ll use this to access the local Data Lake Analytics database on your machine.

vscloudandsolutionpanelsDatabase Objects

To get things rolling open the U-SQL file from the sample project called ‘SearchLog-4-CreatingTable’ and execute AKA ‘Submit’ this to run locally against your ADL Analytics instance. This gives us a database and target table to work with for the merge operation. It also inserts some tab separated sample data into the table. If we don’t insert this initial dataset you’ll find joining to an empty table will prove troublesome.

Now U-SQL is all about extracting and outputting at scale. There isn’t any syntax sugar to merge datasets. But do we really need the sugar? No. So, we are going to use a database table as our archive or holding area to ensure we get the desired ‘upsert’ behaviour. Then write our MERGE statement long hand using a series of conventional joins. Not syntax sugar. Just good old fashioned joining of datasets to get the old, the new and the changed.

Recap of the scenario; we have a daily rolling 28-day input, requiring a full year to date output.

Merging Data

Next open the U-SQL file from the sample project called ‘SearchLog-1-First_U-SQL_Script’. This is a reasonable template to adapt as it contains the EXTRACT and OUTPUT code blocks already.

For the MERGE we next need a set of three SELECT statements joining both the EXTRACT (new/changed data) and table (old data) together. These are as follows, in mostly English type syntax first 🙂

  • For the UPDATE, we’ll do an INNER JOIN. Table to file. Taking fields from the EXTRACT only.
  • For the INSERT, we’ll do a LEFT OUTER JOIN. File to table. Taking fields from the EXTRACT where NULL in the table.
  • To retain old data, we’ll do a RIGHT OUTER JOIN. File to table. Taking fields from the table where NULL in the file.

Each of the three SELECT statements can then have UNION ALL conditions between them to form a complete dataset including any changed values, new values and old values loaded by a previous file. This is the code you’ll want to add for the example in your open file between the extract and output code blocks. Please don’t just copy and paste without understanding it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
@AllData =
    --update current
    SELECT e1.[UserId],
           e1.[START],
           e1.[Region],
           e1.[Query],
           e1.[Duration],
           e1.[Urls],
           e1.[ClickedUrls]
    FROM [SearchLogDemo].[dbo].[SearchLog] AS t1
         INNER JOIN
             @searchlog AS e1
         ON t1.[UserId] == e1.[UserId]
 
    UNION ALL
 
    --insert new
    SELECT e2.[UserId],
           e2.[START],
           e2.[Region],
           e2.[Query],
           e2.[Duration],
           e2.[Urls],
           e2.[ClickedUrls]
    FROM @searchlog AS e2
         LEFT OUTER JOIN
             [SearchLogDemo].[dbo].[SearchLog] AS t2
         ON t2.[UserId] == e2.[UserId]
    WHERE
    t2.[UserId] IS NULL
 
    UNION ALL
 
    --keep existing
    SELECT t3.[UserId],
           t3.[START],
           t3.[Region],
           t3.[Query],
           t3.[Duration],
           t3.[Urls],
           t3.[ClickedUrls]
    FROM @searchlog AS e3
         RIGHT OUTER JOIN
             [SearchLogDemo].[dbo].[SearchLog] AS t3
         ON t3.[UserId] == e3.[UserId]
    WHERE
    e3.[UserId] IS NULL;

This union of data can then OUTPUT to our usable destination doing what U-SQL does well before resetting our ADL Analytics table for the next load. By reset, I mean TRUNCATE the table and INSERT everything from @AllData back into it. This preserves our history/our old data and allows the MERGE behaviour to work again and again using only SELECT statements.

Replacing the OUTPUT variable from @searchlog, you’ll then want to add the following code below the three SELECT statements.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
OUTPUT @AllData
TO "/output/SearchLogAllData.csv"
USING Outputters.Csv();
 
TRUNCATE TABLE [SearchLogDemo].[dbo].[SearchLog];
 
INSERT INTO [SearchLogDemo].[dbo].[SearchLog]
(
    [UserId],
    [START],
    [Region],
    [Query],
    [Duration],
    [Urls],
    [ClickedUrls]
)
SELECT [UserId],
       [START],
       [Region],
       [Query],
       [Duration],
       [Urls],
       [ClickedUrls]
FROM @AllData;

If all goes well you can edit the ‘SearchLog.tsv’ file removing and changing data and keep rerunning the U-SQL script performing the MERGE behaviour. Please test away. Don’t just believe me that it works. As a bonus you get this pretty job diagram too…

localusqlmergejob

The only caveat here is that we can’t deal with deletion detection from the source file… Unless we do something a little more complex for the current loading period. Lets save that for a later blog post.

A couple of follow up general tips.

  • Have a USE condition at the top of your scripts to ensure you hit the correct database. Just like T-SQL.
  • If your struggling for fields to join on as you don’t have a primary key. You could use UNION instead of UNION ALL. But this of course takes more effort to work out the distinct values. Just like T-SQL.
  • Be careful with C# data types and case sensitivity. U-SQL is not as casual as T-SQL with such things.

That’s it. U-SQL merge behaviour achieved. I guess the bigger lesson here for the many T-SQL people out there is; don’t forget the basics, its still a structured query language. Syntax sugar is sweet, but not essential.

Hope this was helpful.

Many thanks for reading


Azure Data Lake Authentication from Azure Data Factory

rereTo set the scene for the title of this blog post lets firstly think about other services within Azure. You’ll probably already know that most services deployed require authentication via some form of connection string and generated key. These keys can be granted various levels of access and also recycled as required, for example an IoT Event Hub seen below (my favourite service to play with).

levelskeysandconnectionstrings

Then we have other services like SQLDB that require user credentials to authenticate as we would expect from the on premises version of the product. Finally we have a few other Azure services that handle authentication in a very different way altogether requiring both user credentials initially and then giving us session and token keys to be used by callers. These session and token keys are a lot more fragile than connection strings and can expire or become invalid if the calling service gets rebuilt or redeployed.

In this blog post I’ll explore and demonstrate specifically how we handle session and token based authentication for Azure Data Lake (ADL), firstly when calling it as a Linked Service from Azure Data Factory (ADF), then secondly within ADF custom activities. The latter of these two ADF based operations becomes a little more difficult because the .Net code created and compiled is unfortunately treated as a distant relative to ADF requiring its own authentication to ADL storage as an Azure Application. To further clarify a Custom Activity in ADF does not inherit its authorising credentials from the parent Linked Service, it is responsible for its own session/token. Why? Because as you may know from reading my previous blog post; Custom Activates get complied and executed by an Azure Batch Service. Meaning the compute for the .Net code is very much detached from ADF.

At this point I should mention that this applies to Data Lake Analytics and Data Lake Storage. Both require the same approach to authentication.

Data Lake as a Service Within Data Factory

adf-author-and-deploy-buttonThe easy one first, adding an Azure Data Lake service to your Data Factory pipeline. From the Azure portal within the ADF Author and Deploy blade you simply add a new Data Lake Linked Service which returns a JSON template for the operation into the right hand panel. Then we kindly get provided with an Authorize button (spelt wrong) at the top of the code block.

Clicking this will pop up with the standard Microsoft login screen requesting work or personal user details etc. Upon competition or successful authentication your Azure subscription will be inspected. If more than one applicable service exists, you’ll of course need to select which you require authorisation for. But once done you’ll return to the JSON template now with a completed Authorization value and SessionId.

adf-adl-json-templateJust for information and to give you some idea of the differences in this type of authorisation compared to other Azure services. When I performed this task for the purpose of creating screen shots in this post the resulting Authorization URL was 1219 characters long and the returned SessionId was 1100! Or half a page of a standard Word document each. By comparison an IoT Hub key is only 44 characters. Furthermore, the two values are customised to the service that requested them and can only be used within the context where they were created.

For completeness, because we can also now develop ADF pipelines from Visual Studio it’s worth knowing that a similar operation is now available as part of the Data Factory extension. In Visual Studio within your ADF project on the Linked Service branch you are able to Right Click > Add > New Item and choose Data Lake Store or Analytics. You’ll then be taken through a wizard (similar in look to that of the ADF deployment wizard) which requests user details, the ADF context and returns the same JSON template with populated authorising values.

vs-adf-adl-addservice

A couple of follow up tips and lessons learnt here:

  • If you tell Visual Studio to reverse engineer your ADF pipeline from a current Azure deployed factory where an existing ADL token and session ID are available. These will not be brought into Visual Studio and you’ll need to authorise the service again.
  • If you copy an ADL JSON template from the Azure portal ‘Author and Deploy’ area Visual Studio will not popup the wizard to authorise the service and you’ll need to do it again.
  • If you delete the ADL Linked Service within the portal ‘Author and Deploy’ area. The same Linked Service tokens in Visual Studio will become invalid and you’ll need to authorise the service again.
  • If you sneeze to loudly while Visual Studio is open you’ll need to authorise the service again.

Do you get the idea when I said earlier that the authorisation method is fragile? Very sophisticated, but fragile when chopping and changes things during development.

What you may find yourself doing fairly frequently is:

  1. Deploying an ADF project from Visual Studio.
  2. The deployment wizard failing telling you the ADL tokens have expired or are no longer authorised.
  3. Adding a new Linked Service to the project just to get the user authentication wizard popup.
  4. Then copying the new token and session values into the existing ADL Linked Service JSON file.
  5. Then excluding the new services you created just to re-authorise from the Visual Studio project.

Fun! Moving on.

Update: you can use an Azure AD service principal to authenticate both Azure Data Lake Store and Azure Data Lake Analytics services from ADF. Details are included in this post: https://docs.microsoft.com/en-gb/azure/data-factory/v1/data-factory-azure-datalake-connector#azure-data-lake-store-linked-service-properties

Data Factory Custom Activity Call Data Lake

Next the slightly more difficult way to authenticate against ADL, using an ADF .Net Custom Activity. As mentioned previously the .Net code once sent to Azure as a DLL is treated as a third party application requiring its own credentials.

The easiest way I’ve found to getting this working is firstly to use PowerShell to register the application in Azure which using the correct CMDLets returns an application GUID and password which when combined give the .Net code its credentials. Here’s the PowerShell you’ll need below. Be sure you run this with elevated permissions locally.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# Sign in to Azure.
Add-AzureRmAccount
 
#Set this variables
$appName = "SomeNameThatYouWillRegoniseInThePortal"
$uri = "AValidURIAlthoughNotApplicableForThis"
$secret = "SomePasswordForTheApplication"
 
# Create a AAD app
$azureAdApplication = New-AzureRmADApplication `
    -DisplayName $appName `
    -HomePage $Uri `
    -IdentifierUris $Uri `
    -Password $secret
 
# Create a Service Principal for the app
$svcprincipal = New-AzureRmADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId
 
# To avoid a PrincipalNotFound error, I pause here for 15 seconds.
Start-Sleep -s 15
 
# If you still get a PrincipalNotFound error, then rerun the following until successful. 
$roleassignment = New-AzureRmRoleAssignment `
    -RoleDefinitionName Contributor `
    -ServicePrincipalName $azureAdApplication.ApplicationId.Guid
 
# The stuff you want:
 
Write-Output "Copy these values into the C# sample app"
 
Write-Output "_subscriptionId:" (Get-AzureRmContext).Subscription.SubscriptionId
Write-Output "_tenantId:" (Get-AzureRmContext).Tenant.TenantId
Write-Output "_applicationId:" $azureAdApplication.ApplicationId.Guid
Write-Output "_applicationSecret:" $secret
Write-Output "_environmentName:" (Get-AzureRmContext).Environment.Name

My recommendation here is to take the returned values and store that in something like the Class Library settings, available from the Visual Studio project properties. Don’t store them as constants at the top of your Class as its highly likely you’ll need them multiple times.

Next, what to do with the application GUID etc. Well in your Custom Activity C# will need something like the following. Apologies for dumping massive code blocks into this post, but you will need all of this in your Class if you want to use details from your ADF service and work with ADL files.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
class SomeCustomActivity : IDotNetActivity
{
	//Get credentials for app
	string domainName = Settings.Default.AzureDomainName;
	string appId = Settings.Default.ExcelExtractorAppId; //From PowerShell <<<<<
	string appPass = Settings.Default.ExceExtractorAppPass; //From PowerShell <<<<<
	string appName = Settings.Default.ExceExtractorAppName; //From PowerShell <<<<<
 
	private static DataLakeStoreFileSystemManagementClient adlsFileSystemClient;
	//and or:
	private static DataLakeStoreAccountManagementClient adlsAccountManagerClient;
 
	public IDictionary<string, string> Execute(
		IEnumerable linkedServices,
		IEnumerable datasets,
		Activity activity,
		IActivityLogger logger)
	{
		//Get linked service details from Data Factory
		Dataset inputDataset = new Dataset();
		inputDataset = datasets.Single(dataset => 
			dataset.Name == activity.Inputs.Single().Name);
 
		AzureDataLakeStoreLinkedService inputLinkedService;
 
		inputLinkedService = linkedServices.First(
			linkedService =>
			linkedService.Name ==
			inputDataset.Properties.LinkedServiceName).Properties.TypeProperties
			as AzureDataLakeStoreLinkedService;
 
		//Get account name for data lake and create credentials for app
		var creds = AuthenticateAzure(domainName, appId, appPass);
		string accountName = inputLinkedService.AccountName;
 
		//Authorise new instance of Data Lake Store
		adlsFileSystemClient = new DataLakeStoreFileSystemManagementClient(creds);
 
		/*
			DO STUFF...
 
			using (Stream input = adlsFileSystemClient.FileSystem.Open
				(accountName, completeInputPath)
				)
		*/	
 
 
		return new Dictionary<string, string>();
	}
 
 
	private static ServiceClientCredentials AuthenticateAzure
		(string domainName, string clientID, string clientSecret)
	{
		SynchronizationContext.SetSynchronizationContext(new SynchronizationContext());
 
		var clientCredential = new ClientCredential(clientID, clientSecret);
		return ApplicationTokenProvider.LoginSilentAsync(domainName, clientCredential).Result;
	}
}

Finally, before you execute anything be sure to grant the Azure app permissions to the respective Data Lake service. In the case of the Data Lake Store. From the portal you can use the Data Explorer blades to assign folder permissions.

adl-grant-permissions

I really hope this post has saved you some time in figuring out how to authorise Data Lake services from Data Factory. Especially when developing beyond what the ADF Copy Wizard gives you.

Many thanks for reading.


Paul’s Frog Blog

Paul is a Microsoft Data Platform MVP with 10+ years’ experience working with the complete on premises SQL Server stack in a variety of roles and industries. Now as the Business Intelligence Consultant at Purple Frog Systems has turned his keyboard to big data solutions in the Microsoft cloud. Specialising in Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics, Event Hubs and IoT. Paul is also a STEM Ambassador for the networking education in schools’ programme, PASS chapter leader for the Microsoft Data Platform Group – Birmingham, SQL Bits, SQL Relay, SQL Saturday speaker and helper. Currently the Stack Overflow top user for Azure Data Factory. As well as very active member of the technical community.
Thanks for visiting.
@mrpaulandrew