To set the scene for the title of this blog post, let’s first think about other services within Azure. You’ll probably already know that most services deployed require authentication via some form of connection string and generated key. These keys can be granted various levels of access and also recycled as required, for example for an IoT Event Hub as seen below (my favourite service to play with).

[Screenshot: access levels, keys and connection strings]

Then we have other services like SQL DB that require user credentials to authenticate, as we would expect from the on-premises version of the product. Finally, we have a few other Azure services that handle authentication in a very different way altogether, requiring user credentials initially and then giving us session and token keys to be used by callers. These session and token keys are a lot more fragile than connection strings and can expire or become invalid if the calling service gets rebuilt or redeployed.

In this blog post I’ll explore and demonstrate specifically how we handle session and token based authentication for Azure Data Lake (ADL), firstly when calling it as a Linked Service from Azure Data Factory (ADF), then secondly within ADF custom activities. The latter of these two ADF based operations becomes a little more difficult because the .NET code created and compiled is unfortunately treated as a distant relative to ADF, requiring its own authentication to ADL storage as an Azure Application. To further clarify, a Custom Activity in ADF does not inherit its authorising credentials from the parent Linked Service; it is responsible for its own session/token. Why? Because, as you may know from reading my previous blog post, Custom Activities get compiled into a DLL and then executed by an Azure Batch service, meaning the compute for the .NET code is very much detached from ADF.

At this point I should mention that this applies to both Data Lake Analytics and Data Lake Store. Both require the same approach to authentication.

Data Lake as a Service Within Data Factory

[Screenshot: ADF Author and Deploy button]

The easy one first: adding an Azure Data Lake service to your Data Factory pipeline. From the Azure portal, within the ADF Author and Deploy blade, you simply add a new Data Lake Linked Service, which returns a JSON template for the operation in the right-hand panel. We then kindly get provided with an Authorize button (spelt wrong) at the top of the code block.

Clicking this will pop up the standard Microsoft login screen requesting work or personal user details etc. Upon completion of a successful authentication your Azure subscription will be inspected. If more than one applicable service exists, you’ll of course need to select which one you require authorisation for. Once done, you’ll return to the JSON template, now with a completed Authorization value and SessionId.

[Screenshot: ADF ADL JSON template]

Just for information, and to give you some idea of how this type of authorisation differs from other Azure services: when I performed this task to create the screenshots for this post, the resulting Authorization URL was 1219 characters long and the returned SessionId was 1100! That’s about half a page of a standard Word document each. By comparison, an IoT Hub key is only 44 characters. Furthermore, the two values are customised to the service that requested them and can only be used within the context where they were created.
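To give a rough idea of the shape of the authorised template, here is a minimal sketch of an ADF (version 1) Data Lake Store Linked Service using user credential authentication. The linked service name and every value in angle brackets are placeholders I have made up for illustration; the real authorization and sessionId values are the very long strings generated by the Authorize button and are not reproduced here.

```json
{
    "name": "AzureDataLakeStoreLinkedService",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
            "sessionId": "<session ID generated by the Authorize button>",
            "authorization": "<authorization URL generated by the Authorize button>",
            "subscriptionId": "<subscription of the Data Lake Store>",
            "resourceGroupName": "<resource group of the Data Lake Store>"
        }
    }
}
```

The sessionId and authorization pair is the part that expires or becomes invalid, so those are the two values that need replacing whenever the service has to be authorised again.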

For completeness, because we can now also develop ADF pipelines from Visual Studio, it’s worth knowing that a similar operation is available as part of the Data Factory extension. In Visual Studio, within your ADF project on the Linked Service branch, you can right-click > Add > New Item and choose Data Lake Store or Analytics. You’ll then be taken through a wizard (similar in look to that of the ADF deployment wizard) which requests user details and the ADF context, and returns the same JSON template with populated authorising values.

[Screenshot: adding an ADL service to the ADF project in Visual Studio]

A couple of follow up tips and lessons learnt here:

  • If you tell Visual Studio to reverse engineer your ADF pipeline from a current Azure deployed factory where an existing ADL token and session ID are available, these will not be brought into Visual Studio and you’ll need to authorise the service again.
  • If you copy an ADL JSON template from the Azure portal ‘Author and Deploy’ area, Visual Studio will not pop up the wizard to authorise the service and you’ll need to do it again.
  • If you delete the ADL Linked Service within the portal ‘Author and Deploy’ area, the same Linked Service tokens in Visual Studio will become invalid and you’ll need to authorise the service again.
  • If you sneeze too loudly while Visual Studio is open, you’ll need to authorise the service again.

Do you get the idea of what I meant earlier when I said that the authorisation method is fragile? Very sophisticated, but fragile when chopping and changing things during development.

What you may find yourself doing fairly frequently is:

  1. Deploying an ADF project from Visual Studio.
  2. The deployment wizard failing telling you the ADL tokens have expired or are no longer authorised.
  3. Adding a new Linked Service to the project just to get the user authentication wizard popup.
  4. Then copying the new token and session values into the existing ADL Linked Service JSON file.
  5. Then excluding the new service you created just to re-authorise from the Visual Studio project.

Fun! Moving on.

Update: you can use an Azure AD service principal to authenticate both Azure Data Lake Store and Azure Data Lake Analytics services from ADF. Details are included in the documentation: https://docs.microsoft.com/en-gb/azure/data-factory/v1/data-factory-azure-datalake-connector#azure-data-lake-store-linked-service-properties
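As a minimal sketch of what the service principal option looks like, the Linked Service JSON swaps the session and authorization values for the application ID, its key and the tenant. The linked service name and every value in angle brackets are placeholders I have made up for illustration, so check the property names against the documentation linked above before relying on them.

```json
{
    "name": "AzureDataLakeStoreLinkedService",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
            "servicePrincipalId": "<application ID of the service principal>",
            "servicePrincipalKey": "<key (password) of the service principal>",
            "tenant": "<tenant ID or domain name>",
            "subscriptionId": "<subscription of the Data Lake Store>",
            "resourceGroupName": "<resource group of the Data Lake Store>"
        }
    }
}
```

Because these values are not tied to an interactive login session, they should not need re-authorising each time the Linked Service is copied, redeployed or reverse engineered. As far as I’m aware the Data Lake Analytics Linked Service accepts the same three service principal properties.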

Data Factory Custom Activity Call Data Lake

Next, the slightly more difficult way to authenticate against ADL: using an ADF .NET Custom Activity. As mentioned previously, the .NET code, once sent to Azure as a DLL, is treated as a third party application requiring its own credentials.

The easiest way I’ve found of getting this working is to first use PowerShell to register the application in Azure. Using the correct cmdlets, this returns an application GUID and password which, when combined, give the .NET code its credentials. Here’s the PowerShell you’ll need. Be sure you run this with elevated permissions locally.

# Sign in to Azure.
Add-AzureRmAccount

# Set these variables
$appName = "SomeNameThatYouWillRecogniseInThePortal"
$uri = "AValidURIAlthoughNotApplicableForThis"
$secret = "SomePasswordForTheApplication"

# Create an AAD app
$azureAdApplication = New-AzureRmADApplication `
    -DisplayName $appName `
    -HomePage $uri `
    -IdentifierUris $uri `
    -Password $secret

# Create a Service Principal for the app
$svcprincipal = New-AzureRmADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId

# To avoid a PrincipalNotFound error, I pause here for 15 seconds.
Start-Sleep -s 15

# If you still get a PrincipalNotFound error, then rerun the following until successful. 
$roleassignment = New-AzureRmRoleAssignment `
    -RoleDefinitionName Contributor `
    -ServicePrincipalName $azureAdApplication.ApplicationId.Guid

# The stuff you want:

Write-Output "Copy these values into the C# sample app"

Write-Output "_subscriptionId:" (Get-AzureRmContext).Subscription.SubscriptionId
Write-Output "_tenantId:" (Get-AzureRmContext).Tenant.TenantId
Write-Output "_applicationId:" $azureAdApplication.ApplicationId.Guid
Write-Output "_applicationSecret:" $secret
Write-Output "_environmentName:" (Get-AzureRmContext).Environment.Name

My recommendation here is to take the returned values and store them in something like the Class Library settings, available from the Visual Studio project properties. Don’t store them as constants at the top of your Class, as it’s highly likely you’ll need them multiple times.

Next, what to do with the application GUID etc. Well, in your Custom Activity C# you will need something like the following. Apologies for dumping massive code blocks into this post, but you will need all of this in your Class if you want to use details from your ADF service and work with ADL files.

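using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;
using Microsoft.Azure.Management.DataLake.Store;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;
using Microsoft.Rest.Azure.Authentication;

class SomeCustomActivity : IDotNetActivity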
{
	//Get credentials for app
	string domainName = Settings.Default.AzureDomainName;
	string appId = Settings.Default.ExcelExtractorAppId; //From PowerShell <<<<<
	string appPass = Settings.Default.ExceExtractorAppPass; //From PowerShell <<<<<
	string appName = Settings.Default.ExceExtractorAppName; //From PowerShell <<<<<

	private static DataLakeStoreFileSystemManagementClient adlsFileSystemClient;
	//and or:
	private static DataLakeStoreAccountManagementClient adlsAccountManagerClient;
	
	public IDictionary<string, string> Execute(
		IEnumerable<LinkedService> linkedServices,
		IEnumerable<Dataset> datasets,
		Activity activity,
		IActivityLogger logger)
	{
		//Get linked service details from Data Factory
		Dataset inputDataset = datasets.Single(dataset => 
			dataset.Name == activity.Inputs.Single().Name);
		
		AzureDataLakeStoreLinkedService inputLinkedService;
		
		inputLinkedService = linkedServices.First(
			linkedService =>
			linkedService.Name ==
			inputDataset.Properties.LinkedServiceName).Properties.TypeProperties
			as AzureDataLakeStoreLinkedService;
		
		//Get account name for data lake and create credentials for app
		var creds = AuthenticateAzure(domainName, appId, appPass);
		string accountName = inputLinkedService.AccountName;
		
		//Authorise new instance of Data Lake Store
		adlsFileSystemClient = new DataLakeStoreFileSystemManagementClient(creds);
		
		/*
			DO STUFF...
			
			using (Stream input = adlsFileSystemClient.FileSystem.Open
				(accountName, completeInputPath)
				)
		*/	
		
		
		return new Dictionary<string, string>();
	}
	
	
	private static ServiceClientCredentials AuthenticateAzure
		(string domainName, string clientID, string clientSecret)
	{
		SynchronizationContext.SetSynchronizationContext(new SynchronizationContext());

		var clientCredential = new ClientCredential(clientID, clientSecret);
		return ApplicationTokenProvider.LoginSilentAsync(domainName, clientCredential).Result;
	}
}

Finally, before you execute anything, be sure to grant the Azure app permissions to the respective Data Lake service. In the case of the Data Lake Store, you can use the Data Explorer blades in the portal to assign folder permissions.

[Screenshot: granting ADL folder permissions in the portal]

I really hope this post has saved you some time in figuring out how to authorise Data Lake services from Data Factory, especially when developing beyond what the ADF Copy Wizard gives you.

Many thanks for reading.

