Azure Data Lake

Business Intelligence in Azure – SQLBits 2018 Precon

What can you expect from my SQLBits pre-conference training day in February 2018 at the London Olympia?

Well my friends, in short, we are going to take a whirlwind tour of the entire business intelligence stack of services in Azure. No stone will be left unturned. No service will be left without scalability. We’ll cover them all, and we certainly aren’t going to check with the Azure bill payer before turning up the compute on our data transforms.



What will we actually cover?

With new cloud services and advancements in locally hosted platforms, developing a lambda architecture is becoming the new normal. In this full day of high-level training we’ll learn how to architect hybrid business intelligence solutions using Microsoft Azure offerings. We’ll explore the roles of these cloud data services and how to make them work for you in this complete overview of business intelligence on the Microsoft cloud data platform.

Here’s how we’ll break that down during the day…

Module 1 – Getting Started with Azure

Using platform-as-a-service products is great, but let’s take a step back. To kick off we’ll cover the basics for deploying and managing your Azure services. Navigating the Azure portal and building dashboards isn’t always as intuitive as we’d like. What’s a resource group? And why is it important to understand your Azure Active Directory tenant?

Module 2 – An Overview of BI in Azure

What’s available for the business intelligence architect in the cloud, and how might these services relate to traditional on-premises ETL and cube data flows? Is ETL enough for our big unstructured data sources, or do we need to mix things up and add some more letters to the acronym in the cloud?

Module 3 – Databases in Azure (SQL DB, SQL DW, Cosmos DB, SQL MI)

It’s SQL Server Jim, but not as we know it. Check out the PaaS flavours of our long-term on-premises friends. Can we trade the agent and an operating system for that sliding bar of scalable compute? DTU and DWU are here to stay, with new SLAs relating to throughput. Who’s on ACID, and as BI people do we care?

Module 4 – The Azure Machines are here to Learn

Data scientist or developer? Azure Machine Learning was designed for applied machine learning. Use best-in-class algorithms in a simple drag-and-drop interface. We’ll go from idea to deployment in a matter of clicks. Without a terminator in sight!

Module 5 – Swimming in the Data Lake with U-SQL

Let’s understand the role of this hyper-scale two tier big data technology and how to harness its power with U-SQL, the offspring of T-SQL and C#. We’ll cover everything you need to know to get started developing solutions with Azure Data Lake.

Module 6 – IoT, Event Hubs and Azure Stream Analytics

Real-time data is everywhere. We need to use it and unlock it as a rich source of information that can be channelled to react to events, produce alerts from sensor values or in 9000 other scenarios. In this module, we’ll learn how, using Azure messaging hubs and Azure Stream Analytics.

Module 7 – Power BI, our Semantic Layer, is it All Things to All People?

Power BI combines all our data sources in one place with rich visuals and a flexible data modelling tool. It takes it all: small data, big data, streaming data, website content and more. But we really need a Venn diagram to decide when/where it’s needed.

Module 8 – Data Integration with Azure Data Factory and SSIS

The new integration runtime is here. But how do we unlock the scale out potential of our control flow and data flow? Let’s learn to create the perfect dependency driven pipeline for our data flows. Plus, how to work with the Azure Batch Service should you need that extensibility.

 

Finally, we’ll wrap up the day by playing the Azure icon game, which you’ll all now be familiar with and able to complete with a perfect score having completed this training day 🙂

Many thanks for reading and I hope to see you in February, it’s going to be magic 😉

Register now: https://www.regonline.com/registration/Checkin.aspx?EventID=2023328

All training day content is subject to change, dependent on timings and the will of the demo gods!


 

Azure Data Lake – The Services. The U-SQL. The C# (Reference Guide)

This post is a reference guide to support an event talk or webinar. The content is intended to assist the audience only. Thank you.

Abstract

How do we implement Azure Data Lake? How does a lake fit into our data platform architecture? Is Data Lake going to run in isolation or be part of a larger pipeline? How do we use and work with U-SQL? Does size matter?! The answers to all these questions and more in this session as we immerse ourselves in the lake that’s in a cloud. We’ll take an end-to-end look at the components and understand why the compute and storage are separate services. For the developers, what tools should we be using and where should we deploy our U-SQL scripts? Also, what options are available for handling our C# code behind and supporting assemblies? We’ll cover everything you need to know to get started developing data solutions with Azure Data Lake. Finally, let’s extend the U-SQL capabilities with the Microsoft Cognitive Services!

Links

What is Azure Data Lake? The Microsoft version.
https://azure.microsoft.com/en-gb/solutions/data-lake/

Understanding the ADL Analytics Unit
https://blogs.msdn.microsoft.com/azuredatalake/2016/10/12/understanding-adl-analytics-unit/

Why use Azure Data Lake? The Microsoft version.
https://azure.microsoft.com/en-gb/solutions/data-lake/

Consuming Data Lake with Power BI – Cross-tenant data refreshes.
https://www.purplefrogsystems.com/paul/2017/06/connecting-power-bi-to-azure-data-lake-store-across-tenants/

U-SQL String Data Type 128KB Limit
https://feedback.azure.com/forums/327234-data-lake/suggestions/13416093-usql-string-data-type-has-a-size-limit-of-128kb

Creating a U-SQL Merge Statement
https://www.purplefrogsystems.com/paul/2016/12/writing-a-u-sql-merge-statement/

U-SQL Looping
https://www.purplefrogsystems.com/paul/2017/05/recursive-u-sql-with-powershell-u-sql-looping/

U-SQL Date Dimension
https://www.purplefrogsystems.com/paul/2017/02/creating-a-u-sql-date-dimension-numbers-table-in-azure-data-lake/

Further Reading

Microsoft Blog – An Introduction to U-SQL in Azure Data Lake
https://blogs.msdn.microsoft.com/robinlester/2016/01/04/an-introduction-to-u-sql-in-azure-data-lake/

Microsoft Documentation – U-SQL Programmability Guide
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide

Microsoft MSDN – U-SQL Language Reference
https://msdn.microsoft.com/en-US/library/azure/mt591959(Azure.100).aspx

SQL Server Central – Stairway to U-SQL
http://www.sqlservercentral.com/stairway/142480/

Stack Overflow – U-SQL Tag
http://stackoverflow.com/questions/tagged/u-sql

 

Cognitive services with U-SQL in Azure Data Lake

https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive


Connecting PowerBI.com to Azure Data Lake Store – Across Tenants

Welcome readers, this is a post to define a problem that shouldn’t exist, but sadly does, and given its relative complexity I think warrants some explanation. Plus, I’ve included details of what you can currently do if you encounter it.

First some background…

Power BI Desktop

With the recent update to the Power BI desktop application we now find the Azure Data Lake Store connector has finally relinquished its ‘(Beta)’ status and is considered GA. This is good news, but doesn’t make any difference to those of us that have already been storing our outputs as files in Data Lake Store.

The connector, as before, can be supplied with any Data Lake Store adl:// URL and set of credentials using the desktop Power BI application. Our local machines are of course external to the concept and context of a Microsoft cloud tenant and directory. To reiterate, this means any Data Lake Store anywhere can be queried and datasets refreshed using local tools. It doesn’t even matter about Personal vs Work/School accounts.

This hopefully sets the scene for this post and starts to allude to the problem you’re likely to encounter if you want to use your developed visuals beyond your own computer.

PowerBI.com

In this scenario, we’ve developed our Power BI workbook in the desktop product and hit publish. Armed with a valid Office 365/Power BI account the visuals, initial working data, model, measures and connection details for the data source get transferred to the web service version of Power BI, known as PowerBI.com. So far so good.

Next, you want to share and automatically refresh the data, meaning your audience have the latest data at the point of viewing, given a reasonable schedule.

Sharing, no problem at all, assuming you understand the latest Power BI Premium/Free apps, packs, workspace licencing stuff!… A post for another time. Maybe.

Automatic dataset refreshes, not so simple. This expects several ducks to all be lined up exactly. By ducks I mean your Azure Subscription and Office 365 tenant. If they aren’t, and one little ducky has strayed from the pack/group/herd (what’s a collection of ducks?), this is what you’ll encounter.

Failed to update data source credentials: The credentials provided for the DataLake source are invalid.

Now this error is also misleading because the problem is not invalid credentials on the face of it. A better error message would say invalid credentials for the tenant of the target data source.

Problem Overview

As most systems and environments evolve, it’s common (given the experience of several customers) to accidentally create a disconnection between your Azure Subscription and your Office 365 environment. This may result in each group of services residing in different directories or tenants.


In this case the disconnection means you will not be able to authenticate your PowerBI.com datasets against your Azure Data Lake Store, preventing that very important scheduled data refresh.
Coming back to the title of this blog post:

You cannot currently authenticate against an Azure Data Lake Store from PowerBI.com across tenants.

What To Do

Once you’ve finished cursing, considering everything you’ve developed over the last 6 months in your Azure Subscription, take a breath. Unfortunately, the only long-term thing you can do is set up a new Azure Subscription and make damn sure that it’s linked to your Office 365 tenant, and thus residing in the same directory. Then migrate your Data Lake Store to the new subscription.


Once these ducks are in line the credentials you supply to PowerBI.com for the dataset refresh will be accepted. I promise. I’ve done it.

A short-term workaround is to refresh your datasets in the desktop app every day and republish new versions. Very manual. Sorry to be the bearer of bad news.

What Next

Well my friends, I recommend that we strongly petition Microsoft to lift this restriction. I say restriction because it seems like madness. After all, the PowerBI.com connector to Azure Data Lake is using OAuth2, so what’s the problem? Furthermore, back in Power BI Desktop land we can connect to any storage with any credentials. We can even create a Power BI workbook joining two Data Lake Stores with two different sets of credentials (handy if you have a partial data model in production and new outputs in a test environment).

Here is my attempt to get things changed and I’d appreciate your votes.

https://office365.uservoice.com/forums/264636-general/suggestions/17841250-allow-powerbi-com-to-connect-directly-to-azure-dat

To conclude, I really want this blog post to get an update soon with some better news given the above. But for now, I hope it helped you understand the potential problem you’re facing, or raises your awareness of a future problem you’re likely to encounter.

Many thanks for reading.


Recursive U-SQL With PowerShell (U-SQL Looping)

In its natural form U-SQL does not support recursive operations and for good reason. This is a big data, scale out, declarative language where the inclusion of procedural, iterative code would be very unnatural. That said, if you must pervert things PowerShell can assist with the looping and dare I say the possibility for dynamic U-SQL.

A couple of caveats…

  • From the outset, I accept this abstraction with PowerShell to achieve some iterative process in U-SQL is a bit of a hack and very inefficient, certainly in the below example.
  • The scenario here is not perfect and created using a simple situation for the purposes of explanation only. I know the same data results could be achieved just by extending the aggregate grouping!

Hopefully that sets the scene. As I’m writing, I’m wondering if this blog post will be hated by the purists out there. Or loved by the abstraction hackers. I think I’m both 🙂

Scenario

As with most of my U-SQL blog posts I’m going to start with the Visual Studio project available as part of the data lake tools extension called ‘U-SQL Sample Application’. This gives us all the basic start up code and sample data to get going.

Input: within the solution (Samples > Data > Ambulance Data) we have some CSV files for vehicles. These are separated into 16 source datasets covering 4 different vehicle IDs across 4 days.

Output: let’s say we have a requirement to find out the average speed of each vehicle per day. Easy enough with a U-SQL wildcard on the extractor. But we also want to output a single file for each day of data. Not so easy, unless we write one query for each day of data. Fine with samples only covering 4 days, not so fine with 2 years of records split by vehicle.

Scenario set, let’s look at how we might do this.

The U-SQL

To produce the required daily outputs I’m going to use a U-SQL query in isolation to return a distinct list of dates across the 16 input datasets, plus a parameterised U-SQL stored procedure to do the aggregation and output a single day of data.

First, getting the dates. The query below simply returns a text file containing a distinct list of all the dates in our source data.

DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";
 
@DATA =
    EXTRACT 
        [vehicle_id] INT,
        [entry_id] long,
        [event_date] DateTime,
        [latitude] FLOAT,
        [longitude] FLOAT,
        [speed] INT,
        [direction] string,
        [trip_id] INT?,
        [filename] string
    FROM 
        @InputPath
    USING 
        Extractors.Csv();
 
@DateList =
    SELECT DISTINCT 
        [event_date].ToString("yyyyMMdd") AS EventDateList
    FROM 
        @DATA;
 
OUTPUT @DateList
TO "/output/AmbulanceDataDateList.txt"
USING Outputters.Csv(quoting : FALSE, outputHeader : FALSE);

Next, the below stored procedure. This uses the same input files, but does the required aggregation and outputs a daily file matching the parameter passed, giving us a single output.

CREATE PROCEDURE IF NOT EXISTS [dbo].[usp_OutputDailyAvgSpeed]
    (
    @OutputDate string
    )
AS
BEGIN
 
    //DECLARE @OutputDate string = "20140914"; //FOR dev
    DECLARE @InputPath string = "/Samples/Data/AmbulanceData/{filename}";
    DECLARE @OutputPath string = "/output/DailyRecords/VehicleAvgSpeed_" + @OutputDate + ".csv";
 
    @DATA =
        EXTRACT 
            [vehicle_id] INT,
            [entry_id] long,
            [event_date] DateTime,
            [latitude] FLOAT,
            [longitude] FLOAT,
            [speed] INT,
            [direction] string,
            [trip_id] INT?,
            [filename] string
        FROM 
            @InputPath
        USING 
            Extractors.Csv();
 
    @VAvgSpeed =
        SELECT DISTINCT 
            [vehicle_id],
            AVG([speed]) AS AverageSpeed
        FROM 
            @DATA
        WHERE
            [event_date].ToString("yyyyMMdd") == @OutputDate
        GROUP BY
            [vehicle_id];
 
    OUTPUT @VAvgSpeed
    TO @OutputPath
    USING Outputters.Csv(quoting : TRUE, outputHeader : TRUE);
 
END;

At this point, we could just execute the stored procedure with each required date, manually crafted from the text file. Like this:

[dbo].[usp_OutputDailyAvgSpeed]("20140914");
[dbo].[usp_OutputDailyAvgSpeed]("20140915");
[dbo].[usp_OutputDailyAvgSpeed]("20140916");
[dbo].[usp_OutputDailyAvgSpeed]("20140917");

Fine, for small amounts of data, but we can do better for larger datasets.

Enter PowerShell and some looping.

The PowerShell

As with all things Microsoft, PowerShell is our friend and the supporting cmdlets for the Azure Data Lake services are no exception. I recommend these links if you haven’t yet written some PowerShell to control ADL Analytics jobs or upload files to ADL Storage.

Moving on. How can PowerShell help us script our data output requirements? Well, here’s the answer, in my PowerShell script below I’ve done the following.

  1. Authenticate against my Azure subscription (optionally create yourself a PSCredential to do this).
  2. Submit the first U-SQL query as a file to return the distinct list of dates.
  3. Wait for the ADL Analytics job to complete.
  4. Download the output text file from ADL storage.
  5. Read the contents of the text file.
  6. Iterate over each date listed in the text file.
  7. Submit a U-SQL job for each stored procedure with the date passed from the list.
#Params...
$WhereAmI = $MyInvocation.MyCommand.Path.Replace($MyInvocation.MyCommand.Name,"")
 
$DLAnalyticsName = "myfirstdatalakeanalysis" 
$DLAnalyticsDoP = 10
$DLStoreName = "myfirstdatalakestore01"
 
 
#Create Azure Connection
Login-AzureRmAccount | Out-Null
 
$USQLFile = $WhereAmI + "RecursiveOutputPrep.usql"
$PrepOutput = $WhereAmI + "AmbulanceDataDateList.txt"
 
#Submit Job
$job = Submit-AzureRmDataLakeAnalyticsJob `
    -Name "GetDateList" `
    -AccountName $DLAnalyticsName `
    –ScriptPath $USQLFile `
    -DegreeOfParallelism $DLAnalyticsDoP
 
Write-Host "Submitted USQL prep job."
 
#Wait for job to complete
Wait-AdlJob -Account $DLAnalyticsName -JobId $job.JobId | Out-Null
 
Write-Host "Downloading USQL output file."
 
#Download date list
Export-AzureRmDataLakeStoreItem `
    -AccountName $DLStoreName `
    -Path "/output/AmbulanceDataDateList.txt" `
    -Destination $PrepOutput | Out-Null
 
Write-Host "Downloaded USQL output file."
 
#Read dates
$Dates = Get-Content $PrepOutput
 
Write-Host "Read date list."
 
#Loop over dates with proc call for each
ForEach ($Date in $Dates)
    {
    $USQLProcCall = '[dbo].[usp_OutputDailyAvgSpeed]("' + $Date + '");'
    $JobName = 'Output daily avg dataset for ' + $Date
 
    Write-Host $USQLProcCall
 
    $job = Submit-AzureRmDataLakeAnalyticsJob `
        -Name $JobName `
        -AccountName $DLAnalyticsName `
        –Script $USQLProcCall `
        -DegreeOfParallelism $DLAnalyticsDoP
 
    Write-Host "Job submitted for " $Date
    }
 
Write-Host "Script complete. USQL jobs running."

At this point I think it’s worth reminding you of my caveats above 🙂

I would like to point out the flexibility in the PowerShell cmdlet Submit-AzureRmDataLakeAnalyticsJob, allowing us to pass a U-SQL file (step 2) or build up a U-SQL string dynamically within the PowerShell script and pass that as the execution code (step 7). Very handy. Switches: -Script or -ScriptPath.
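
To show the two switches side by side, here’s a minimal sketch reusing the variables from the script above (the job names are just examples):

#Submit a job from a U-SQL file on disk...
Submit-AzureRmDataLakeAnalyticsJob `
    -Name "FromFile" `
    -AccountName $DLAnalyticsName `
    -ScriptPath $USQLFile `
    -DegreeOfParallelism $DLAnalyticsDoP

#...or submit a U-SQL string built up dynamically.
Submit-AzureRmDataLakeAnalyticsJob `
    -Name "FromString" `
    -AccountName $DLAnalyticsName `
    -Script '[dbo].[usp_OutputDailyAvgSpeed]("20140914");' `
    -DegreeOfParallelism $DLAnalyticsDoP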

If all goes well you should have jobs being prepared and shortly after running to produce the daily output files.

I used 10 AU’s for my jobs because I wanted to burn up some old Azure credits, but you can change this in the PowerShell variable $DLAnalyticsDoP.

Conclusion

It’s possible to achieve looping behaviour with U-SQL when we want to produce multiple output files, but only when we abstract the iterative behaviour away to our friend PowerShell.

Comments welcome on this approach.

Many thanks for reading.

 

Ps. To make life a little easier, I’ve stuck all of the above code and sample data into a GitHub repository to save you copying and pasting things from the code windows above.

https://github.com/mrpaulandrew/RecursiveU-SQLWithPowerShell

 

 


Calling U-SQL Stored Procedures with C# Code Behind

So friends, some more lessons learnt when developing with U-SQL and Azure Data Lake. I’ll try and keep this short.

Problem

You have a U-SQL stored procedure written and working fine within your Azure Data Lake Analytics service. But we need to add some more business logic or something requiring a little C# magic. This is the main thing I love about U-SQL, having that C# code behind file where I can extend my normal SQL behaviour. So, being a happy little developer you write your class and method to support the U-SQL above and you recreate your stored procedure. Great!

Next, you try to run that stored procedure…

[ExampleDatabase].[dbo].[SimpleProc]();

But are hit with an error, similar to this:

E_CSC_USER_INVALIDCSHARP: C# error CS0103: The name ‘SomeNameSpaceForCodeBehind’ does not exist in the current context.


Why?

Submitting U-SQL queries containing C# code behind methods works fine normally. But once you wrap it up as a stored procedure within the ADL Analytics database the compiled C# is lost. Almost as if the U-SQL file/procedure no longer has its lovely code behind file at all!

Just to be explicit with the issue. Here is an example stored procedure that I’ve modified from the Visual Studio U-SQL Sample Application project. Note my GetHelloWorld method that I’ve added just for demonstration purposes.

DROP PROCEDURE IF EXISTS [dbo].[SimpleProc];
 
CREATE PROCEDURE [dbo].[SimpleProc]()
AS
BEGIN
 
    @searchlog =
        EXTRACT UserId INT,
                START DateTime,
                Region string,
                Query string,
                Duration INT?,
                Urls string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
 
    @WithCodeBehind =
        SELECT 
            *,
            SomeNameSpaceForCodeBehind.MyCodeBehind.GetHelloWorld() AS SomeText
        FROM @searchlog;
 
    OUTPUT @WithCodeBehind
    TO "/output/SearchLogResult1.csv"
    USING Outputters.Csv();
 
END;

This U-SQL file then has the following C#, with my totally original naming conventions. No trolls please, this is not the point of this post 🙂

namespace SomeNameSpaceForCodeBehind
{
    public class MyCodeBehind
    {
        static public string GetHelloWorld()
        {
            string text = "HelloWorld";
            return text;
        }
    }
}

So, this is what doesn’t work. Problem hopefully clearly defined.

Solution

To work around this problem, instead of using a C# code behind file for the procedure we need to move the class into its own assembly. This requires a little more effort and plumbing, but does solve the problem. Plus, this approach is probably more familiar to people that have ever worked with CLR functions in SQL Server and wanted to use them within a stored procedure.

This is what we need to do.

  • Add a C# class library to your Visual Studio solution and move the U-SQL code behind into a library namespace (a minimal sketch of such a library is shown after the modified procedure below).

  • Build the library and use the DLL to create an assembly within the ADL Analytics database. The DLL can live in your ADL store root, you can inline it, or create it from Azure Blob Storage. I have another post on that here if you’re interested.
CREATE ASSEMBLY IF NOT EXISTS [HelloWorld] FROM "assembly/ClassLibrary1.dll";
  • Finally, modify your stored procedure to use the assembly instead of the code behind namespace. The new stored procedure should look like this.
DROP PROCEDURE IF EXISTS [dbo].[SimpleProc];
 
CREATE PROCEDURE [dbo].[SimpleProc]()
AS
BEGIN
 
    //Compiled library:
    REFERENCE ASSEMBLY [HelloWorld];
 
    @searchlog =
        EXTRACT UserId INT,
                START DateTime,
                Region string,
                Query string,
                Duration INT?,
                Urls string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
 
    @WithCodeBehind =
        SELECT *,
               //Changed TO USE assembly:
               HelloWorld.ClassLibrary1.GetHelloWorld() AS SomeText
        FROM @searchlog;
 
    OUTPUT @WithCodeBehind
    TO "/output/SearchLogResult1.csv"
    USING Outputters.Csv();
 
END;
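
For reference, and to match the first bullet above, the class library itself is just the original code behind moved into its own project. A minimal sketch, with the namespace and class names inferred from the assembly reference used in the procedure (adjust to your own naming):

namespace HelloWorld
{
    public class ClassLibrary1
    {
        public static string GetHelloWorld()
        {
            //Same logic as the original code behind, now compiled into its own DLL.
            return "HelloWorld";
        }
    }
}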

This new procedure executes without error and gets around the problem above.

I hope this helps and allows you to convert those complex U-SQL scripts to procedures, while retaining any valuable code behind functionality.

Many thanks for reading

Storing U-SQL Assemblies in Azure Blob Storage

I’m hoping the title of this post is fairly self-explanatory. You’re here because, like me, you found that the MSDN language reference page for creating U-SQL assemblies states that it’s possible to store the DLLs in Azure Blob Storage. But it doesn’t actually tell you how. Well, please continue my friends and I’ll show you how.

The offending article: https://msdn.microsoft.com/en-us/library/azure/mt763293.aspx

The offending text snippet:

“Assembly_Source
Specifies the assembly DLL either in form of a binary literal or as a string literal or static string expression/string variable. The binary literal represents the actual .NET assembly DLL, while the string values represent a URI or file path to a .NET assembly DLL file in either an accessible Azure Data Lake Storage or Windows Azure Blob Storage. If the provided source is a valid .NET assembly, the assembly will be copied and registered, otherwise an error is raised.”

Before going any further, this post isn’t a dig at the usual lack of Microsoft documentation. Mainly because when I posted this problem as a question on Stack Overflow the missing information was provided from the horse’s mouth, Mr Michael Rys (@MikeDoesBigData). Therefore, all is forgiven and I’m more than happy to write this post on Microsoft’s behalf and for my fellow developers. #SQLFamily

http://stackoverflow.com/questions/40842170/create-usql-assembly-from-dll-in-azure-blob-storage

Thanks again Mike. Moving on…

Assumptions

Within your Azure subscription you have the following services already deployed and running.

  • Azure Data Lake Analytics (ADLa)
  • Azure Data Lake Store (ADLs)
  • Azure Storage, with a suitable blob container.

You are also comfortable with referencing assemblies in your U-SQL scripts and so far have done so by inlining the compiled assembly in the U-SQL file, or have stored the DLL in ADLs with a simple file path reference to the ADLs root directory.

Granting Access

The most important thing you’ll need to do to get this working, as Mike mentions in the SO answer, is allow your ADLa service to access the blob storage account. This only requires a few clicks in the Azure portal.

From the ADLa blade choose Data Sources and click Add Data Source.

Populate the blade’s drop-down menus with your preferred choices and click Add. You should then have the storage account listed as an ADLa data source, as below.

Note: the Azure Storage account doesn’t need to be in the same data centre as the ADLa service, unlike ADLa and ADLs.
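
If you’d rather script this than click through the portal, the AzureRM Data Lake Analytics module includes a cmdlet for adding the data source. A minimal sketch with placeholder values:

#Add a Blob Storage account as an ADLa data source (placeholder account names and key).
Add-AzureRmDataLakeAnalyticsDataSource `
    -Account "<your ADLa account name>" `
    -Blob "<your storage account name>" `
    -AccessKey "<storage account access key>"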

Create Assembly

Next the U-SQL.

To reference a DLL in the blob storage account container we need to create the assembly using the wasb URL. Like this:

wasb://YourBlobContainerName@YourBlobStorageAccountName.blob.core.windows.net/YourAssembly.dll

Complete CREATE ASSEMBLY syntax.

CREATE ASSEMBLY IF NOT EXISTS [YourSchema].[PurpleFrog.Pauls.DataLakeHelperFunctions]
FROM "wasb://AllSupportingFiles@MiscBlobsAccount.blob.core.windows.net/
      PurpleFrog.Pauls.DataLakeHelperFunctions.dll";

Why Do This

Hopefully pre-empting some comments on this post: given that we can inline the assembly and store it in ADLs, why would you want to put the DLLs in a separate storage account?

Well, this is really just for operational convenience. In a recent project I was working on we had created a lot of custom code, not just for Azure Data Lake, but also Azure Data Factory. We therefore used a blob storage account as a support bucket for all compiled code and parent object files. This gave us a centralised place to deploy to regardless of which service was consuming the libraries. Again, just for convenience. All DLLs in one place for all services.

I hope you found this post helpful.

Many thanks for reading.


Creating a U-SQL Date Dimension & Numbers Table in Azure Data Lake

Now we all know what a date dimension is and there are plenty of really great examples out there for creating them in various languages. Well, here’s my U-SQL version, creating the output from scratch using a numbers table. Remember that U-SQL needs to be handled slightly differently because we don’t have any iterative functionality available. Plus, its ability to massively parallelise jobs means we can’t write something that relies on procedural code.

This is version 1…

//Enter start and end dates for the date dimension:
DECLARE @StartDate DateTime = new DateTime(2017,1,1);
DECLARE @EndDate DateTime = new DateTime(2018,12,31);
 
//Create numbers table
@Numbers0 = SELECT * FROM (VALUES(0)) AS Row (Number);
@Numbers1 = SELECT [Number] FROM @Numbers0 
    UNION ALL SELECT [Number]+1 AS Number FROM @Numbers0;
@Numbers2 = SELECT [Number] FROM @Numbers1 
    UNION ALL SELECT [Number]+2 AS Number FROM @Numbers1;
@Numbers4 = SELECT [Number] FROM @Numbers2 
    UNION ALL SELECT [Number]+4 AS Number FROM @Numbers2;
@Numbers8 = SELECT [Number] FROM @Numbers4 
    UNION ALL SELECT [Number]+8 AS Number FROM @Numbers4;
@Numbers16 = SELECT [Number] FROM @Numbers8 
    UNION ALL SELECT [Number]+16 AS Number FROM @Numbers8;
@Numbers32 = SELECT [Number] FROM @Numbers16 
    UNION ALL SELECT [Number]+32 AS Number FROM @Numbers16;
@Numbers64 = SELECT [Number] FROM @Numbers32 
    UNION ALL SELECT [Number]+64 AS Number FROM @Numbers32;
//Double it again if you want it bigger...
 
//Create date dimension
@DateDimension = 
SELECT 
    int.Parse([Date].ToString("yyyyMMdd")) AS DateKey,
    [Date],
    [Date].ToString("dd/MM/yyyy") AS DateString,
    [Date].Day AS Day,
    [Date].Month AS Month,
    Math.Floor(((decimal)[Date].Month + 2) / 3) AS Quarter,
    [Date].Year AS Year,
    Convert.ToInt32([Date].DayOfWeek) + 1 AS DayOfWeekNo,
    [Date].ToString("dddd") AS DayName,
    [Date].ToString("MMMM") AS MonthName,
    [Date].Month >=4 ? [Date].ToString("yyyy")+"/"+([Date].AddYears(+1)).ToString("yy") 
        : ([Date].Year - 1).ToString() + "/" + [Date].ToString("yy") AS FinancialYear,
    DateTimeFormatInfo.CurrentInfo.Calendar.GetWeekOfYear(
        [Date], CalendarWeekRule.FirstDay, System.DayOfWeek.Sunday) AS WeekNoOfYear
FROM
    (
    SELECT 
        @StartDate.AddDays(Convert.ToDouble([RowNumber]) -1) AS Date
    FROM 
        (
        SELECT
            ROW_NUMBER() OVER (ORDER BY n1.[Number]) AS RowNumber
        FROM 
            @Numbers64 AS n1
            CROSS JOIN @Numbers64 AS n2 //make it big!
        ) AS x
    ) AS y
WHERE
    [Date] <= @EndDate; //cheat to cut off results
 
 
//Output files
OUTPUT @DateDimension
TO "/Stuff/DateDimension.csv"
ORDER BY [Date] ASC
USING Outputters.Csv(quoting : true, outputHeader : true);
 
//Get a numbers table as a bonus :-)
OUTPUT @Numbers64
TO "/Stuff/Numbers.csv"
ORDER BY [Number] ASC
USING Outputters.Csv(quoting : true, outputHeader : true);

In version 2 I may use a replicated string that EXPLODEs from an array rather than using a numbers table. But that’s for another time. I included the numbers table as an output in this one as a little bonus 🙂
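
For the curious, that version 2 approach might look roughly like the untested sketch below, exploding an in-memory array of day offsets instead of unioning a numbers table (it assumes System.Linq is available, as it is in the default U-SQL namespaces):

//Untested sketch of a possible version 2 approach.
DECLARE @StartDate DateTime = new DateTime(2017,1,1);

@offsets =
    SELECT new SQL.ARRAY<int>(Enumerable.Range(0, 730)) AS DayOffsets
    FROM (VALUES(1)) AS T(x);

@dates =
    SELECT @StartDate.AddDays((double) Offset) AS Date
    FROM @offsets
    CROSS APPLY EXPLODE(DayOffsets) AS O(Offset);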

Hope this helps you out while swimming in the Azure Data Lake.

Many thanks for reading.


Passing Parameters to U-SQL from Azure Data Factory

Let’s try and keep this post short and sweet. Diving right in, imagine a scenario where we have an Azure Data Factory (ADF) pipeline that includes activities to perform U-SQL jobs in Azure Data Lake (ADL) Analytics. We want to control the U-SQL by passing the ADF time slice value to the script, hopefully a fairly common use case. This isn’t yet that intuitive when constructing the ADF JSON activity, so I hope this post will save you some debugging time.

For my example I’ve created a stored procedure in my ADL Analytics database that accepts a parameter @TimeSliceStart as a string value in the format yyyyMMdd.

[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);
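
The body of the procedure isn’t important for this post, but for context its skeleton might look something like the hypothetical sketch below; only the parameter name and its string data type matter here.

CREATE PROCEDURE IF NOT EXISTS [ExampleDatabase].[dbo].[usp_DoSomeStuff]
    (
    @TimeSliceStart string
    )
AS
BEGIN
    //EXTRACT, transform and OUTPUT statements filtered on @TimeSliceStart go here...
END;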

This doesn’t have to be a stored procedure. ADF is also happy if you give it U-SQL files or even just inline the entire script. Regardless, the ADF parameter handling is the same.

In my ADF JSON activity I then have the following:

{
"name": "DataLakeJob1",
"properties": {
"description": "Run USQL with timeslice param",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
	"script": "[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);",
	"degreeOfParallelism": 5,
	"priority": 1,
	"parameters": {
		"TimeSliceStart": "$$Text.Format('{0:yyyyMMdd}', Time.AddMinutes(SliceStart, 0))"
	}
} // etc ... } ] // etc ...}}}

Notice in the extended properties we have a parameters attribute that can include children of the actual variables we want to pass to the U-SQL script.

Here are the important things to understand about this ADF parameters attribute.

  • The name of the parameter must match the name of the variable expected by U-SQL exactly.
  • As you would expect the data types of the expected variable and JSON parameter must match.
  • It is perfectly acceptable to have multiple parameters in the ADF JSON, written in any order.

So how does this work?…

What ADF does when calling ADL is take the parameters listed in the JSON and write out a series of U-SQL ‘DECLARE @Variable’ lines. These then get appended to the top of the actual U-SQL script before giving it to ADL as a job to run. You can see this if you go into the ADL Analytics blades in the Azure portal, select the job created by the ADF activity and then choose Duplicate Script. This reveals the actual U-SQL used in the job.
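
In other words, the duplicated job script ends up looking something like this (the date value below is just an illustrative example of a formatted time slice):

//Injected by ADF from the activity's parameters attribute:
DECLARE @TimeSliceStart string = "20170101";

//Followed by the original script content:
[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);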

Then…

Just knowing what ADF does when converting the JSON parameters to U-SQL declared variables is the main takeaway here.

That’s it! I promised short and sweet.

Many thanks for reading.


Writing a U-SQL Merge Statement

Unlike T-SQL, U-SQL does not currently support MERGE statements; our friend that we have come to know and love since its introduction in SQL Server 2008. Not only that, but U-SQL also doesn’t currently support UPDATE statements either… I know… Open mouth emoji required! This immediately leads to the problem of change detection in our data and how, for example, we should handle the ingestion of a daily rolling 28-day TSV extract, requiring a complete year-to-date output. Well, in this post we will solve that very problem.

Now before we move on it’s worth pointing out that U-SQL is now our/my language of choice for working with big data in Azure using the Data Lake Analytics service. It’s not yet the way of things for our on-premises SQL Server databases, so relax. T-SQL or R are still our out-of-the-box tools there (SQL Server 2016). Also, if you want to take a step back from this slightly deeper U-SQL topic and find out what U-SQL is first, I can recommend my purple amphibious colleague’s blog, link below.

https://www.purplefrogsystems.com/blog/2016/02/what-is-u-sql/

Assuming you’re comfortable with the U-SQL basics, let’s move on. For the below examples I’m working with Azure Data Lake (ADL) Analytics, deployed in my Azure MSDN subscription, although we could do everything here in the local Visual Studio emulator, without the cloud service (very cool and handy for development). I also have the Visual Studio Data Lake tools for the service available and installed. Specifically for this topic I have created a ‘U-SQL Sample Application’ project to get us started. This is simply for ease of explanation and so you can get most of the setup code for what I’m doing here without any great difficulty. Visual Studio Data Lake tools download link below if needed.

https://www.microsoft.com/en-us/download/details.aspx?id=49504


Once we have this solution available including its Ambulance and Search Log samples please find where your Visual Studio Cloud Explorer panel (Ctrl + \, Ctrl + X) is hiding as we’ll use this to access the local Data Lake Analytics database on your machine.

Database Objects

To get things rolling, open the U-SQL file from the sample project called ‘SearchLog-4-CreatingTable’ and execute (AKA ‘Submit’) this to run locally against your ADL Analytics instance. This gives us a database and target table to work with for the merge operation. It also inserts some tab-separated sample data into the table. If we don’t insert this initial dataset you’ll find joining to an empty table will prove troublesome.

Now U-SQL is all about extracting and outputting at scale. There isn’t any syntax sugar to merge datasets. But do we really need the sugar? No. So, we are going to use a database table as our archive or holding area to ensure we get the desired ‘upsert’ behaviour. Then write our MERGE statement long hand using a series of conventional joins. Not syntax sugar. Just good old fashioned joining of datasets to get the old, the new and the changed.

Recap of the scenario; we have a daily rolling 28-day input, requiring a full year to date output.

Merging Data

Next open the U-SQL file from the sample project called ‘SearchLog-1-First_U-SQL_Script’. This is a reasonable template to adapt as it contains the EXTRACT and OUTPUT code blocks already.

For the MERGE we next need a set of three SELECT statements joining both the EXTRACT (new/changed data) and table (old data) together. These are as follows, in mostly English type syntax first 🙂

  • For the UPDATE, we’ll do an INNER JOIN. Table to file. Taking fields from the EXTRACT only.
  • For the INSERT, we’ll do a LEFT OUTER JOIN. File to table. Taking fields from the EXTRACT where NULL in the table.
  • To retain old data, we’ll do a RIGHT OUTER JOIN. File to table. Taking fields from the table where NULL in the file.

Each of the three SELECT statements can then have UNION ALL conditions between them to form a complete dataset including any changed values, new values and old values loaded by a previous file. This is the code you’ll want to add for the example in your open file between the extract and output code blocks. Please don’t just copy and paste without understanding it.

@AllData =
    --update current
    SELECT e1.[UserId],
           e1.[START],
           e1.[Region],
           e1.[Query],
           e1.[Duration],
           e1.[Urls],
           e1.[ClickedUrls]
    FROM [SearchLogDemo].[dbo].[SearchLog] AS t1
         INNER JOIN
             @searchlog AS e1
         ON t1.[UserId] == e1.[UserId]
 
    UNION ALL
 
    --insert new
    SELECT e2.[UserId],
           e2.[START],
           e2.[Region],
           e2.[Query],
           e2.[Duration],
           e2.[Urls],
           e2.[ClickedUrls]
    FROM @searchlog AS e2
         LEFT OUTER JOIN
             [SearchLogDemo].[dbo].[SearchLog] AS t2
         ON t2.[UserId] == e2.[UserId]
    WHERE
    t2.[UserId] IS NULL
 
    UNION ALL
 
    --keep existing
    SELECT t3.[UserId],
           t3.[START],
           t3.[Region],
           t3.[Query],
           t3.[Duration],
           t3.[Urls],
           t3.[ClickedUrls]
    FROM @searchlog AS e3
         RIGHT OUTER JOIN
             [SearchLogDemo].[dbo].[SearchLog] AS t3
         ON t3.[UserId] == e3.[UserId]
    WHERE
    e3.[UserId] IS NULL;

This union of data can then OUTPUT to our usable destination doing what U-SQL does well before resetting our ADL Analytics table for the next load. By reset, I mean TRUNCATE the table and INSERT everything from @AllData back into it. This preserves our history/our old data and allows the MERGE behaviour to work again and again using only SELECT statements.

Replacing the OUTPUT variable from @searchlog, you’ll then want to add the following code below the three SELECT statements.

OUTPUT @AllData
TO "/output/SearchLogAllData.csv"
USING Outputters.Csv();
 
TRUNCATE TABLE [SearchLogDemo].[dbo].[SearchLog];
 
INSERT INTO [SearchLogDemo].[dbo].[SearchLog]
(
    [UserId],
    [START],
    [Region],
    [Query],
    [Duration],
    [Urls],
    [ClickedUrls]
)
SELECT [UserId],
       [START],
       [Region],
       [Query],
       [Duration],
       [Urls],
       [ClickedUrls]
FROM @AllData;

If all goes well you can edit the ‘SearchLog.tsv’ file, removing and changing data, and keep rerunning the U-SQL script performing the MERGE behaviour. Please test away. Don’t just believe me that it works. As a bonus you get a pretty job diagram too…


The only caveat here is that we can’t deal with deletion detection from the source file… unless we do something a little more complex for the current loading period. Let’s save that for a later blog post.

A couple of follow up general tips.

  • Have a USE DATABASE condition at the top of your scripts to ensure you hit the correct database. Just like T-SQL (see the snippet after this list).
  • If you’re struggling for fields to join on because you don’t have a primary key, you could use UNION instead of UNION ALL. But this of course takes more effort to work out the distinct values. Just like T-SQL.
  • Be careful with C# data types and case sensitivity. U-SQL is not as casual as T-SQL with such things.
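
For the first tip, that’s just a one-liner at the top of the script, for example:

//Ensure the script runs against the intended database.
USE DATABASE [SearchLogDemo];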

That’s it. U-SQL merge behaviour achieved. I guess the bigger lesson here for the many T-SQL people out there is: don’t forget the basics, it’s still a structured query language. Syntax sugar is sweet, but not essential.

Hope this was helpful.

Many thanks for reading


Azure Data Lake Authentication from Azure Data Factory

To set the scene for the title of this blog post let’s first think about other services within Azure. You’ll probably already know that most services deployed require authentication via some form of connection string and generated key. These keys can be granted various levels of access and also recycled as required, for example an IoT Event Hub seen below (my favourite service to play with).


Then we have other services like SQL DB that require user credentials to authenticate, as we would expect from the on-premises version of the product. Finally, we have a few other Azure services that handle authentication in a very different way altogether, requiring both user credentials initially and then giving us session and token keys to be used by callers. These session and token keys are a lot more fragile than connection strings and can expire or become invalid if the calling service gets rebuilt or redeployed.

In this blog post I’ll explore and demonstrate specifically how we handle session and token based authentication for Azure Data Lake (ADL), firstly when calling it as a Linked Service from Azure Data Factory (ADF), then secondly within ADF custom activities. The latter of these two ADF based operations becomes a little more difficult because the .Net code created and compiled is unfortunately treated as a distant relative of ADF, requiring its own authentication to ADL storage as an Azure Application. To further clarify, a Custom Activity in ADF does not inherit its authorising credentials from the parent Linked Service; it is responsible for its own session/token. Why? Because, as you may know from reading my previous blog post, Custom Activities get compiled and executed by an Azure Batch Service, meaning the compute for the .Net code is very much detached from ADF.

At this point I should mention that this applies to Data Lake Analytics and Data Lake Storage. Both require the same approach to authentication.

Data Lake as a Service Within Data Factory

The easy one first, adding an Azure Data Lake service to your Data Factory pipeline. From the Azure portal, within the ADF Author and Deploy blade, you simply add a new Data Lake Linked Service, which returns a JSON template for the operation into the right hand panel. Then we kindly get provided with an Authorize button (spelt wrong) at the top of the code block.

Clicking this will pop up the standard Microsoft login screen requesting work or personal user details etc. Upon completion of successful authentication your Azure subscription will be inspected. If more than one applicable service exists, you’ll of course need to select which you require authorisation for. But once done you’ll return to the JSON template, now with a completed Authorization value and SessionId.
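
For reference, the resulting Linked Service JSON has roughly this shape (an ADF v1 sketch with placeholder values; the real authorization and sessionId strings are generated for you by the Authorize button):

{
    "name": "AzureDataLakeStoreLinkedService",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
            "authorization": "<generated by the Authorize button>",
            "sessionId": "<generated by the Authorize button>"
        }
    }
}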

Just for information, and to give you some idea of the differences in this type of authorisation compared to other Azure services: when I performed this task for the purpose of creating screenshots for this post, the resulting Authorization URL was 1219 characters long and the returned SessionId was 1100! Or half a page of a standard Word document each. By comparison, an IoT Hub key is only 44 characters. Furthermore, the two values are customised to the service that requested them and can only be used within the context where they were created.

For completeness, because we can also now develop ADF pipelines from Visual Studio, it’s worth knowing that a similar operation is now available as part of the Data Factory extension. In Visual Studio, within your ADF project, on the Linked Service branch you are able to Right Click > Add > New Item and choose Data Lake Store or Analytics. You’ll then be taken through a wizard (similar in look to that of the ADF deployment wizard) which requests user details and the ADF context, and returns the same JSON template with populated authorising values.


A couple of follow up tips and lessons learnt here:

  • If you tell Visual Studio to reverse engineer your ADF pipeline from a currently deployed Azure factory where an existing ADL token and session ID are available, these will not be brought into Visual Studio and you’ll need to authorise the service again.
  • If you copy an ADL JSON template from the Azure portal ‘Author and Deploy’ area, Visual Studio will not pop up the wizard to authorise the service and you’ll need to do it again.
  • If you delete the ADL Linked Service within the portal ‘Author and Deploy’ area, the same Linked Service tokens in Visual Studio will become invalid and you’ll need to authorise the service again.
  • If you sneeze too loudly while Visual Studio is open, you’ll need to authorise the service again.

Do you get the idea when I said earlier that the authorisation method is fragile? Very sophisticated, but fragile when chopping and changing things during development.

What you may find yourself doing fairly frequently is:

  1. Deploying an ADF project from Visual Studio.
  2. The deployment wizard failing, telling you the ADL tokens have expired or are no longer authorised.
  3. Adding a new Linked Service to the project just to get the user authentication wizard popup.
  4. Then copying the new token and session values into the existing ADL Linked Service JSON file.
  5. Then excluding the new services you created just to re-authorise from the Visual Studio project.

Fun! Moving on.

Update: you can use an Azure AD service principal to authenticate both the Azure Data Lake Store and Azure Data Lake Analytics services from ADF. Details are included in this post: https://docs.microsoft.com/en-gb/azure/data-factory/v1/data-factory-azure-datalake-connector#azure-data-lake-store-linked-service-properties
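
As a rough illustration of that service principal approach (a sketch based on the linked documentation, with placeholder values, rather than anything generated in this post):

{
    "name": "AzureDataLakeStoreLinkedService",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
            "servicePrincipalId": "<application id>",
            "servicePrincipalKey": "<application key>",
            "tenant": "<tenant id or domain>"
        }
    }
}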

Data Factory Custom Activity Call Data Lake

Next, the slightly more difficult way to authenticate against ADL: using an ADF .Net Custom Activity. As mentioned previously, the .Net code, once sent to Azure as a DLL, is treated as a third party application requiring its own credentials.

The easiest way I’ve found to get this working is firstly to use PowerShell to register the application in Azure, which, using the correct cmdlets, returns an application GUID and password that, when combined, give the .Net code its credentials. Here’s the PowerShell you’ll need below. Be sure you run this with elevated permissions locally.

# Sign in to Azure.
Add-AzureRmAccount
 
#Set this variables
$appName = "SomeNameThatYouWillRegoniseInThePortal"
$uri = "AValidURIAlthoughNotApplicableForThis"
$secret = "SomePasswordForTheApplication"
 
# Create a AAD app
$azureAdApplication = New-AzureRmADApplication `
    -DisplayName $appName `
    -HomePage $Uri `
    -IdentifierUris $Uri `
    -Password $secret
 
# Create a Service Principal for the app
$svcprincipal = New-AzureRmADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId
 
# To avoid a PrincipalNotFound error, I pause here for 15 seconds.
Start-Sleep -s 15
 
# If you still get a PrincipalNotFound error, then rerun the following until successful. 
$roleassignment = New-AzureRmRoleAssignment `
    -RoleDefinitionName Contributor `
    -ServicePrincipalName $azureAdApplication.ApplicationId.Guid
 
# The stuff you want:
 
Write-Output "Copy these values into the C# sample app"
 
Write-Output "_subscriptionId:" (Get-AzureRmContext).Subscription.SubscriptionId
Write-Output "_tenantId:" (Get-AzureRmContext).Tenant.TenantId
Write-Output "_applicationId:" $azureAdApplication.ApplicationId.Guid
Write-Output "_applicationSecret:" $secret
Write-Output "_environmentName:" (Get-AzureRmContext).Environment.Name

My recommendation here is to take the returned values and store them in something like the Class Library settings, available from the Visual Studio project properties. Don’t store them as constants at the top of your class as it’s highly likely you’ll need them multiple times.

Next, what to do with the application GUID etc. Well, your Custom Activity C# will need something like the following. Apologies for dumping massive code blocks into this post, but you will need all of this in your class if you want to use details from your ADF service and work with ADL files.

class SomeCustomActivity : IDotNetActivity
{
	//Get credentials for app
	string domainName = Settings.Default.AzureDomainName;
	string appId = Settings.Default.ExcelExtractorAppId; //From PowerShell <<<<<
	string appPass = Settings.Default.ExceExtractorAppPass; //From PowerShell <<<<<
	string appName = Settings.Default.ExceExtractorAppName; //From PowerShell <<<<<
 
	private static DataLakeStoreFileSystemManagementClient adlsFileSystemClient;
	//and or:
	private static DataLakeStoreAccountManagementClient adlsAccountManagerClient;
 
	public IDictionary<string, string> Execute(
		IEnumerable<LinkedService> linkedServices,
		IEnumerable<Dataset> datasets,
		Activity activity,
		IActivityLogger logger)
	{
		//Get linked service details from Data Factory
		Dataset inputDataset = new Dataset();
		inputDataset = datasets.Single(dataset =>
			dataset.Name == activity.Inputs.Single().Name);
 
		AzureDataLakeStoreLinkedService inputLinkedService;
 
		inputLinkedService = linkedServices.First(
			linkedService =>
			linkedService.Name ==
			inputDataset.Properties.LinkedServiceName).Properties.TypeProperties
			as AzureDataLakeStoreLinkedService;
 
		//Get account name for data lake and create credentials for app
		var creds = AuthenticateAzure(domainName, appId, appPass);
		string accountName = inputLinkedService.AccountName;
 
		//Authorise new instance of Data Lake Store
		adlsFileSystemClient = new DataLakeStoreFileSystemManagementClient(creds);
 
		/*
			DO STUFF...
 
			using (Stream input = adlsFileSystemClient.FileSystem.Open
				(accountName, completeInputPath)
				)
		*/	
 
 
		return new Dictionary<string, string>();
	}
 
 
	private static ServiceClientCredentials AuthenticateAzure
		(string domainName, string clientID, string clientSecret)
	{
		SynchronizationContext.SetSynchronizationContext(new SynchronizationContext());
 
		var clientCredential = new ClientCredential(clientID, clientSecret);
		return ApplicationTokenProvider.LoginSilentAsync(domainName, clientCredential).Result;
	}
}

Finally, before you execute anything, be sure to grant the Azure app permissions to the respective Data Lake service. In the case of the Data Lake Store, from the portal you can use the Data Explorer blades to assign folder permissions.
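
This can also be scripted with the AzureRM Data Lake Store cmdlets. A minimal sketch, assuming the $svcprincipal variable from the registration script above and using placeholder account and path values:

#Grant the registered application's service principal access to a folder (placeholder values).
Set-AzureRmDataLakeStoreItemAclEntry `
    -Account "<your ADLs account name>" `
    -Path "/Samples/Data" `
    -AceType User `
    -Id $svcprincipal.Id `
    -Permissions All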


I really hope this post has saved you some time in figuring out how to authorise Data Lake services from Data Factory, especially when developing beyond what the ADF Copy Wizard gives you.

Many thanks for reading.


Paul’s Frog Blog

Paul is a Microsoft Data Platform MVP with 10+ years’ experience working with the complete on-premises SQL Server stack in a variety of roles and industries. Now, as the Business Intelligence Consultant at Purple Frog Systems, he has turned his keyboard to big data solutions in the Microsoft cloud, specialising in Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics, Event Hubs and IoT. Paul is also a STEM Ambassador for the networking education in schools’ programme, PASS chapter leader for the Microsoft Data Platform Group – Birmingham, and a SQL Bits, SQL Relay and SQL Saturday speaker and helper. He is currently the Stack Overflow top user for Azure Data Factory, as well as a very active member of the technical community.
Thanks for visiting.
@mrpaulandrew