
Azure SSIS – How to Set Up, Deploy, Execute & Schedule Packages

Welcome back to work in 2018! 🙂

Let’s get stuck in with a hot topic. How do we actually use our beloved SQL Server Integration Services (SSIS) packages in Azure with all this new platform as a service (PaaS) stuff? Well, in this post I’m going to go through it end to end.


First, some caveats:

  1. Several of the Azure components required for this are still in public preview and can be considered 'not finished', meaning this is going to seem a little painful.
  2. The ADFv2 developer UI is still in private preview. But I’ve cheated and used it to generate the JSON to help you guys. Hopefully it’ll be available publicly soon.
  3. I’ve casually used my Microsoft sponsored Azure subscription and not had to worry about the cost of these services. I advise you check with the bill payer.
  4. Everything below has been done in a deliberate order. Especially the service setup.
  5. Everything below has been deployed in the same Azure region to avoid any cross data centre authentication unpleasantness. I suggest doing the same. I used EastUS for this post.

Ok, moving on…

Azure Services Setup

Now, let's set some expectations. To get our SSIS packages into Azure we need a collection of services. When working on premises this gets neatly wrapped up with a pretty bow into something called SQL Server. Sadly in Azure there is no wrapping, no pretty bow and nothing that neat. Yet!

Azure Data Factory Version 2 (ADFv2)

First up, my friend Azure Data Factory. As you'll probably already know, now in version 2 it has the ability to create recurring schedules and to house the thing we need to execute our SSIS packages: the Integration Runtime (IR). Without ADF we don't get the IR and can't execute SSIS packages. My hope would be that the IR becomes a standalone service, but for now it's contained within ADF.

To deploy the service we can simply use the Azure portal blades. Whatever location you choose here, make sure you use the same location for everything that follows, just for ease. Also, it might be worth looking ahead to ensure everything you want is actually available in your preferred Azure region.

Let's park that service and move on.

Azure SQL Server Instance

Next, we need a logical SQL Server instance to house the SSIS database. Typically you deploy one of these when you create a normal Azure SQLDB (without realising it), but they can be created on their own without any databases attached. To be clear, this is not an Azure SQL Server Managed Instance. It does not have a SQL Agent and is just the endpoint we connect to and authenticate against with some SQL credentials.

Again, to deploy the service we can simply use the Azure portal blades. On this one make sure the box is checked to 'Allow Azure services to access server' (highlighted with the orange arrow in the portal screenshot) and of course make a note of the user name and password. If you don't check the box ADF will not be able to create the SSISDB in the logical instance later on.
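
If you'd rather script this step than click through the portal blades, something like the following should work with the AzureRM module. Treat it as a rough sketch; the resource group, server name and location are made-up placeholders.

# Placeholder names throughout - change the resource group, server and location as required.
$creds = Get-Credential -Message "SQL admin user name and password"

New-AzureRmSqlServer -ResourceGroupName "SSISinAzure" `
                     -ServerName "yourssisserver" `
                     -Location "EastUS" `
                     -SqlAdministratorCredentials $creds

# The scripted equivalent of ticking 'Allow Azure services to access server' in the portal:
New-AzureRmSqlServerFirewallRule -ResourceGroupName "SSISinAzure" `
                                 -ServerName "yourssisserver" `
                                 -FirewallRuleName "AllowAllWindowsAzureIps" `
                                 -StartIpAddress "0.0.0.0" `
                                 -EndIpAddress "0.0.0.0"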

Once the SQL instance is deployed, go into the service blades and update the firewall rules to allow access from your current external IP address. This isn't anything specifically required for SSIS; you need to do it for any SQLDB connection. It's something I always forget, so I'm telling you to help me remember! Thanks.
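
The client IP rule can be scripted too; again, a sketch with hypothetical names and an example IP address.

New-AzureRmSqlServerFirewallRule -ResourceGroupName "SSISinAzure" `
                                 -ServerName "yourssisserver" `
                                 -FirewallRuleName "ClientIP" `
                                 -StartIpAddress "81.2.69.160" `
                                 -EndIpAddress "81.2.69.160"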


Azure SSIS IR

Next on the list, we need the shiny new thing, the SSIS IR, which needs creating and then starting up. In my opinion this is a copy of the SQL Server MsDtsSrvr.exe taken from the on premises product and run in the cloud on a VM that we don't get access to… Under the covers it probably is, but I'm guessing.

Sadly we don't have a nice Azure portal user interface for this yet. It's going to need some PowerShell. Make sure you have your Azure modules up to date and run the following with the top set of variables assigned as required.

# Azure Data Factory version 2 information:
$SubscriptionId = ""
$ResourceGroupName = ""
$DataFactoryName = "" 
$DataFactoryLocation = ""
 
# Azure-SSIS integration runtime information:
$AzureSSISName = ""
$AzureSSISDescription = ""
 
$AzureSSISNodeSize = "Standard_A4_v2"
$AzureSSISNodeNumber = 2 
$AzureSSISMaxParallelExecutionsPerNode = 2 
$SSISDBPricingTier = "S1" 
 
# Azure Logical SQL instance information:
$SSISDBServerEndpoint = ".database.windows.net"
$SSISDBServerAdminUserName = ""
$SSISDBServerAdminPassword = ""
 
 
<# LEAVE EVERYTHING ELSE BELOW UNCHANGED #>
 
$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID="+ $SSISDBServerAdminUserName +";Password="+ $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
    $sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
    Write-Warning "Cannot connect to your Azure SQL DB logical server/Azure SQL MI server, exception: $_"  ;
    Write-Warning "Please make sure the server you specified has already been created. Do you want to proceed? [Y/N]"
    $yn = Read-Host
    if(!($yn -ieq "Y"))
    {
        Return;
    } 
}
 
Login-AzureRmAccount
Select-AzureRmSubscription -SubscriptionId $SubscriptionId
 
Set-AzureRmDataFactoryV2 -ResourceGroupName $ResourceGroupName `
                        -Location $DataFactoryLocation `
                        -Name $DataFactoryName
 
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName, $secpasswd)
Set-AzureRmDataFactoryV2IntegrationRuntime  -ResourceGroupName $ResourceGroupName `
                                            -DataFactoryName $DataFactoryName `
                                            -Name $AzureSSISName `
                                            -Type Managed `
                                            -CatalogServerEndpoint $SSISDBServerEndpoint `
                                            -CatalogAdminCredential $serverCreds `
                                            -CatalogPricingTier $SSISDBPricingTier `
                                            -Description $AzureSSISDescription `
                                            -Location $DataFactoryLocation `
                                            -NodeSize $AzureSSISNodeSize `
                                            -NodeCount $AzureSSISNodeNumber `
                                            -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode
 
write-host("##### Starting your Azure-SSIS integration runtime. This takes 20 to 30 minutes to complete. #####")
Start-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                             -DataFactoryName $DataFactoryName `
                                             -Name $AzureSSISName `
                                             -Force
 
write-host("##### Completed #####")
write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

I confess I've stolen this from Microsoft's documentation here and tweaked it slightly to use the more precise subscription ID parameter, as well as a couple of other things that I felt made life easier. While this is running you should get a progress bar from the PowerShell ISE for the SSIS IR service starting, which really does take around 30 minutes. Be patient.

If you’d prefer to do this through the ADF PowerShell deployment cmdlets here is the JSON to use. Again assign values to the attributes as required. The JSON will create the SSIS IR, but it won’t start it.

{
"name": "",
"properties": {
	"type": "Managed",
	"description": "",
	"typeProperties": {
		"computeProperties": {
			"location": "EastUS",
			"nodeSize": "Standard_A4_v2",
			"numberOfNodes": 2,
			"maxParallelExecutionsPerNode": 2
		},
		"ssisProperties": {
			"catalogInfo": {
				"catalogServerEndpoint": "Your Instance.database.windows.net",
				"catalogAdminUserName": "user",
				"catalogAdminPassword": {
					"type": "SecureString",
					"value": "password"
				},
				"catalogPricingTier": "S1"
}}}}}

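Once the IR has been created from the JSON you'll still need to start it. The same cmdlet used in the script above does the job; the resource group, factory and IR names below are placeholders.

Start-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName "YourResourceGroup" `
                                             -DataFactoryName "YourDataFactoryV2" `
                                             -Name "YourSSISIR" `
                                             -Force
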
For info, the new developer UI gives you a wizard to go through the steps and a nice screen to see that the IR now exists. Until you get public access to this you'll just have to assume it's there.

Anyway, moving on. Once the IR has deployed and started you'll have an SSIS IR and, in your logical SQL instance, the SSISDB. Exciting!

Open SSMS, making sure you are using version 17.2 or later. In the connection dialogue box, as well as the usual bits, go to Options and explicitly set which database you're connecting to. If you don't, the Integration Services branch won't appear in SSMS Object Explorer. You'll see the database tables, views and stored procs, but you won't have any of the SSIS options to control packages.

If all goes well you should get a very familiar sight…

Creating & Deploying an SSIS Package

As this is a 'how to' guide I've done something very simple in my package. It basically copies a CSV file from one Azure Data Lake Storage (ADLS) folder to another. I'm going to assume we are all familiar with more complex SSIS packages. Plus, the point of this post was getting the services working, not doing any data transformations.

SSIS Azure Feature Pack

What is probably worth pointing out is that if you want to work with Azure services in SSIS within SQL Server Data Tools (SSDT) you need to install the Azure Feature Pack. Download and install it from the link below:
https://docs.microsoft.com/en-us/sql/integration-services/azure-feature-pack-for-integration-services-ssis

Once installed, you'll have the Azure services available in your SSIS Toolbox (Control Flow/Data Flow) and Connection Manager.

For info, the Azure Data Lake Storage connection manager now offers the option to use a service principal to authenticate.


Package Deployment

Now I'm not going to teach a granny to suck eggs (or whatever the phrase is). To deploy the package you don't need to do anything special. I simply created the ISPAC file in SSDT and used the project deployment wizard in SSMS. The deployment wizard launched from the project didn't work in my version of SSDT running in Visual Studio 2015. Not sure why at this point, so I used SSMS.

Package Execution

Similarly, I'm going to assume we all know how to execute an SSIS package from Management Studio. It's basically the same right-click menu where the deployment wizard gets launched. Granny, eggs, etc.

Or, we can execute a couple of stored procedures using some good old fashioned T-SQL (remember that?). See below.
 

DECLARE @execution_id bigint;  
 
EXEC [SSISDB].[catalog].[create_execution] 
	@package_name=N'DataLakeCopy.dtsx', 
	@execution_id=@execution_id OUTPUT,
	@folder_name=N'Testing',
	@project_name=N'AzureSSIS',
	@use32bitruntime=False; 
 
EXEC [SSISDB].[catalog].[start_execution] 
	@execution_id;

I mention this because we’ll need it when we schedule the package in ADF later.
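
As a side note, the same T-SQL can be fired from PowerShell if you want to test package execution outside SSMS, for example using Invoke-Sqlcmd from the SqlServer module. A rough sketch with my hypothetical server name and credentials:

Invoke-Sqlcmd -ServerInstance "yourssisserver.database.windows.net" `
              -Database "SSISDB" `
              -Username "YourAdminUser" `
              -Password "YourPassword" `
              -Query "DECLARE @execution_id bigint;
                      EXEC [catalog].[create_execution]
                          @package_name=N'DataLakeCopy.dtsx',
                          @execution_id=@execution_id OUTPUT,
                          @folder_name=N'Testing',
                          @project_name=N'AzureSSIS',
                          @use32bitruntime=False;
                      EXEC [catalog].[start_execution] @execution_id;"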

Scheduling with ADFv2

Ok, now the fun part: scheduling the package. Currently we don't have a SQL Agent on our logical instance and we don't have Elastic Database Jobs (coming soon). Meaning we need to use ADF.

Thankfully in ADFv2 this does not involve provisioning time slices! Can I get a hallelujah? 🙂

This is the part where I cheated and used the new developer UI, but I’ll share all the JSON in case you don’t have a template for these bits in ADFv2 yet.

Linked Service to SQLDB

To allow ADF to access and authenticate against our logical SQL instance we need a linked service. We did of course already provide this information when creating the SSIS IR, but ADF needs it again to store and use for activity executions.

{
    "name": "SSISDB",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Integrated Security=False;Encrypt=True;Connection Timeout=30;Data Source=;Initial Catalog=;User ID="
            }
        }
    }
}

The Pipeline

Nothing extra here, a very simple pipeline similar to what you've previously seen in ADFv1, only without the time slice schedule values and other fluff.

{
    "name": "RunSSISPackage",
    "properties": {
        "activities": []
    }
}

Stored Procedure Activity

Next, the main bit of the instruction set: the activity. You'll know from the T-SQL above that in the SSISDB you need to first create an instance of the execution for the SSIS package, then pass the execution ID to the start execution stored procedure. ADF still can't handle this directly with one activity giving its output to the second, meaning we have to wrap up the T-SQL we want in a parameter for the sp_executesql stored procedure. Everything can be solved with more abstraction, right? 🙂

            {
                "name": "CreateExecution",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 20
                },
                "typeProperties": {
                    "storedProcedureName": "sp_executesql",
                    "storedProcedureParameters": {
                        "stmt": {
                            "value": "
Declare @execution_id bigint;  
EXEC [SSISDB].[catalog].[create_execution] 
@package_name=N'DataLakeCopy.dtsx', 
@execution_id=@execution_id OUTPUT,
@folder_name=N'Testing',
@project_name=N'AzureSSIS',
@use32bitruntime=False; 
 
EXEC [SSISDB].[catalog].[start_execution] 
@execution_id;"
                        }
                    }
                },
                "linkedServiceName": {
                    "referenceName": "SSISDB",
                    "type": "LinkedServiceReference"
                }
            }

Scheduled Trigger

Last but not least, our scheduled trigger. Very similar to what we get with the SQL Agent, but now in ADF! For this post I went for 1:30pm daily as a test.

{
    "name": "Daily",
    "properties": {
        "runtimeState": "Stopped", //change to Started
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "RunSSISPackage",
                    "type": "PipelineReference"
                },
                "parameters": {}
            }
        ],
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-01-05T13:23:16.395Z",
                "timeZone": "UTC",
                "schedule": {
                    "minutes": [
                        30
                    ],
                    "hours": [
                        13
                    ]
                }
            }
        }
    }
}

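If, like me, you don't have the developer UI yet, the JSON above can also be deployed with the V2 PowerShell cmdlets. What follows is a hedged sketch: it assumes each JSON block is saved to its own file (with the stored procedure activity pasted into the pipeline's activities array), reuses the variables from the IR script earlier, and assumes the cmdlets accept a definition file parameter.

Set-AzureRmDataFactoryV2LinkedService -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name "SSISDB" `
                                      -DefinitionFile ".\SSISDB.json"

Set-AzureRmDataFactoryV2Pipeline -ResourceGroupName $ResourceGroupName `
                                 -DataFactoryName $DataFactoryName `
                                 -Name "RunSSISPackage" `
                                 -DefinitionFile ".\RunSSISPackage.json"

Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                                -DataFactoryName $DataFactoryName `
                                -Name "Daily" `
                                -DefinitionFile ".\Daily.json"

# Triggers deploy in a stopped state, so remember to start it.
Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                                  -DataFactoryName $DataFactoryName `
                                  -Name "Daily"
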
The new UI gives you a nice agent style screen to create more complex schedules, even allowing triggers every minute if you wish. Here’s a teaser screen shot:

I hope this gave you an end-to-end look at how to get your SSIS packages running in Azure and saved you looking through 10 different bits of Microsoft documentation.

Many thanks for reading



 

Business Intelligence in Azure – SQLBits 2018 Precon

What can you expect from my SQLBits pre-conference training day in February 2018 at the London Olympia?

Well my friends, in short, we are going to take a whirlwind tour of the entire business intelligence stack of services in Azure. No stone will be left unturned. No service will be left without scalability. We'll cover them all and we certainly aren't going to check with the Azure bill payer before turning up the compute on our data transforms.



What will we actually cover?

With new cloud services and advancements in locally hosted platforms developing a lambda architecture is becoming the new normal. In this full day of high level training we’ll learn how to architect hybrid business intelligence solutions using Microsoft Azure offerings. We’ll explore the roles of these cloud data services and how to make them work for you in this complete overview of business intelligence on the Microsoft cloud data platform.

Here’s how we’ll break that down during the day…

Module 1 – Getting Started with Azure

Using platform as a service products is great, but let's take a step back. To kick off we'll cover the basics for deploying and managing your Azure services. Navigating the Azure portal and building dashboards isn't always as intuitive as we'd like. What's a resource group? And why is it important to understand your Azure Active Directory tenant?

Module 2 – An Overview of BI in Azure

What's available for the business intelligence architect in the cloud, and how might these services relate to traditional on premises ETL and cube data flows? Is ETL enough for our big unstructured data sources or do we need to mix things up and add some more letters to the acronym in the cloud?

Module 3 – Databases in Azure (SQL DB, SQL DW, Cosmos DB, SQL MI)

It's SQL Server Jim, but not as we know it. Check out the PaaS flavours of our long term on premises friends. Can we trade the agent and an operating system for that sliding bar of scalable compute? DTU and DWU are here to stay, with new SLAs relating to throughput. Who's on ACID, and as BI people do we care?

Module 4 – The Azure Machines are here to Learn

Data scientist or developer? Azure Machine Learning was designed for applied machine learning. Use best-in-class algorithms in a simple drag-and-drop interface. We’ll go from idea to deployment in a matter of clicks. Without a terminator in sight!

Module 5 – Swimming in the Data Lake with U-SQL

Let’s understand the role of this hyper-scale two tier big data technology and how to harness its power with U-SQL, the offspring of T-SQL and C#. We’ll cover everything you need to know to get started developing solutions with Azure Data Lake.

Module 6 – IoT, Event Hubs and Azure Stream Analytics

Real-time data is everywhere. We need to use it and unlock it as a rich source of information that can be channelled to react to events, produce alerts from sensor values or in 9000 other scenarios. In this module, we’ll learn how, using Azure messaging hubs and Azure Stream Analytics.

Module 7 – Power BI, our Semantic Layer, is it All Things to All People?

Combining all our data sources in one place with rich visuals and a flexible data modelling tool. Power BI takes it all, small data, big data, streaming data, website content and more. But we really need a Venn diagram to decide when/where it’s needed.

Module 8 – Data Integration with Azure Data Factory and SSIS

The new integration runtime is here. But how do we unlock the scale out potential of our control flow and data flow? Let’s learn to create the perfect dependency driven pipeline for our data flows. Plus, how to work with the Azure Batch Service should you need that extensibility.

 

Finally we’ll wrap up the day by playing the Azure icon game, which you’ll all now be familiar with and able to complete with a perfect score having completed this training day 🙂

Many thanks for reading and I hope to see you in February, it's going to be magic 😉

Register now: https://www.regonline.com/registration/Checkin.aspx?EventID=2023328

All training day content is subject to change, dependent on timings and the will of the demo gods!


 

What’s New in Azure Data Factory Version 2 (ADFv2)

I’m sure for most cloud data wranglers the release of Azure Data Factory Version 2 has been long overdue. Well good news friends. It’s here! So, what new features does the service now offer for handling our Azure data solutions?… In short, loads!

In this post, I’ll try and give you an overview of what’s new and what to expect from ADFv2. However, I’m sure more questions than answers will be raised here. As developers we must ask why and how when presented with anything. But let’s start somewhere.

Note: the order of the sub headings below was intentional.

Before diving into the new and shiny I think we need to deal with a couple of concepts to understand why ADFv2 is a completely new service and not just an extension of what version 1 offered.

Let’s compare Azure Data Factory Version 1 and Version 2 at a high level.

  • ADFv1 – is a service designed for the batch data processing of time series data.
  • ADFv2 – is a very general-purpose hybrid data integration service with very flexible execution patterns.

This makes ADFv2 a very different animal and something that can now handle scale-out control flow and data flow patterns for all our ETL needs. Microsoft seems to have got the message here, following lots of feedback from the community, that this is the framework we want for developing our data flows. Plus, it's how we've been working for a long time with the very mature SQL Server Integration Services (SSIS).
 
 
 

Concepts:

Integration Runtime (IR)

Everything done in Azure Data Factory v2 will use the Integration Runtime engine. The IR is the core service component for ADFv2. It is to the ADFv2 JSON framework of instructions what the Common Language Runtime (CLR) is to the .Net framework.

Currently the IR can be virtualised to live in Azure, or it can be used on premises as a local emulator/endpoint. To give each of these instances their proper JSON label, the IR can be 'SelfHosted' or 'Managed'. To try and put that into context, consider the ADFv1 Data Management Gateway as a self-hosted IR endpoint (for now). This distinction between self-hosted and managed IRs will also be reflected in the data movement costs on your subscription bill, but let's not get distracted with pricing yet.
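
For the self-hosted flavour, I'd expect creation to look something like the sketch below, using the V2 cmdlets covered in the PowerShell section later. The names are placeholders and the key retrieval cmdlet is my assumption of what you'd use before registering the local node.

Set-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName "YourResourceGroup" `
                                           -DataFactoryName "YourDataFactoryV2" `
                                           -Name "OnPremisesIR" `
                                           -Type SelfHosted `
                                           -Description "Self-hosted IR, replacing the v1 Data Management Gateway"

# Retrieve the authentication key needed when installing the IR node on your local server.
Get-AzureRmDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "YourResourceGroup" `
                                              -DataFactoryName "YourDataFactoryV2" `
                                              -Name "OnPremisesIR"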

The new IR is designed to perform three operations:

  1. Move data.
  2. Execute ADF activities.
  3. Execute SSIS packages.

Of course, points 1 and 2 here aren't really anything new as we could already do this in ADFv1, but point 3 is what should spark the excitement. It is this ability to transform our data that has been missing from Azure and that we've badly needed.

With the IR in ADFv2 this means we can now lift and shift our existing on premises SSIS packages into the cloud or start with a blank canvas and create cloud based scale out control flow and data flow pipelines, facilitated by the new capabilities in ADFv2.

Without crossing any lines, the IR will become the way you start using SSIS in Azure, regardless of whether you decide to wrap it in ADFv2 or not.

Branching

I assume this next concept won't be new for anyone that's used SSIS. But it's great to learn that we now have it available in the ADFv2 control flow (at an activity level).

Post execution our downstream activities can now be dependent on four possible outcomes as standard.

  • On success
  • On failure
  • On completion
  • On skip

Also, custom 'if' conditions will be available for branching based on expressions (more on expressions later).


That’s the high-level concepts dealt with. Now, for ease of reading let’s break the new features down into two main sections. The service level changes and then the additions to our toolkit of ADF activities.

Service Features:

Web Based Developer UI

This won't be available for use until later in the year, but having a web based development tool to build our ADF pipelines is very exciting!… No more hand crafting the JSON. I'll leave this point just with a sneaky picture. I'm sure this explains more than I can in words.

It will include an interface to GitHub for source control and the ability to execute the activities directly in the development environment.

For field mappings between source and destination the new UI will also support a drag and drop panel, like SSIS.

Better quality screenshots to follow as soon as it's available.

Expressions & Parameters

Like most other Microsoft data tools, expressions give us that valuable bit of inline extensibility to achieve things more dynamically when developing. Within our ADFv2 JSON we can now influence the values of our attributes in a similar way using a rich new set of custom inner syntax, secondary to the ADF JSON. To support the expressions factory-wide, parameters will become first class citizens in the service.

As a basic example, before we might do something like this:

"name": "value"

Now we can have an expression and return the value from elsewhere, maybe using a parameter like this:

"name": "@parameters('StartingDatasetName')"

The @ symbol becomes important here as the start of the inline expression. The expression syntax is rich and offers a host of inline functions to call and manipulate our service. These include:

  • String functions – concat, substring, replace, indexof etc.
  • Collection functions – length, union, first, last etc.
  • Logic functions – equals, less than, greater than, and, or, not etc.
  • Conversion functions – coalesce, xpath, array, int, string, json etc.
  • Math functions – add, sub, div, mod, min, max etc.
  • Date functions – utcnow, addminutes, addhours, format etc.

System Variables

As a good follow on from the new expressions/parameters available we now also have a handful of system variables to support our JSON. These are scoped at two levels with ADFv2.

  1. Pipeline scoped.
  2. Trigger scoped (more on triggers later).

The system variables extend the parameter syntax allowing us to return values like the data factory name, the pipeline name and a specific run ID. Variables can be called in the following way using the new @ symbol prefix to reference the dynamic content:

"attribute": "@pipeline().RunId"

Inline Pipelines

For me this is a deployment convenience thing. Previously (and currently) our linked services, datasets and pipelines were separate JSON files within our Visual Studio solution. Now an inline pipeline can house all its required parts within its own properties. Personally, I like having a single reusable linked service for various datasets in one place that only needs updating with new credentials once. Why would you duplicate these settings as part of several pipelines? Maybe if you want some complex expressions to influence your data handling and you are limited by the scope of a system variable, an inline pipeline may then be required.

Anyway, this is what the JSON looks like:

{
    "name": "SomePipeline",
    "properties": {
		"activities": [], 		//before
		"linkedServices": [], 		//now available
		"datasets": [],			//now available
		"parameters": []		//now available
		}
}

Beware: if you use the ADF copy wizard via the Azure portal, an inline pipeline is what you'll now get back.

Activity Retry & Pipeline Concurrency

In ADFv2 our activities will be categorised as control and non-control types. This is mainly to support the use of our new activities like 'ForEach' (more on the activity itself later). A 'ForEach' activity sits within the category of a control type, meaning it will not have retry, long retry and concurrency options available within its JSON policy block. I think it's logical that something like a sequential loop can't run concurrently, so just be aware that such JSON attributes will now be validated depending on the category of the activity.

Our familiar and existing activities like ‘Copy’, ‘Hive’ and ‘U-SQL’ will therefore be categorised as non-control types with policy attributes remaining the same.

Event Triggers

Like our close friend Azure Logic Apps, ADFv2 can perform actions based on triggered events. So far, the only working example of this requires an Azure Blob Storage account that will output a file arrival event. It will be great to replace those time series polling activities, which needed to keep retrying until the file appeared, with this event-based approach.

Scheduled Triggers

You guessed it. We can now finally schedule our ADF executions using a defined recurring pattern (with enough JSON). This schedule will sit above our pipelines as a separate component within ADFv2.

  • A trigger will be able to start multiple pipelines.
  • A pipeline can be started by multiple scheduled triggers.

Let’s look at some JSON to help with the understanding.

{
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "",       // Minute, Hour, Day, Week or Year
        "interval": 1,         // optional, how often to fire (defaults to 1)
        "startTime": "",
        "endTime": "",
        "timeZone": "",
        "schedule": {          // optional (advanced scheduling specifics)
          "hours": [],         // 0-24
          "weekDays": [],
          "minutes": [],       // 0-60
          "monthDays": [],     // 1-31
          "monthlyOccurrences": [
            {
              "day": "",
              "occurrence": 1  // 1-5
            }
          ]
        }
      }
    },
    "pipelines": [             // pipelines here
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": ""
        },
        "parameters": {
          "": {
            "type": "Expression",
            "value": ""
          }
        }
      }
    ]
  }
}

Tumbling Window Triggers

For me, ADFv1 time slices simply have a new name. A tumbling window is a time slice in ADFv2. Enough said on that I think.

Depends On

We know that ADF is a dependency driven tool in terms of datasets. But now activities are also dependency driven, with the execution of one providing the necessary information for the execution of the second. The introduction of a new 'dependsOn' attribute/clause can be used within an activity to drive this behaviour.

The 'dependsOn' clause will also provide the branching behaviour mentioned above. Quick example:

"dependsOn": [ { "dependencyConditions": [ "Succeeded" ], "activity": "DownstreamActivity" } ]

More to come with this explanation later when we talk about the new ‘LookUp’ activity.

Azure Monitor & OMS Integration

Diagnostic logs for various other Azure services have been available for a while in Azure Monitor and OMS. Now, with a little bit of setup, ADFv2 will be able to output much richer logs with various metrics available across the data factory service. These metrics will include:

  • Successful pipeline runs.
  • Failed pipeline runs.
  • Successful activity runs.
  • Failed activity runs.
  • Successful trigger runs.
  • Failed trigger runs.

This will be a great improvement on the current PowerShell or .Net work required with version 1 just to monitor issues at a high level.
If you want to know more about Azure Monitor go here: https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/monitoring-overview-azure-monitor
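
Even without the OMS wiring, the V2 cmdlets let you query pipeline and activity runs directly. A hedged sketch with placeholder names, pulling everything from the last day:

$runs = Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName "YourResourceGroup" `
                                            -DataFactoryName "YourDataFactoryV2" `
                                            -LastUpdatedAfter (Get-Date).AddDays(-1) `
                                            -LastUpdatedBefore (Get-Date)

# Drill into the activity runs behind each pipeline run to find what failed.
foreach ($run in $runs)
{
    Get-AzureRmDataFactoryV2ActivityRun -ResourceGroupName "YourResourceGroup" `
                                        -DataFactoryName "YourDataFactoryV2" `
                                        -PipelineRunId $run.RunId `
                                        -RunStartedAfter (Get-Date).AddDays(-1) `
                                        -RunStartedBefore (Get-Date)
}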

PowerShell

It’s worth being aware that to support ADFv2 there will be a new set of PowerShell cmdlets available within the Azure module. Basically, all named the same as the cmdlets used for version 1 of the service, but now including ‘V2’ somewhere in the cmdlet name and accepting parameters specific to the new features.

Let’s start with the obvious one:

New-AzureRmDataFactoryV2 `
	-ResourceGroupName "ADFv2" `
	-Name "PaulsFunFactoryV2" `
	-Location "NorthEurope"

Or, a splatting friendly version for the PowerShell geeks 🙂

$parameters = @{
    Name = "PaulsFunFactoryV2"
    Location = "NorthEurope"
    ResourceGroupName = "ADFv2"
}
New-AzureRmDataFactoryV2  @parameters
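
Another one worth knowing is the cmdlet for on-demand pipeline runs, which returns a run ID you can then feed into the monitoring cmdlets mentioned above; a quick sketch reusing my made-up factory name and a hypothetical pipeline name:

$runId = Invoke-AzureRmDataFactoryV2Pipeline -ResourceGroupName "ADFv2" `
                                             -DataFactoryName "PaulsFunFactoryV2" `
                                             -PipelineName "SomePipeline"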

Pricing

This isn’t a new feature as such, but probably worth mentioning that with all the new components and functionality in ADFv2 there is a new pricing model that you’ll need to do battle with. More details here: https://azure.microsoft.com/en-gb/pricing/details/data-factory/v2

Note the new pricing tables for SSIS as a service, with variations on CPU, RAM and storage!


Activities:

Lookup

This is not an SSIS data transformation lookup! For ADFv2 we can look up a list of datasets to be used in another downstream activity, like a Copy. I mentioned earlier that we now have a 'dependsOn' clause in our JSON; Lookup is a good example of why we might use it.

Scenario: we have a pipeline containing two activities. The first looks up some list of datasets (maybe some tables in a SQL DB). The second performs the data movement using the results of the lookup so it knows what to copy. This is very much a dataset level handling operation and not a row level data join. I think a picture is required:

Here’s a JSON snippet, which will probably be a familiar structure for those of you that have ever created an ARM Template.

{
"name": "SomePipeline",
"properties": {
    "activities": [
        {
            "name": "LookupActivity", //First
            "type": "Lookup"
        },
        {
            "name": "CopyActivity", //Second
            "type": "Copy",              
            "dependsOn": [  //Dependancy
                {
                    "activity": "LookupActivity"
                }
            ],
            "inputs": [],  //From Lookup
            "outputs": []
        }
    ]        
}}

Currently the following sources can be used as lookups, all of which need to return a JSON dataset.

  • Azure Storage (Blob and Table)
  • On Premises Files
  • Azure SQL DB

HTTP

With the HTTP activity, we can call out to any web service directly from our pipelines. The call itself is a little more involved than a typical web hook and requires an XML job request to be created within a workspace. Like other activities, ADF doesn't handle the work itself; it passes the instructions off to some other service. In this case it uses the Azure Queue Service. The queue service is the compute for this activity that handles the request and HTTP response; if successful, this gets thrown back up to ADF.

There’s something about needing XML inside JSON for this activity that just seems perverse. So much so that I’m not going to give you a code snippet 🙂

Web (REST)

Our new Web activity type is simply a REST API caller, which I assume doesn't require much more explanation. In ADFv1 if we wanted to make a REST call a custom activity was required and we needed C# for the interface interaction. Now we can do it directly from the JSON, with child attributes to cover all the usual suspects for REST APIs:

  • URL
  • Method (GET, POST, PUT)
  • Headers
  • Body
  • Authentication

ForEach

The ForEach activity is probably self-explanatory for anyone with an ounce of programming experience. ADFv2 brings some enhancements to this. You can use a ForEach activity to simply iterate over a collection of defined items one at a time as you would expect. This is done by setting the IsSequential attribute of the activity to True. But you also have the ability to perform the activity in parallel, speeding up the processing time and using the scaling power of Azure.

For example: if you had a 'ForEach' activity iterating over a 'Copy' operation with 10 different items, with the attribute "isSequential" set to false, all copies will execute at once. ForEach then offers a new maximum of 20 concurrent iterations, compared to a single non-control activity, whose concurrency supports only a maximum of 10.

To try and clarify, the ForEach activity accepts items and is developed as a recursive thing. But on execution you can choose to process them sequentially or in parallel (up to a maximum of 20). Maybe a picture will help:

Going even deeper, the 'ForEach' activity is not confined to processing a single activity; it can also iterate over a collection of other activities, meaning we can nest activities in a workflow where 'ForEach' is the parent/master activity. The items clause for the looping still needs to be provided as a JSON array, maybe by an expression and parameter within your pipeline. But those items can reference another inner block of activities.

There will definitely be a follow up blog post on this one with some more detail and a better explanation, come back soon 🙂

Metadata

Let's start by defining what metadata is within the context of ADFv2. Metadata includes the structure, size and last modified date information about a dataset. A metadata activity will take a dataset as an input and output the various information about what it's found. This output could then be used as a point of validation for some downstream operation. Or, for some dynamic data transformation task that needs to be told what dataset structure to expect.

The input JSON for this dataset type needs to know the basic file format and location. Then the structure will be worked out based on what it finds.

{
"name": "MyDataset",
"properties": {
"type": "AzureBlob",
	"linkedService": {
		"referenceName": "StorageLinkedService",
		"type": "LinkedServiceReference"
	},
	"typeProperties": {
		"folderPath":"container/folder",
		"Filename": "file.json",
		"format":{
			"type":"JsonFormat"
			"nestedSeperator": ","
		}
	}
}}

Currently, only datasets within Azure blob storage are supported.

I'm hoping you are beginning to see how branching, dependsOn conditions, expressions and parameters bring you new options when working with ADFv2, where one new feature uses the other.


The next couple, as you'll know, aren't new activities, but they do have some new options available when creating them.

Custom

Previously in our .Net custom activity code we could only pass static extended properties from the ADF JSON down to the C# class. Now we have a new ‘referenceObjects’ attribute that can be used to access information about linked services and datasets. Example JSON snippet below for an ADFv2 custom activity:

{
  "name": "SomePipeline",
  "properties": {
    "activities": [{
      "type": "DotNetActivity",
      "linkedServiceName": {
        "referenceName": "AzureBatchLinkedService",
        "type": "LinkedServiceReference"
      },
		"referenceObjects": { //new bits
          "linkedServices": [],
		  "datasets": []
        },
        "extendedProperties": {}
}}}

This completes the configuration data for our C# methods, giving us access to things like the connection credentials used in our linked services. Within our custom activity code we need something like the following to get these values.

static void Main(string[] args)
{
    CustomActivity customActivity = 
        SafeJsonConvert.DeserializeObject<CustomActivity>(File.ReadAllText("activity.json"), 
        DeserializationSettings);
    List<LinkedService> linkedServices = 
        SafeJsonConvert.DeserializeObject<List<LinkedService>>(File.ReadAllText("linkedServices.json"), 
        DeserializationSettings);
    List<Dataset> datasets = 
        SafeJsonConvert.DeserializeObject<List<Dataset>>(File.ReadAllText("datasets.json"), 
        DeserializationSettings);
}
 
static JsonSerializerSettings DeserializationSettings
{
    get
    {
        var DeserializationSettings = new JsonSerializerSettings
        {
            DateFormatHandling = Newtonsoft.Json.DateFormatHandling.IsoDateFormat,
            DateTimeZoneHandling = Newtonsoft.Json.DateTimeZoneHandling.Utc,
            NullValueHandling = Newtonsoft.Json.NullValueHandling.Ignore,
            ReferenceLoopHandling = Newtonsoft.Json.ReferenceLoopHandling.Serialize
        };
        DeserializationSettings.Converters.Add(new PolymorphicDeserializeJsonConverter<Activity>("type"));
        DeserializationSettings.Converters.Add(new PolymorphicDeserializeJsonConverter<LinkedService>("type"));
        DeserializationSettings.Converters.Add(new PolymorphicDeserializeJsonConverter<Dataset>("type"));
        DeserializationSettings.Converters.Add(new TransformationJsonConverter());
 
        return DeserializationSettings;
    }
}

Copy

This can be a short one as we know what copy does. The activity now supports the following new data sources and destinations:

  • Dynamics CRM
  • Dynamics 365
  • Salesforce (with Azure Key Vault credentials)

Also, as standard, 'Copy' will be able to return the number of rows processed as a parameter. This could then be used with a branching 'if' condition when the expected number of rows isn't returned, for example.


Hopefully that's everything and you're now fully up to date with ADFv2 and all the new and exciting things it has to offer. Stay tuned for more in-depth posts soon.

For more information check out the Microsoft documentation on ADF here: https://docs.microsoft.com/en-gb/azure/data-factory/introduction

Many thanks for reading.

 

Special thanks to Rob Sewell for reviewing and contributing towards the post.


Chaining Azure Data Factory Activities and Datasets

As I work with Azure Data Factory (ADF) and help others in the community more and more, I encounter some confusion surrounding how to construct a complete dependency driven ADF solution; one that chains multiple executions and handles all of your requirements. In this post I hope to address some of that confusion and will allude to some emerging best practices for Azure Data Factory usage.

First a few simple questions:

  • Why is there confusion? In my opinion this is because the ADF copy wizard available via the Azure portal doesn't help you architect a complete solution. It can be handy for reverse engineering certain things, but really the wizard tells you nothing about the choices you make and what the JSON behind it is doing. Like most wizards, it just leads to bad practices!
  • Do I need several data factory services for different business functions? No, you don't have to. Pipelines within a single data factory service can be disconnected for different processes, and often having all your linked services in one place is easier to manage. Plus a single factory offers reusability and means a single set of source code, etc.
  • Do I need one pipeline per activity? No, you can house many activities in a single pipeline. Pipelines are just logic containers to assist you when managing data orchestration tasks. If you want an SSIS comparison, think of them as sequence containers. In a factory I may group all my on premises gateway uploads into a single pipeline. This means I can pause that stream of uploads on demand, maybe when the gateway keys need to be refreshed, etc.
  • Is the whole data factory a pipeline? Yes, in concept. But for technical terminology a pipeline is a specific ADF component. The marketing people do love to confuse us!
  • Can an activity support multiple inputs and multiple outputs? Generally yes. But there are exceptions depending on the activity type. U-SQL calls to Azure Data Lake can have multiples of both. ADF doesn’t care as long as you know what the called service is doing. On the other hand a copy activity needs to be one to one (so Microsoft can charge more for data movements).
  • Does an activity have to have an input dataset? No. For example, you can create a custom activity that executes your code for a defined time slice without an input dataset, just the output.

Datasets

Moving on, let's go a little deeper and think about a scenario that I use in my community talks. We have an on premises CSV file. We want to upload it, clean it and aggregate the output. For each stage of this process we need to define a dataset for Azure Data Factory to use.

To be clear, a dataset in this context is not the actual data. It is just a set of JSON instructions that defines where and how our data is stored: for example, its file path, its extension, its structure and its relationship to the executing time slice.

Let's define each of the datasets we need in ADF to complete the above scenario for just one file:

  1. The on premises version of the file. Linked to information about the data management gateway to be used, with local credentials and file server/path where it can be accessed.
  2. A raw Azure version of the file. Linked to information about the data lake storage folder to be used for landing the uploaded file.
  3. A clean version of the file. Linked to information about the output directory of the cleaning process.
  4. The aggregated output file. Linked to information about the output directory of the query being used to do the aggregation.

All of the linked information to these datasets should come from your ADF linked services.

So, we have one file to process, but in ADF we now need four datasets, one defined for each stage of the data flow. These datasets don't need to be complex; something as simple as the following bit of JSON will do.

{
  "name": "LkpsCurrencyDataLakeOut",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "DataLakeStore",
    "structure": [ ],
    "typeProperties": {
      "folderPath": "Out",
      "fileName": "Dim.Currency.csv"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
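
Although I deploy everything from Visual Studio, the same dataset JSON could be pushed with the version 1 PowerShell cmdlets if you prefer scripting. A hedged sketch, assuming the JSON above is saved to its own file and using placeholder resource names:

New-AzureRmDataFactoryDataset -ResourceGroupName "YourResourceGroup" `
                              -DataFactoryName "YourDataFactory" `
                              -File ".\LkpsCurrencyDataLakeOut.json"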

Activities

Next, our activities. Now the datasets are defined above, we need ADF to invoke the services that are going to do the work for each stage, as follows:

| Activity (JSON Value)  | Task Description                                      | Input Dataset | Output Dataset |
| ---------------------- | ----------------------------------------------------- | ------------- | -------------- |
| Copy                   | Upload file from local storage to Data Lake storage.  | 1             | 2              |
| DotNetActivity         | Perform transformation/cleaning on raw source file.   | 2             | 3              |
| DataLakeAnalyticsU-SQL | Aggregate the datasets to produce a reporting output. | 3             | 4              |

From the above table we can clearly see the output dataset of the first activity becomes the input of the second. The output dataset of the second activity becomes the input of the third. Apologies if this seems obvious, but I have known it to confuse people.

Pipelines

For our ADF pipeline(s) we can now make some decisions about how we want to manage the data flow.

  1. Add all the activities to a single pipeline, meaning we can stop/start everything for this one data flow end to end.
  2. Add each activity to a different pipeline depending on its type. This is my starting preference.
  3. Have the on premises upload in one pipeline and everything else in a second pipeline.
  4. Maybe separate your pipelines and data flows depending on the type of data. E.g. fact/dimension, or finance and HR.

The point here, is that it doesn’t matter to ADF, it’s just down to how you want to control it. When I created the pipelines for my talk demo I went with option 2. Meaning I get the following pretty diagram, arranged to fit the width of my blog 🙂

Here we can clearly see at the top level each dataset flowing into a pipeline and its child activity. If I'd constructed this using option 1 above I would simply see the first dataset and the fourth, with one pipeline box. I could then drill into the pipeline to see the chained activities within. I repeat, this doesn't matter to ADF.

I hope you found the above useful and a good starting point for constructing your ADF data flows.

Best Practices

As our understanding of Azure Data Factory matures I'm sure some of the following points will need to be re-written, but for now I'm happy to go first and start laying the groundwork of what I consider to be best ADF usage. Comments very welcome.

  1. Resist using the wizard, please.
  2. Keep everything within a single ADF service if you can. Meaning linked services can be reused.
  3. Disconnect your on premises uploads using a single pipeline. For ease of management.
  4. Group your activities into natural pipeline containers for the operation type or data category.
  5. Lay out your ADF diagram carefully, left to right. It makes understanding it much easier for others.
  6. Use Visual Studio configuration files to deploy ADF projects between Dev/Test/Live. Ease of source control and development.
  7. Monitor activity concurrency and time outs carefully. ADF will kill called service executions if breached.
  8. Be mindful of activity cost and group inputs/outputs for data compute where possible.
  9. Use time slices to control your data volumes. Eg. Pass the time slice as a parameter to the called compute service.

What next? Well, I’m currently working on this beast…

  • 127x datasets.
  • 71x activities.
  • 9x pipelines.

… and I’ve got about another third left to build!

Many thanks for reading.


Azure Business Intelligence – The Icon Game!

As Azure becomes the new normal for many organisations our architecture diagrams become ever more complicated. Articulating our designs and data flows to management or technical audiences therefore requires a new group of cloud service icons in our pretty pictures, especially for hybrid solutions. Sadly those icons aren't yet that familiar to most. So, here's a very simple blog post to help you recognise what's in the Azure stack from a Purple Frog business intelligence perspective, in no particular order.

All of the following have been snipped from the Azure portal dashboard, so there shouldn't be any surprises once you start working with these services.

The icon set covers the following services: Azure Data Catalogue, Data Factory, Batch Service, Data Lake Storage, Data Lake Analytics, Power BI, Cosmos DB, IoT Hub, Event Hub, Stream Analytics, Machine Learning, SQL DB, SQL DW, Logical SQL Server, Data Management Gateway, Analysis Services, Resources, Virtual Machine, Azure Active Directory and Blob Storage.

Happy drawing!


Calling U-SQL Stored Procedures with C# Code Behind

So friends, some more lessons learnt when developing with U-SQL and Azure Data Lake. I’ll try and keep this short.

Problem

You have a U-SQL stored procedure written and working fine within your Azure Data Lake Analytics service. But we need to add some more business logic or something requiring a little C# magic. This is the main thing I love about U-SQL, having that C# code behind file where I can extend my normal SQL behaviour. So, being a happy little developer you write your class and method to support the U-SQL above and you recreate your stored procedure. Great!

Next, you try to run that stored procedure…

[ExampleDatabase].[dbo].[SimpleProc]();

But are hit with an error, similar to this:

E_CSC_USER_INVALIDCSHARP: C# error CS0103: The name ‘SomeNameSpaceForCodeBehind’ does not exist in the current context.


Why?

Submitting U-SQL queries containing C# code behind methods works fine normally. But once you wrap it up as a stored procedure within the ADL analytics database the compiled C# is lost. Almost as if the U-SQL file/procedure no longer has its lovely code behind file at all!

Just to be explicit with the issue. Here is an example stored procedure that I’ve modified from the Visual Studio U-SQL Sample Application project. Note my GetHelloWord method that I’ve added just for demonstration purposes.

DROP PROCEDURE IF EXISTS [dbo].[SimpleProc];
 
CREATE PROCEDURE [dbo].[SimpleProc]()
AS
BEGIN
 
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
 
    @WithCodeBehind =
        SELECT 
            *,
            SomeNameSpaceForCodeBehind.MyCodeBehind.GetHelloWorld() AS SomeText
        FROM @searchlog;
 
    OUTPUT @WithCodeBehind
    TO "/output/SearchLogResult1.csv"
    USING Outputters.Csv();
 
END;

This U-SQL file then has the following C#, with my totally original naming conventions. No trolls please, this is not the point of this post 🙂

namespace SomeNameSpaceForCodeBehind
{
    public class MyCodeBehind
    {
        static public string GetHelloWorld()
        {
            string text = "HelloWorld";
            return text;
        }
    }
}

So, this is what doesn’t work. Problem hopefully clearly defined.

Solution

To work around this problem, instead of using a C# code behind file for the procedure we need to move the class into its own assembly. This requires a little more effort and plumbing, but does solve the problem. Plus, this approach is probably more familiar to people that have ever worked with CLR functions in SQL Server that they want to use within a stored procedure.

This is what we need to do.

  • Add a C# class library to your Visual Studio solution and move the U-SQL code behind into a library namespace.

  • Build the library and use the DLL to create an assembly within the ADL analytics database. The DLL can live in your ADL store root, be inlined, or be created from Azure Blob Storage. I have another post on that here if you're interested.
CREATE ASSEMBLY IF NOT EXISTS [HelloWorld] FROM "assembly/ClassLibrary1.dll";
  • Finally, modify your stored procedure to use the assembly instead of the code behind namespace. The new stored procedure should look like this.
DROP PROCEDURE IF EXISTS [dbo].[SimpleProc];
 
CREATE PROCEDURE [dbo].[SimpleProc]()
AS
BEGIN
 
    //Compiled library:
    REFERENCE ASSEMBLY [HelloWorld];
 
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
 
    @WithCodeBehind =
        SELECT *,
               //Changed to use assembly:
               HelloWorld.ClassLibrary1.GetHelloWorld() AS SomeText
        FROM @searchlog;
 
    OUTPUT @WithCodeBehind
    TO "/output/SearchLogResult1.csv"
    USING Outputters.Csv();
 
END;

This new procedure executes without error and gets around the problem above.

I hope this helps and allows you to convert those complex U-SQL scripts to procedures, while retaining any valuable code behind functionality.

Many thanks for reading

Creating a U-SQL Date Dimension & Numbers Table in Azure Data Lake

Now we all know what a date dimension is and there are plenty of really great examples out there for creating them in various languages. Well, here's my U-SQL version, creating the output from scratch using a numbers table. Remember that U-SQL needs to be handled slightly differently because we don't have any iterative functionality available. Plus, its ability to massively parallelise jobs means we can't write something that relies on procedural code.

This is version 1…

//Enter start and end dates for the date dimension:
DECLARE @StartDate DateTime = new DateTime(2017,1,1);
DECLARE @EndDate DateTime = new DateTime(2018,12,31);
 
//Create numbers table
@Numbers0 = SELECT * FROM (VALUES(0)) AS Row (Number);
@Numbers1 = SELECT [Number] FROM @Numbers0 
    UNION ALL SELECT [Number]+1 AS Number FROM @Numbers0;
@Numbers2 = SELECT [Number] FROM @Numbers1 
    UNION ALL SELECT [Number]+2 AS Number FROM @Numbers1;
@Numbers4 = SELECT [Number] FROM @Numbers2 
    UNION ALL SELECT [Number]+4 AS Number FROM @Numbers2;
@Numbers8 = SELECT [Number] FROM @Numbers4 
    UNION ALL SELECT [Number]+8 AS Number FROM @Numbers4;
@Numbers16 = SELECT [Number] FROM @Numbers8 
    UNION ALL SELECT [Number]+16 AS Number FROM @Numbers8;
@Numbers32 = SELECT [Number] FROM @Numbers16 
    UNION ALL SELECT [Number]+32 AS Number FROM @Numbers16;
@Numbers64 = SELECT [Number] FROM @Numbers32 
    UNION ALL SELECT [Number]+64 AS Number FROM @Numbers32;
//Double it again if you want it bigger...
 
//Create date dimension
@DateDimension = 
SELECT 
    int.Parse([Date].ToString("yyyyMMdd")) AS DateKey,
    [Date],
    [Date].ToString("dd/MM/yyyy") AS DateString,
    [Date].Day AS Day,
    [Date].Month AS Month,
    Math.Floor(((decimal)[Date].Month + 2) / 3) AS Quarter,
    [Date].Year AS Year,
    Convert.ToInt32([Date].DayOfWeek) + 1 AS DayOfWeekNo,
    [Date].ToString("dddd") AS DayName,
    [Date].ToString("MMMM") AS MonthName,
    [Date].Month >=4 ? [Date].ToString("yyyy")+"/"+([Date].AddYears(+1)).ToString("yy") 
        : ([Date].Year - 1).ToString() + "/" + [Date].ToString("yy") AS FinancialYear,
    DateTimeFormatInfo.CurrentInfo.Calendar.GetWeekOfYear(
        [Date], CalendarWeekRule.FirstDay, System.DayOfWeek.Sunday) AS WeekNoOfYear
FROM
    (
    SELECT 
        @StartDate.AddDays(Convert.ToDouble([RowNumber]) -1) AS Date
    FROM 
        (
        SELECT
            ROW_NUMBER() OVER (ORDER BY n1.[Number]) AS RowNumber
        FROM 
            @Numbers64 AS n1
            CROSS JOIN @Numbers64 AS n2 //make it big!
        ) AS x
    ) AS y
WHERE
    [Date] <= @EndDate; //cheat to cut off results
 
 
//Output files
OUTPUT @DateDimension
TO "/Stuff/DateDimension.csv"
ORDER BY [Date] ASC
USING Outputters.Csv(quoting : true, outputHeader : true);
 
//Get a numbers table as a bonus :-)
OUTPUT @Numbers64
TO "/Stuff/Numbers.csv"
ORDER BY [Number] ASC
USING Outputters.Csv(quoting : true, outputHeader : true);

In version 2 I may use a replicated string that EXPLODEs from an array rather than using a numbers table. But that’s for another time. I included the numbers table as an output in this one as a little bonus 🙂

Hope this helps you out while swimming in the Azure Data Lake.

Many thanks for reading.


Using Azure Data Factory Configuration Files

Like most things developed, it’s very normal to have multiple environments for the same solution; dev, test, prod etc. Azure Data Factory is no exception to this. However, where it does differ slightly is the way it handles publishing to different environments from the Visual Studio tools provided. In this post we’ll explore exactly how to create Azure Data Factory (ADF) configuration files to support such deployments to different Azure services/directories.

For all the examples in this post I’ll be working with Visual Studio 2015 and the ADF extension available from the market place or via the below link.

https://marketplace.visualstudio.com/items?itemName=AzureDataFactory.MicrosoftAzureDataFactoryToolsforVisualStudio2015

Before we move on let’s take a moment to say that Azure Data Factory configuration files are purely a Visual Studio feature. At publish time Visual Studio simply takes the config file content and replaces the actual JSON attribute values before deploying to Azure. That said, to be explicit:

  • An ADF JSON file with attribute values missing (because they come from config files) cannot be deployed using the Azure portal ‘Author and Deploy’ blade. It will just fail validation because of the missing content.
  • An ADF JSON config file cannot be deployed using the Azure portal ‘Author and Deploy’ blade. It is simply not understood by the browser based tool as a valid ADF schema.

Just as an aside. Code comments in ADF JSON files are also purely a Visual Studio feature. You can only comment your JSON in the usual way in Visual Studio, which at publish time will strip these out for you. Any comments left in code that you copy and paste into the Azure portal will return as syntax errors! I have already given feedback to the Microsoft product team that code comments in the portal blades would be really handy. But I digress.

Apologies in advance if I switch between the words publish and deploy too much; I mean the same thing. I prefer deploy, but in a Visual Studio ADF solution it’s called publish in the menu.

Creating an ADF Configuration File

First let’s use the Visual Studio tools to create a common set of configuration files. In a new ADF project you have the familiar tree including Linked Services, Pipelines etc. Now right click on the project and choose Add > New Item. In the dialogue presented choose Config and add a Configuration File to the project, with a suitable name.

I went to town and did a set of three 🙂

Each time you add a config file to your ADF project, or any component for that matter, you’ll be aware that Visual Studio tries to help you out by giving you a JSON template or starter for what you might want. This is good, but in the case of ADF config files it isn’t that intuitive. Hence this blog post. Let’s move on.

Populating the Configuration File

Before we do anything let me attempt to put into words what we need to do here. Every JSON attribute has a reference of varying levels to get to its value. When we recreate a value in our config file we need to recreate this reference path exactly, from the top level of the component name. In the config file this goes as a parent (at the same level as schema) followed by square brackets [ ], which then contain the rest of the content we want to replace. Next, within the square brackets of the component, we need pairs of attributes (name and value). These represent the references to the actual component structure. In the ‘name’ value we start with a $. which represents the root of the component file. Then we build up the tree reference with a dot for each new level. Lastly, the value is just that: the value to be used instead of whatever may be written in the actual component file.

Make sense? Referencing JSON with JSON? I said it wasn’t intuitive. Let’s move on and see it.

Let’s populate our configuration files with something useful. What values you might want to switch between environments of course greatly depends on what your data factory is doing, but let’s start with a few common attributes. For this example let’s alter a pipeline’s schedule start, end and paused values. I always publish to dev as paused to give me more control over running the pipeline.

At the bottom of our pipeline component file I’ve done the following.

    //etc...
	//activities block
	],
    "start": "1900-01-01", /*<get from config file>*/
    "end": "1900-01-01", /*<get from config file>*/
    "isPaused": /*<get from config file>*/,
    "pipelineMode": "Scheduled"
  }
}

… Which means in my config file I need to create the equivalent set of attribute references and values. Note: the dollar for the root, then one level down into the properties namespace, then another dot before the attribute.

{
  "ExactNameOfYourPipeline": [ // <<< Component name. Exactly!
    {
      "name": "$.properties.isPaused",
      "value": true
    },
    {
      "name": "$.properties.start",
      "value": "2016-08-01"
    },
    {
      "name": "$.properties.end",
      "value": "2017-06-01"
    }
  ]
}

A great thing about this approach with ADF tools in Visual Studio is that any attribute value can be overridden with something from a config file. It’s really flexible and each component can be added in the same way regardless of type. There are however some quirks/features to be aware of, as below.

  • All parent and child name referencing within the config file must match its partner in the actual component JSON file exactly.
  • All referencing is case sensitive, but Visual Studio won’t validate this for you in IntelliSense or when building the project.
  • In the actual component file some attribute values can be left blank as they come from config. Others cannot and will result in the ADF project failing to build.
  • For any config referencing that fails, you’ll only figure it out when you publish and check the Azure portal, where you’ll see that the deployed JSON file still has its original content. Fun.

Right then. Hope that’s all clear as mud 🙂

Publishing using Different Configurations

Publishing is basically the easy bit, involving a wizard, so I don’t need to say much here.

Right click on the project in Visual Studio and choose Publish. In the publish items panel of the wizard simply select the config file you want to use for the deployment.

I hope this post is helpful and saves you some time when developing with ADF.

Many thanks for reading.


Writing a U-SQL Merge Statement

Unlike T-SQL, U-SQL does not currently support MERGE statements. Our friend that we have come to know and love since its introduction in SQL Server 2008. Not only that, but U-SQL also doesn’t currently support UPDATE statements either… I know… Open mouth emoji required! This immediately leads to the problem of change detection in our data and how, for example, we should handle the ingestion of a daily rolling 28-day TSV extract, requiring a complete year to date output. Well in this post we will solve that very problem.

Now before we move on it’s worth pointing out that U-SQL is now our/my language of choice for working with big data in Azure using the Data Lake Analytics service. It’s not yet the way of things for our on premises SQL Server databases, so relax. T-SQL or R are still our out-of-the-box tools there (SQL Server 2016). Also if you want to take a step back from this slightly deeper U-SQL topic and find out What is U-SQL first, I can recommend my purple amphibious colleague’s blog, link below.

https://www.purplefrogsystems.com/blog/2016/02/what-is-u-sql/

Assuming you’re comfortable with the U-SQL basics let’s move on. For the below examples I’m working with Azure Data Lake (ADL) Analytics, deployed in my Azure MSDN subscription. Although we can do everything here in the local Visual Studio emulator, without the cloud service (very cool and handy for development). I also have the Visual Studio Data Lake tools for the service available and installed. Specifically for this topic I have created a ‘U-SQL Sample Application’ project to get us started. This is simply for ease of explanation and so you can get most of the setup code for what I’m doing here without any great difficulty. Visual Studio Data Lake tools download link below if needed.

https://www.microsoft.com/en-us/download/details.aspx?id=49504


Once we have this solution available including its Ambulance and Search Log samples please find where your Visual Studio Cloud Explorer panel (Ctrl + \, Ctrl + X) is hiding as we’ll use this to access the local Data Lake Analytics database on your machine.

Database Objects

To get things rolling open the U-SQL file from the sample project called ‘SearchLog-4-CreatingTable’ and execute (AKA ‘Submit’) it to run locally against your ADL Analytics instance. This gives us a database and target table to work with for the merge operation. It also inserts some tab separated sample data into the table. If we don’t insert this initial dataset you’ll find joining to an empty table will prove troublesome.
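
For reference, the database objects that script creates should look roughly like the sketch below; check the actual ‘SearchLog-4-CreatingTable’ file in your project for the exact definition. The important bits are a table schema matching the later EXTRACT and a clustered index distributed on [UserId].

//Rough sketch only of what the sample setup script creates (the real file may differ slightly).
CREATE DATABASE IF NOT EXISTS SearchLogDemo;
USE DATABASE SearchLogDemo;
 
DROP TABLE IF EXISTS dbo.SearchLog;
CREATE TABLE dbo.SearchLog
(
    UserId int,
    Start DateTime,
    Region string,
    Query string,
    Duration int?,
    Urls string,
    ClickedUrls string,
    INDEX sl_idx CLUSTERED (UserId ASC) DISTRIBUTED BY HASH (UserId)
);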

Now U-SQL is all about extracting and outputting at scale. There isn’t any syntax sugar to merge datasets. But do we really need the sugar? No. So, we are going to use a database table as our archive or holding area to ensure we get the desired ‘upsert’ behaviour. Then write our MERGE statement long hand using a series of conventional joins. Not syntax sugar. Just good old fashioned joining of datasets to get the old, the new and the changed.

Recap of the scenario: we have a daily rolling 28-day input, requiring a full year to date output.

Merging Data

Next open the U-SQL file from the sample project called ‘SearchLog-1-First_U-SQL_Script’. This is a reasonable template to adapt as it contains the EXTRACT and OUTPUT code blocks already.

For the MERGE we next need a set of three SELECT statements joining both the EXTRACT (new/changed data) and table (old data) together. These are as follows, in mostly English type syntax first 🙂

  • For the UPDATE, we’ll do an INNER JOIN. Table to file. Taking fields from the EXTRACT only.
  • For the INSERT, we’ll do a LEFT OUTER JOIN. File to table. Taking fields from the EXTRACT where NULL in the table.
  • To retain old data, we’ll do a RIGHT OUTER JOIN. File to table. Taking fields from the table where NULL in the file.

Each of the three SELECT statements can then have UNION ALL conditions between them to form a complete dataset including any changed values, new values and old values loaded by a previous file. This is the code you’ll want to add for the example in your open file between the extract and output code blocks. Please don’t just copy and paste without understanding it.

@AllData =
    --update current
    SELECT e1.[UserId],
           e1.[Start],
           e1.[Region],
           e1.[Query],
           e1.[Duration],
           e1.[Urls],
           e1.[ClickedUrls]
    FROM [SearchLogDemo].[dbo].[SearchLog] AS t1
         INNER JOIN
             @searchlog AS e1
         ON t1.[UserId] == e1.[UserId]
 
    UNION ALL
 
    --insert new
    SELECT e2.[UserId],
           e2.[Start],
           e2.[Region],
           e2.[Query],
           e2.[Duration],
           e2.[Urls],
           e2.[ClickedUrls]
    FROM @searchlog AS e2
         LEFT OUTER JOIN
             [SearchLogDemo].[dbo].[SearchLog] AS t2
         ON t2.[UserId] == e2.[UserId]
    WHERE
    t2.[UserId] IS NULL
 
    UNION ALL
 
    --keep existing
    SELECT t3.[UserId],
           t3.[Start],
           t3.[Region],
           t3.[Query],
           t3.[Duration],
           t3.[Urls],
           t3.[ClickedUrls]
    FROM @searchlog AS e3
         RIGHT OUTER JOIN
             [SearchLogDemo].[dbo].[SearchLog] AS t3
         ON t3.[UserId] == e3.[UserId]
    WHERE
    e3.[UserId] IS NULL;

This union of data can then OUTPUT to our usable destination doing what U-SQL does well before resetting our ADL Analytics table for the next load. By reset, I mean TRUNCATE the table and INSERT everything from @AllData back into it. This preserves our history/our old data and allows the MERGE behaviour to work again and again using only SELECT statements.

Replacing @searchlog as the OUTPUT variable, you’ll then want to add the following code below the three SELECT statements.

OUTPUT @AllData
TO "/output/SearchLogAllData.csv"
USING Outputters.Csv();
 
TRUNCATE TABLE [SearchLogDemo].[dbo].[SearchLog];
 
INSERT INTO [SearchLogDemo].[dbo].[SearchLog]
(
    [UserId],
    [Start],
    [Region],
    [Query],
    [Duration],
    [Urls],
    [ClickedUrls]
)
SELECT [UserId],
       [Start],
       [Region],
       [Query],
       [Duration],
       [Urls],
       [ClickedUrls]
FROM @AllData;

If all goes well you can edit the ‘SearchLog.tsv’ file removing and changing data and keep rerunning the U-SQL script performing the MERGE behaviour. Please test away. Don’t just believe me that it works. As a bonus you get this pretty job diagram too…


The only caveat here is that we can’t deal with deletion detection from the source file… Unless we do something a little more complex for the current loading period. Let’s save that for a later blog post.

A couple of follow up general tips.

  • Have a USE DATABASE statement at the top of your scripts to ensure you hit the correct database. Just like T-SQL (see the snippet after this list).
  • If you’re struggling for fields to join on because you don’t have a primary key, you could use UNION instead of UNION ALL. But this of course takes more effort to work out the distinct values. Just like T-SQL.
  • Be careful with C# data types and case sensitivity. U-SQL is not as casual as T-SQL with such things.
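
To illustrate the first and last of those tips, here’s a tiny sketch assuming the SearchLogDemo database used above; the rowset name and output path are made up for the example.

//Tiny sketch for the tips above (SearchLogDemo assumed from earlier, output path made up).
USE DATABASE SearchLogDemo;
 
@typed =
    SELECT [UserId],
           ([Duration] ?? 0) AS DurationOrZero //Duration is a C# nullable int (int?), so handle nulls explicitly
    FROM [dbo].[SearchLog];
 
OUTPUT @typed
TO "/output/TipsExample.csv"
USING Outputters.Csv();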

That’s it. U-SQL merge behaviour achieved. I guess the bigger lesson here for the many T-SQL people out there is: don’t forget the basics, it’s still a structured query language. Syntax sugar is sweet, but not essential.

Hope this was helpful.

Many thanks for reading


Azure Data Lake Authentication from Azure Data Factory

To set the scene for the title of this blog post let’s firstly think about other services within Azure. You’ll probably already know that most services deployed require authentication via some form of connection string and generated key. These keys can be granted various levels of access and also recycled as required, for example an IoT Event Hub seen below (my favourite service to play with).


Then we have other services like SQLDB that require user credentials to authenticate as we would expect from the on premises version of the product. Finally we have a few other Azure services that handle authentication in a very different way altogether requiring both user credentials initially and then giving us session and token keys to be used by callers. These session and token keys are a lot more fragile than connection strings and can expire or become invalid if the calling service gets rebuilt or redeployed.

In this blog post I’ll explore and demonstrate specifically how we handle session and token based authentication for Azure Data Lake (ADL), firstly when calling it as a Linked Service from Azure Data Factory (ADF), then secondly within ADF custom activities. The latter of these two ADF based operations becomes a little more difficult because the .Net code created and compiled is unfortunately treated as a distant relative to ADF, requiring its own authentication to ADL storage as an Azure Application. To further clarify, a Custom Activity in ADF does not inherit its authorising credentials from the parent Linked Service; it is responsible for its own session/token. Why? Because, as you may know from reading my previous blog post, Custom Activities get compiled and executed by an Azure Batch Service, meaning the compute for the .Net code is very much detached from ADF.

At this point I should mention that this applies to both Data Lake Analytics and Data Lake Store. Both require the same approach to authentication.

Data Lake as a Service Within Data Factory

The easy one first, adding an Azure Data Lake service to your Data Factory pipeline. From the Azure portal within the ADF Author and Deploy blade you simply add a new Data Lake Linked Service which returns a JSON template for the operation into the right hand panel. Then we kindly get provided with an Authorize button (spelt wrong) at the top of the code block.

Clicking this will pop up with the standard Microsoft login screen requesting work or personal user details etc. Upon completion of successful authentication your Azure subscription will be inspected. If more than one applicable service exists, you’ll of course need to select which you require authorisation for. But once done you’ll return to the JSON template now with a completed Authorization value and SessionId.

Just for information and to give you some idea of the differences in this type of authorisation compared to other Azure services. When I performed this task for the purpose of creating screen shots in this post the resulting Authorization URL was 1219 characters long and the returned SessionId was 1100! Or half a page of a standard Word document each. By comparison an IoT Hub key is only 44 characters. Furthermore, the two values are customised to the service that requested them and can only be used within the context where they were created.

For completeness, because we can also now develop ADF pipelines from Visual Studio it’s worth knowing that a similar operation is now available as part of the Data Factory extension. In Visual Studio within your ADF project on the Linked Service branch you are able to Right Click > Add > New Item and choose Data Lake Store or Analytics. You’ll then be taken through a wizard (similar in look to that of the ADF deployment wizard) which requests user details, the ADF context and returns the same JSON template with populated authorising values.


A couple of follow up tips and lessons learnt here:

  • If you tell Visual Studio to reverse engineer your ADF pipeline from a currently deployed Azure factory where an existing ADL token and session ID are available, these will not be brought into Visual Studio and you’ll need to authorise the service again.
  • If you copy an ADL JSON template from the Azure portal ‘Author and Deploy’ area Visual Studio will not popup the wizard to authorise the service and you’ll need to do it again.
  • If you delete the ADL Linked Service within the portal ‘Author and Deploy’ area, the same Linked Service tokens in Visual Studio will become invalid and you’ll need to authorise the service again.
  • If you sneeze too loudly while Visual Studio is open you’ll need to authorise the service again.

Do you get the idea when I said earlier that the authorisation method is fragile? Very sophisticated, but fragile when chopping and changing things during development.

What you may find yourself doing fairly frequently is:

  1. Deploying an ADF project from Visual Studio.
  2. The deployment wizard failing, telling you the ADL tokens have expired or are no longer authorised.
  3. Adding a new Linked Service to the project just to get the user authentication wizard popup.
  4. Then copying the new token and session values into the existing ADL Linked Service JSON file.
  5. Then excluding the new services you created just to re-authorise from the Visual Studio project.

Fun! Moving on.

Update: you can use an Azure AD service principal to authenticate both Azure Data Lake Store and Azure Data Lake Analytics services from ADF. Details are included in this post: https://docs.microsoft.com/en-gb/azure/data-factory/v1/data-factory-azure-datalake-connector#azure-data-lake-store-linked-service-properties

Data Factory Custom Activity Calling Data Lake

Next the slightly more difficult way to authenticate against ADL, using an ADF .Net Custom Activity. As mentioned previously the .Net code once sent to Azure as a DLL is treated as a third party application requiring its own credentials.

The easiest way I’ve found to get this working is firstly to use PowerShell to register the application in Azure, which, using the correct cmdlets, returns an application GUID and password that combined give the .Net code its credentials. Here’s the PowerShell you’ll need below. Be sure to run this with elevated permissions locally.

# Sign in to Azure.
Add-AzureRmAccount
 
# Set these variables
$appName = "SomeNameThatYouWillRegoniseInThePortal"
$uri = "AValidURIAlthoughNotApplicableForThis"
$secret = "SomePasswordForTheApplication"
 
# Create an AAD app
$azureAdApplication = New-AzureRmADApplication `
    -DisplayName $appName `
    -HomePage $Uri `
    -IdentifierUris $Uri `
    -Password $secret
 
# Create a Service Principal for the app
$svcprincipal = New-AzureRmADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId
 
# To avoid a PrincipalNotFound error, I pause here for 15 seconds.
Start-Sleep -s 15
 
# If you still get a PrincipalNotFound error, then rerun the following until successful. 
$roleassignment = New-AzureRmRoleAssignment `
    -RoleDefinitionName Contributor `
    -ServicePrincipalName $azureAdApplication.ApplicationId.Guid
 
# The stuff you want:
 
Write-Output "Copy these values into the C# sample app"
 
Write-Output "_subscriptionId:" (Get-AzureRmContext).Subscription.SubscriptionId
Write-Output "_tenantId:" (Get-AzureRmContext).Tenant.TenantId
Write-Output "_applicationId:" $azureAdApplication.ApplicationId.Guid
Write-Output "_applicationSecret:" $secret
Write-Output "_environmentName:" (Get-AzureRmContext).Environment.Name

My recommendation here is to take the returned values and store them in something like the Class Library settings, available from the Visual Studio project properties. Don’t store them as constants at the top of your Class as it’s highly likely you’ll need them multiple times.

Next, what to do with the application GUID etc.? Well, your Custom Activity C# class will need something like the following. Apologies for dumping massive code blocks into this post, but you will need all of this in your Class if you want to use details from your ADF service and work with ADL files.

class SomeCustomActivity : IDotNetActivity
{
	//Get credentials for app
	string domainName = Settings.Default.AzureDomainName;
	string appId = Settings.Default.ExcelExtractorAppId; //From PowerShell <<<<<
	string appPass = Settings.Default.ExceExtractorAppPass; //From PowerShell <<<<<
	string appName = Settings.Default.ExceExtractorAppName; //From PowerShell <<<<<
 
	private static DataLakeStoreFileSystemManagementClient adlsFileSystemClient;
	//and or:
	private static DataLakeStoreAccountManagementClient adlsAccountManagerClient;
 
	public IDictionary<string, string> Execute(
		IEnumerable<LinkedService> linkedServices,
		IEnumerable<Dataset> datasets,
		Activity activity,
		IActivityLogger logger)
	{
		//Get linked service details from Data Factory
		Dataset inputDataset = new Dataset();
		inputDataset = datasets.Single(dataset =>
			dataset.Name == activity.Inputs.Single().Name);
 
		AzureDataLakeStoreLinkedService inputLinkedService;
 
		inputLinkedService = linkedServices.First(
			linkedService =>
			linkedService.Name ==
			inputDataset.Properties.LinkedServiceName).Properties.TypeProperties
			as AzureDataLakeStoreLinkedService;
 
		//Get account name for data lake and create credentials for app
		var creds = AuthenticateAzure(domainName, appId, appPass);
		string accountName = inputLinkedService.AccountName;
 
		//Authorise new instance of Data Lake Store
		adlsFileSystemClient = new DataLakeStoreFileSystemManagementClient(creds);
 
		/*
			DO STUFF...
 
			using (Stream input = adlsFileSystemClient.FileSystem.Open
				(accountName, completeInputPath)
				)
		*/	
 
 
		return new Dictionary<string, string>();
	}
 
 
	private static ServiceClientCredentials AuthenticateAzure
		(string domainName, string clientID, string clientSecret)
	{
		SynchronizationContext.SetSynchronizationContext(new SynchronizationContext());
 
		var clientCredential = new ClientCredential(clientID, clientSecret);
		return ApplicationTokenProvider.LoginSilentAsync(domainName, clientCredential).Result;
	}
}

Finally, before you execute anything be sure to grant the Azure app permissions to the respective Data Lake service. In the case of the Data Lake Store, from the portal you can use the Data Explorer blades to assign folder permissions.


I really hope this post has saved you some time in figuring out how to authorise Data Lake services from Data Factory. Especially when developing beyond what the ADF Copy Wizard gives you.

Many thanks for reading.


Paul’s Frog Blog

Paul is a Microsoft Data Platform MVP with 10+ years’ experience working with the complete on premises SQL Server stack in a variety of roles and industries. Now, as the Business Intelligence Consultant at Purple Frog Systems, he has turned his keyboard to big data solutions in the Microsoft cloud, specialising in Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics, Event Hubs and IoT. Paul is also a STEM Ambassador for the networking education in schools’ programme, PASS chapter leader for the Microsoft Data Platform Group – Birmingham, and a SQL Bits, SQL Relay and SQL Saturday speaker and helper. He is currently the Stack Overflow top user for Azure Data Factory, as well as a very active member of the technical community.
Thanks for visiting.
@mrpaulandrew