
Chaining Azure Data Factory Activities and Datasets

As I work more and more with Azure Data Factory (ADF) and help others in the community, I keep encountering confusion about how to construct a complete, dependency driven ADF solution; one that chains multiple executions and handles all of your requirements. In this post I hope to address some of that confusion and will allude to some emerging best practices for Azure Data Factory usage.

First a few simple questions:

  • Why is there confusion? In my opinion this is because the ADF copy wizard available via the Azure portal doesn’t help you architect a complete solution. It can be handy for reverse engineering certain things, but really the wizard tells you nothing about the choices you make or what the JSON behind them is doing. Like most wizards, it just leads to bad practices!
  • Do I need several data factory services for different business functions? No, you don’t have to. Pipelines within a single data factory service can be disconnected for different processes, and often having all your linked services in one place is easier to manage. Plus a single factory offers reusability and means a single set of source code, etc.
  • Do I need one pipeline per activity? No, you can house many activities in a single pipeline. Pipelines are just logical containers to assist you when managing data orchestration tasks. If you want an SSIS comparison, think of them as sequence containers. In a factory I may group all my on premises gateway uploads into a single pipeline. This means I can pause that stream of uploads on demand, maybe when the gateway key needs to be refreshed, etc.
  • Is the whole data factory a pipeline? Yes, in concept. But for technical terminology a pipeline is a specific ADF component. The marketing people do love to confuse us!
  • Can an activity support multiple inputs and multiple outputs? Generally yes. But there are exceptions depending on the activity type. U-SQL calls to Azure Data Lake can have multiples of both. ADF doesn’t care as long as you know what the called service is doing. On the other hand a copy activity needs to be one to one (so Microsoft can charge more for data movements).
  • Does an activity have to have an input dataset? No. For example, you can create a custom activity that executes your code for a defined time slice without an input dataset, just the output.

Datasets

Moving on, let’s go a little deeper and think about a scenario that I use in my community talks. We have an on premises CSV file. We want to upload it, clean it and aggregate the output. For each stage of this process we need to define a dataset for Azure Data Factory to use.

To be clear, a dataset in this context is not the actual data. It is just a set of JSON instructions that defines where and how our data is stored: for example, its file path, its extension, its structure and its relationship to the executing time slice.

Let’s define each of the datasets we need in ADF to complete the above scenario for just one file:

  1. The on premises version of the file. Linked to information about the data management gateway to be used, with local credentials and file server/path where it can be accessed.
  2. A raw Azure version of the file. Linked to information about the data lake storage folder to be used for landing the uploaded file.
  3. A clean version of the file. Linked to information about the output directory of the cleaning process.
  4. The aggregated output file. Linked to information about the output directory of the query being used to do the aggregation.

All of the linked information for these datasets should come from your ADF linked services.

So, we have one file to process, but in ADF we now need four datasets defined, one for each stage of the data flow. These datasets don’t need to be complex; something as simple as the following bit of JSON will do.

{
  "name": "LkpsCurrencyDataLakeOut",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "DataLakeStore",
    "structure": [ ],
    "typeProperties": {
      "folderPath": "Out",
      "fileName": "Dim.Currency.csv"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
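
The "linkedServiceName": "DataLakeStore" attribute above points at a linked service defined separately in the factory. As a rough, hedged sketch only (the account URI and names below are placeholders, not taken from the demo, and your version of the connector may expect different authentication attributes), an Azure Data Lake Store linked service looks something like this:

{
  "name": "DataLakeStore",
  "properties": {
    "type": "AzureDataLakeStore",
    "typeProperties": {
      "dataLakeStoreUri": "https://myadlsaccount.azuredatalakestore.net/webhdfs/v1",
      //placeholder account name
      "servicePrincipalId": "<application id>",
      "servicePrincipalKey": "<application key>",
      "tenant": "<tenant id>",
      "subscriptionId": "<subscription id>",
      "resourceGroupName": "<resource group name>"
    }
  }
}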

Activities

Next, our activities. Now that the datasets are defined above, we need ADF to invoke the services that are going to do the work for each stage. Using the activity type values from the JSON, these are as follows:

  1. Copy – uploads the file from local storage to Data Lake storage (input dataset 1, output dataset 2).
  2. DotNetActivity – performs transformation/cleaning on the raw source file (input dataset 2, output dataset 3).
  3. DataLakeAnalyticsU-SQL – aggregates the datasets to produce a reporting output (input dataset 3, output dataset 4).

From the above we can clearly see that the output dataset of the first activity becomes the input of the second, and the output dataset of the second activity becomes the input of the third. Apologies if this seems obvious, but I have known it to confuse people.
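
To make the chaining concrete, here is a minimal sketch of what the pipeline JSON could look like. The dataset and activity names are purely illustrative and the typeProperties, policy and scheduler blocks are trimmed, so treat this as a shape reference rather than something to deploy as-is.

{
  "name": "CurrencyPipeline",
  "properties": {
    "description": "Upload, clean and aggregate the currency file",
    "activities": [
      {
        "name": "UploadCurrencyFile",
        "type": "Copy",
        "inputs": [ { "name": "LkpsCurrencyOnPremises" } ],
        "outputs": [ { "name": "LkpsCurrencyDataLakeRaw" } ]
        //typeProperties, policy and scheduler omitted for brevity
      },
      {
        "name": "CleanCurrencyFile",
        "type": "DotNetActivity",
        //the output of the copy activity above is the input here
        "inputs": [ { "name": "LkpsCurrencyDataLakeRaw" } ],
        "outputs": [ { "name": "LkpsCurrencyDataLakeClean" } ]
      },
      {
        "name": "AggregateCurrencyFile",
        "type": "DataLakeAnalyticsU-SQL",
        //and the cleaned output becomes the input to the aggregation
        "inputs": [ { "name": "LkpsCurrencyDataLakeClean" } ],
        "outputs": [ { "name": "LkpsCurrencyDataLakeAggregated" } ]
      }
    ],
    "start": "2016-08-01",
    "end": "2017-06-01",
    "isPaused": true
  }
}

Whether these three activities live in one pipeline or several is a choice we make next.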

Pipelines

For our ADF pipeline(s) we can now make some decisions about how we want to manage the data flow.

  1. Add all the activities to a single pipeline meaning we can stop/start everything for this 1 dataset end to end.
  2. Add each activity to a different pipeline dependent on its type. This is my starting preference.
  3. Have the on premises upload in one pipeline and everything else in a second pipeline.
  4. Maybe separate your pipelines and data flows depending on the type of data, e.g. fact/dimension, or finance and HR.

The point here is that it doesn’t matter to ADF; it’s just down to how you want to control it. When I created the pipelines for my talk demo I went with option 2, meaning I get the following pretty diagram, arranged to fit the width of my blog 🙂

Here we can clearly see at the top level each dataset flowing into a pipeline and its child activity. If I’d constructed this using option 1 above I would simply see the first dataset and the fourth, with one pipeline box. I could then drill into the pipeline to see the chained activities within. To repeat, this doesn’t matter to ADF.

I hope you found the above useful and a good starting point for constructing your ADF data flows.

Best Practices

As our understanding of Azure Data Factory matures I’m sure some of the following points will need to be re-written, but for now I’m happy to go first and start laying the groundwork of what I consider to be best for ADF usage. Comments very welcome.

  1. Resist using the wizard, please.
  2. Keep everything within a single ADF service if you can. Meaning linked services can be reused.
  3. Decouple your on premises uploads into a single pipeline, for ease of management.
  4. Group your activities into natural pipeline containers for the operation type or data category.
  5. Layout your ADF diagram carefully. Left to right. It makes understanding it much easier for others.
  6. Use Visual Studio configuration files to deploy ADF projects between Dev/Test/Live. Ease of source control and development.
  7. Monitor activity concurrency and timeouts carefully. ADF will kill called service executions if these policy limits are breached (see the sketch after this list).
  8. Be mindful of activity cost and group inputs/outputs for data compute where possible.
  9. Use time slices to control your data volumes, e.g. pass the time slice as a parameter to the called compute service (again, see the sketch below).
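
For points 7 and 9, the snippet below sketches where those controls live in an activity definition: the policy block holds the concurrency and timeout limits, and an extended property passes the time slice value to the called service. The names and values are made up for illustration and the remaining typeProperties are trimmed.

{
  "name": "CleanCurrencyFile",
  "type": "DotNetActivity",
  "inputs": [ { "name": "LkpsCurrencyDataLakeRaw" } ],
  "outputs": [ { "name": "LkpsCurrencyDataLakeClean" } ],
  "policy": {
    "concurrency": 1,
    "executionPriorityOrder": "OldestFirst",
    "retry": 2,
    "timeout": "01:00:00"
  },
  "scheduler": {
    "frequency": "Day",
    "interval": 1
  },
  "typeProperties": {
    //assemblyName, entryPoint, packageLinkedService and packageFile trimmed
    "extendedProperties": {
      "SliceStart": "$$Text.Format('{0:yyyyMMdd}', SliceStart)"
    }
  }
}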

What next? Well, I’m currently working on this beast…

  • 127x datasets.
  • 71x activities.
  • 9x pipelines.

… and I’ve got about another third left to build!

Many thanks for reading.


Passing Parameters to U-SQL from Azure Data Factory

Let’s try and keep this post short and sweet. Diving right in, imagine a scenario where we have an Azure Data Factory (ADF) pipeline that includes activities to perform U-SQL jobs in Azure Data Lake (ADL) Analytics. We want to control the U-SQL by passing the ADF time slice value to the script, hopefully a fairly common use case. This isn’t yet that intuitive when constructing the ADF JSON activity, so I hope this post will save you some debugging time.

For my example I’ve created a stored procedure in my ADL Analytics database that accepts a parameter @TimeSliceStart as a string value in the format yyyyMMdd.

[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);

This doesn’t have to be a stored procedure. ADF is also happy if you give it U-SQL files or even just inline the entire script. Regardless, the ADF parameter handling is the same.
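
For reference, if you store the U-SQL in a script file rather than inlining it, the typeProperties reference that file through a storage linked service instead of using the script attribute. The sketch below shows the rough shape; the path and linked service names are examples only.

"typeProperties": {
  "scriptPath": "usqlscripts\\DoSomeStuff.usql",
  "scriptLinkedService": "StorageLinkedService",
  "degreeOfParallelism": 5,
  "priority": 1,
  "parameters": {
    "TimeSliceStart": "$$Text.Format('{0:yyyyMMdd}', Time.AddMinutes(SliceStart, 0))"
  }
}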

In my ADF JSON activity I then have the following:

{
  "name": "DataLakeJob1",
  "properties": {
    "description": "Run USQL with timeslice param",
    "activities": [
      {
        "type": "DataLakeAnalyticsU-SQL",
        "typeProperties": {
          "script": "[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);",
          "degreeOfParallelism": 5,
          "priority": 1,
          "parameters": {
            "TimeSliceStart": "$$Text.Format('{0:yyyyMMdd}', Time.AddMinutes(SliceStart, 0))"
          }
        }
        // etc ...
      }
    ]
    // etc ...
  }
}

Notice that within the activity typeProperties we have a parameters attribute, which can include children for the actual variables we want to pass to the U-SQL script.

Here are the important things to understand about this ADF parameters attribute.

  • The name of the parameter must match the name of the variable expected by U-SQL exactly.
  • As you would expect the data types of the expected variable and JSON parameter must match.
  • It is perfectly acceptable to have multiple parameters in the ADF JSON, and they can be written in any order.

So how does this work?…

What ADF does when calling ADL is take the parameters listed in the JSON and write out a set of U-SQL ‘DECLARE @Variable’ lines. These then get appended to the top of the actual U-SQL script before it is given to ADL as a job to run. You can see this if you go into the ADL Analytics blades in the Azure portal, select the job created by the ADF activity, then choose Duplicate Script. This reveals the actual U-SQL used in the job.

Here’s the proof.

[Screenshots: the duplicated U-SQL script from the ADL Analytics job blade, showing the ADF generated DECLARE line for @TimeSliceStart added to the top of the script.]
Then…

Just knowing what ADF does when converting the JSON parameters to U-SQL declared variables is the main takeaway here.

That’s it! I promised short and sweet.

Many thanks for reading.


Using Azure Data Factory Configuration Files

Like most things developed, it’s very normal to have multiple environments for the same solution: dev, test, prod, etc. Azure Data Factory is no exception to this. However, where it does differ slightly is the way it handles publishing to different environments from the Visual Studio tools provided. In this post we’ll explore exactly how to create Azure Data Factory (ADF) configuration files to support such deployments to different Azure services/directories.

For all the examples in this post I’ll be working with Visual Studio 2015 and the ADF extension available from the marketplace or via the below link.

https://marketplace.visualstudio.com/items?itemName=AzureDataFactory.MicrosoftAzureDataFactoryToolsforVisualStudio2015

Before we move on, let’s take a moment to say that Azure Data Factory configuration files are purely a Visual Studio feature. At publish time Visual Studio simply takes the config file content and replaces the actual JSON attribute values before deploying to Azure. That said, to be explicit:

  • An ADF JSON file with attribute values missing (because they come from config files) cannot be deployed using the Azure portal ‘Author and Deploy’ blade. This will just fail validation because of the missing content.
  • An ADF JSON config file cannot be deployed using the Azure portal ‘Author and Deploy’ blade. It is simply not understood by the browser based tool as a valid ADF schema.

Just as an aside: code comments in ADF JSON files are also purely a Visual Studio feature. You can only comment your JSON in the usual way in Visual Studio, which will strip the comments out for you at publish time. Any comments left in code that you copy and paste into the Azure portal will return as syntax errors! I have already given feedback to the Microsoft product team that code comments in the portal blades would be really handy. But I digress.

Apologies in advance if I switch between the words publish and deploy too much; I mean the same thing. I prefer deploy, but in a Visual Studio ADF solution it’s called publish in the menu.

Creating an ADF Configuration File

First, let’s use the Visual Studio tools to create a common set of configuration files. In a new ADF project you have the familiar tree including Linked Services, Pipelines, etc. Now right click on the project and choose Add > New Item. In the dialogue presented choose Config and add a Configuration File to the project, with a suitable name.

I went to town and did a set of three 🙂

Each time you add a config file to your ADF project, or any component for that matter, you’ll be aware that Visual Studio tries to help you out by giving you a JSON template or starter for what you might want. This is good, but in the case of ADF config files it isn’t that intuitive. Hence this blog post. Let’s move on.

Populating the Configuration File

Before we do anything let me attempt to put into words what we need to do here. Every JSON attribute has a reference path, of varying depth, that leads to its value. When we recreate a value in our config file we need to recreate this reference path exactly, starting from the top level component name. In the config file this component name goes in as a parent (at the same level as the schema), followed by square brackets [ ] which then contain the rest of the content we want to replace. Within the square brackets we need pairs of attributes (name and value) that represent the references into the actual component structure. In the ‘name’ value we start with a $. which represents the root of the component file, then build up the tree reference with a dot for each new level. Lastly, the value is exactly what it says: the value to be used instead of whatever may be written in the actual component file.

Make sense? Referencing JSON with JSON? I said it wasn’t intuitive. Let’s move on and see it.

Let’s populate our configuration files with something useful. What you might want to switch between environments of course depends greatly on what your data factory is doing, but let’s start with a few common attributes. For this example let’s alter a pipeline’s schedule start, end and isPaused values. I always publish to dev as paused to give me more control over running the pipeline.

At the bottom of our pipeline component file I’ve done the following.

    //etc...
	//activities block
	],
    "start": "1900-01-01", /*<get from config file>*/
    "end": "1900-01-01", /*<get from config file>*/
    "isPaused": /*<get from config file>*/,
    "pipelineMode": "Scheduled"
  }
}

… which means in my config file I need to create the equivalent set of attribute references and values. Note: the dollar for the root, then one level down into the properties namespace, then another dot before the attribute name.

{
  "ExactNameOfYourPipeline": [ // <<< Component name. Exactly!
    {
      "name": "$.properties.isPaused",
      "value": true
    },
    {
      "name": "$.properties.start",
      "value": "2016-08-01"
    },
    {
      "name": "$.properties.end",
      "value": "2017-06-01"
    }
  ]
}

A great thing about this approach with the ADF tools in Visual Studio is that any attribute value can be overridden with something from a config file. It’s really flexible, and each component can be added in the same way regardless of type (there’s a further example after the list below). There are however some quirks/features to be aware of, as below.

  • All parent and child name referencing within the config file must match its partner in the actual component JSON file exactly.
  • All referencing is case sensitive, but Visual Studio won’t validate this for you in IntelliSense or when building the project.
  • In the actual component file some attribute values can be left blank because they come from config. Others cannot, and leaving them blank will result in the ADF project failing to build.
  • For any config referencing that fails, you’ll only figure it out when you publish and check the Azure portal to see that the deployed JSON file still has its original content. Fun.
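
To illustrate the point about any component type being overridable, here is a hedged example of a config entry overriding a storage linked service connection string. The linked service name and connection string values are placeholders, not from a real factory.

{
  "StorageLinkedService": [
    {
      "name": "$.properties.typeProperties.connectionString",
      "value": "DefaultEndpointsProtocol=https;AccountName=mydevstorageaccount;AccountKey=<dev account key>"
    }
  ]
}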

Right then. Hope that’s all clear as mud 🙂

Publishing using Different Configurations

Publishing is basically the easy bit, involving a wizard, so I don’t need to say much here.

Right click on the project in Visual Studio and choose Publish. In the publish items panel of the wizard simply select the config file you want to use for the deployment.

I hope this post is helpful and saves you some time when developing with ADF.

Many thanks for reading.


Creating Azure Data Factory Custom Activities

When creating an Azure Data Factory (ADF) solution you’ll quickly find that currently its connectors are pretty much limited to other Azure services, and the T within ETL (Extract, Transform, Load) is missing altogether. In these situations where other functionality is required we need to rely on the extensibility of Custom Activities. A Custom Activity allows the use of .Net programming within your ADF pipeline. However, getting such an activity set up can be tricky and requires a fair bit of messing about. In this post I hope to get you started with all the basic plumbing needed to use the ADF Custom Activity component.

Visual Studio

Firstly, we need to get the Azure Data Factory tools for Visual Studio, available via the below link. This makes the process of developing custom activities and ADF pipelines a little bit easier compared to doing all the development work in the Azure portal. But be warned: because this stuff is still fairly new there are some pain points/quirks to overcome, which I’ll point out.

https://visualstudiogallery.msdn.microsoft.com/371a4cf9-0093-40fa-b7dd-be3c74f49005

Once you have this extension available in Visual Studio, create yourself a new solution with two projects: a Data Factory project and a C# Class Library. You can of course use VB if you prefer.
[Screenshot: the Visual Studio solution with the Data Factory project and the C# class library project.]

Azure Services

Next, like the Visual Studio section above, this is really a set of prerequisites for making the ADF custom activity work. Assuming you already have an ADF service running in your Azure subscription, you’ll also need:

  • Azure Batch Service (ABS) – this acts as the compute for your C# called by the ADF custom activity. The ABS is a strange service, as you’ll find when you spin one up. Under the hood it’s basically a virtual machine requiring CPU, RAM and an operating system (Windows or Linux), which you have to choose when deploying it. But none of the graphical interface is available to use in a typical way; there is no RDP access to the Windows server below. Instead you give the service a compute Pool, to which you need to assign CPU cores. The pool in turn has Tasks created in it by the calling services. Sadly, because ADF is just for orchestration, we need this virtual machine style glue and compute layer to handle our compiled C#.
  • Azure Storage Account – this is required to house your compiled C# in its binary .DLL form. As you’ll see further down, this actually gets zipped up as well, with all its supporting packages. It would be nice if the ABS allowed access to the OS storage for this, but no such luck I’m afraid.

At this point, if you’re doing this for the first time you’ll probably be thinking the same as me… Why on earth do I need all this extra engineering? What are these additional services going to cost? And why can I not simply inline my C# in the ADF JSON pipeline and have it handle the execution?

Well, I have voiced these very questions to the Microsoft Azure Research team and the Microsoft Tiger Team. The only rational answer is to keep ADF as a dumb orchestrator that simply runs other services. Which would be fine if it didn’t need this extensibility to do such simple things. This then leads into the argument about whether ADF is really designed for data transformation. Should it just be for E and L, not T?

Let’s bottle up these frustrations for another day before this blog post turns into a rant!

C# Class Library

Moving on. Now, for those of you that have ever read my posts before, you’ll know that I don’t claim to be a C# expert. Well, today is no exception! Expect fluffy descriptions in the next bit 🙂

First, in your class library project let’s add the NuGet packages and references you’ll need for it to work with ADF. Using the Package Manager Console (Visual Studio > Tools > NuGet Package Manager > Package Manager Console) run the following installation lines to add all your required references.

Install-Package Microsoft.Azure.Management.DataFactories
Install-Package Azure.Storage

Next the fun bit. Whatever class name you decide to use, it will need to inherit from IDotNetActivity, which is the interface used at runtime by ADF. Then within your new class you need to create an Execute method that returns an IDictionary. It is this method that will be run by the ABS when called from ADF.

Within the Execute method, extended properties and details about the datasets and services on each side of the custom activity pipeline can be accessed. Here is the minimum you’ll need to connect the dots between ADF and your C#.

using System;
using System.Collections.Generic;
using System.Linq;
 
using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;
 
namespace ClassLibrary1
{
    public class Class1 : IDotNetActivity
    {
        public IDictionary<string, string> Execute(
                IEnumerable<LinkedService> linkedServices,
                IEnumerable<Dataset> datasets,
                Activity activity,
                IActivityLogger logger)
        {
            logger.Write("Start");
 
            //Get extended properties
            DotNetActivity dotNetActivityPipeline = (DotNetActivity)activity.TypeProperties;
 
            string sliceStartString = dotNetActivityPipeline.ExtendedProperties["SliceStart"];
 
            //Get linked service details
            Dataset inputDataset = datasets.Single(dataset => dataset.Name == activity.Inputs.Single().Name);
            Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name);
 
            /*
                DO STUFF
            */
 
            logger.Write("End");
 
            return new Dictionary<string, string>();
        }
    }
}

How you use the declared datasets will greatly depend on the linked services you have in and out of the pipeline. You’ll notice that I’ve also called the IActivityLogger Write method to make user log entries. I’ll show you later where these get written to in the Azure portal.

I appreciate that the above code block isn’t actually doing anything and that it’s probably just raised another load of questions. Patience, more blog posts are coming! Depending on what other Azure services you want your C# class to use, next we’d have to think about registering it as an Azure app so the compiled program can authenticate against other components. Sorry, but that’s for another time.

The last and most important thing to do here is add a reference to the C# class library in your ADF project. This is critical for a smooth deployment of the solution and compiled C#.

Data Factory

Within your new or existing ADF project you’ll need to add a couple of things specifically for the custom activity. I’m going to assume you have some datasets/data tables defined for the pipeline input and output.

Linked services first, corresponding to the above and what you should now have deployed in the Azure portal:

  • Azure Batch Linked Service – I would like to say that when presented with the JSON template for the ABS, filling in the gaps is pretty intuitive for even the most non-technical people amongst us. However, the names and descriptions are wrong within the typeProperties component! Here’s my version below with the corrections and elaborations on the standard Visual Studio template. Please extend your sympathies for the pain it took me to figure out where the values don’t match the attribute tags!
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/
              Microsoft.DataFactory.LinkedService.json",
    "name": "AzureBatchLinkedService1",
    "properties": {
        "type": "AzureBatch",
      "typeProperties": {
        "accountName": "<Azure Batch account name>",
        //Fine - get it from the portal, under service properties.
 
        "accessKey": "<Azure Batch account key>",
        //Fine -  get it from the portal, under service properties.
 
        "poolName": "<Azure Batch pool name>",
        //WRONG - this actually needs to be the pool ID
        //that you defined when you deployed the service.
        //Using the Pool Name will error during deployment.
 
        "batchUri": "<Azure Batch uri>",
        //PARTLY WRONG - this does need to be the full URI that you
        //get from the portal. You need to exclude the batch
        //account name. So just something like https://northeurope.batch.azure.com
        //depending on your region.
        //With the full URI you'll get a message that the service can't be found!
 
        "linkedServiceName": "<Specify associated storage linked service reference here>"
        //Fine - as defined in your Data Factory. Not the storage
        //account name from the portal.
      }
    }
}
  • Azure Storage Linked Service – the JSON template here is OK to trust. It only requires the connection string for your blob store, which can be retrieved from the Azure portal and inserted in full. Nice simple authentication. A completed example is sketched just below.
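
For completeness, a filled-in storage linked service might look something like the following. The linked service name, account name and key are placeholders only.

{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<storage account key>"
    }
  }
}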

Once we have the linked services in place let’s add the pipeline. It’s worth noting that by pipeline I mean the ADF component that houses our activities; a pipeline is not the entire ADF end to end solution in this context, although many people incorrectly use it as a broad term for all things ADF.

  • Dot Net Activity – here we need to give ADF all the bits it needs to go away and execute our C#, which is again defined in the typeProperties. Below is a JSON snippet of just the typeProperties block that I’ve commented on to go into more detail about each attribute, followed by a sketch of how that block sits within the wider activity definition.
"typeProperties": {
  "assemblyName": "<Name of the output DLL to be used by the activity. e.g: MyDotNetActivity.dll>",
  //Once your C# class library has been built, the DLL name will come from the name of the
  //project in Visual Studio by default. You can also change this in the project properties
  //if you wish.
 
  "entryPoint": "<Namespace and name of the class that implements the IDotNetActivity interface e.g: MyDotNetActivityNS.MyDotNetActivity>",
  //This needs to include the namespace as well as the class, which is what the default is
  //alluding to where the dot separation is used. Typically your namespace will be inherited
  //from the project default. You might override this to be the CS filename though, so be careful.
 
  "packageLinkedService": "<Name of the linked service that refers to the blob that contains the zip file>",
  //Just to be clear: your storage account linked service name.
 
  "packageFile": "<Location and name of the zip file that was uploaded to the Azure blob storage e.g: customactivitycontainer/MyDotNetActivity.zip>"
  //Here's the ZIP file. If you haven't already, you'll need to create a container in your
  //storage account under blobs and reference that here. The ZIP filename will be the same
  //as the DLL file name. Don't worry about where the ZIP file gets created just yet.
}
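
To show where that typeProperties block sits, here is a hedged sketch of the full activity definition within the pipeline. The dataset, linked service and schedule values are illustrative; the assembly and entry point names follow the default ClassLibrary1 naming used in the C# example above.

{
  "name": "MyDotNetActivityPipeline",
  "properties": {
    "activities": [
      {
        "name": "MyDotNetActivity",
        "type": "DotNetActivity",
        "linkedServiceName": "AzureBatchLinkedService1",
        "inputs": [ { "name": "InputDataset" } ],
        "outputs": [ { "name": "OutputDataset" } ],
        "typeProperties": {
          "assemblyName": "ClassLibrary1.dll",
          "entryPoint": "ClassLibrary1.Class1",
          "packageLinkedService": "StorageLinkedService",
          "packageFile": "customactivitycontainer/ClassLibrary1.zip",
          "extendedProperties": {
            //this is what the C# reads via ExtendedProperties["SliceStart"]
            "SliceStart": "$$Text.Format('{0:yyyyMMddHH}', SliceStart)"
          }
        },
        "policy": {
          "concurrency": 1,
          "retry": 2,
          "timeout": "02:00:00"
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        }
      }
    ],
    "start": "2016-08-01",
    "end": "2017-06-01",
    "isPaused": true
  }
}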

By now you should have a solution that looks something like the solution explorer panel on the right. In mine I’ve kept all the default naming conventions for ease of understanding.

Deployment Time

If you have all the glue in place you can now right click on your ADF project and select Publish. This launches a wizard which takes you through the deployment process. Again, I’ve made an assumption here that you are logged into Visual Studio with the correct credentials for your Azure subscription. The wizard will guide you through where the ADF project is going to be deployed, validate the JSON content before sending it up, and detect whether files in the target ADF service can be deleted.

With the reference in place to the C# class library the deployment wizard will detect the project dependency and zip up the compiled DLLs from your bin folder and upload them into the blob storage linked service referenced in the activity pipeline.

Sadly there is no local testing available for this lot, so we just have to develop by trial/deploy/run and error.

Runtime

To help with debugging from the portal, if you go to the ADF Monitor & Manage area you should have your pipeline displayed. Clicking on the custom activity block will reveal the log files in the right hand panel. The first is the default system stack trace and the other is anything written out by the C# logger.Write call(s). These will become your new best friend when trying to figure out what isn’t working.

Of course you don’t need to perform a full publish of the ADF project every time if you’re only developing the C# code. Simply build the solution and upload a new ZIP file to your blob storage account using something like Microsoft Azure Storage Explorer. Then rerun the time slice for the output dataset.
[Screenshot: ADF monitoring view.]
If nothing appears to be happening you may also want to check your ABS to ensure tasks are being created by ADF. If you haven’t assigned the compute pool any CPU cores it will just sit there, and your ADF pipeline activity will time out with no errors and no clues as to what might have gone wrong. Trust me, I’ve been there too.
[Screenshot: Azure Batch monitoring view.]
I hope this post was helpful and gave you a steer as to the requirements for extending your existing ADF solutions with a .Net activity.

Many thanks for reading.

Azure Virtual Machine CPU Cores Quota

Did you know that by default your Azure subscription has a limit on the number of CPU cores you can consume for virtual machines? Until recently I didn’t. Maybe, like you, I’d only ever created one or two small virtual machines (VMs) within my MSDN subscription for testing and always ended up deleting them before coming close to the limit.

Per Azure subscription the default quota for virtual machine CPU cores is 20.

To clarify, because the relationship between virtual and physical CPU cores can get a little confusing: this initial limit is directly related to what you see in the Azure portal when sizing your VM during creation.

For example, using the range of basic virtual machine sizes you would hit your default limit with:

  • 10x A2 Basics
  • 5x A3 Basics
  • 2.5x A4 Basics (if it were possible to have 0.5 of a VM)

Fortunately this initial quota is only a soft limit and easily lifted for your production Azure subscriptions.

Before we look at how to solve this problem it’s worth learning how to recognise when you’ve hit the VM CPU core limit, as it’s not always obvious.

Limit Symptoms Lesson 1

When creating a new VM you might find that some of the larger size options are greyed out without any obvious reason why. This will be because consumed cores + new machine size cores would be greater than your current subscription limit, therefore ‘Not Available’. Example below.
[Screenshot: greyed-out VM sizes in the Azure portal size blade.]

The other important quirk here is that the size blade will only dynamically alter its colouring and availability state for CPU cores currently being used by VMs that are running. If you have already reached your limit, but the VMs are stopped and appear as de-allocated resources, you will still be able to proceed and deploy another VM exceeding your limit. This becomes clearer in lesson 2 below, where the failure occurs.

Limit Symptoms Lesson 2

If you are unlucky enough to have exceeded your limit with de-allocated resources and you were able to get past the VM size selection without issue, your new VM will now be deploying on your Azure dashboard, if pinned. All good, right?… Wrong! You’ll then hit a deployment failure alert from the lovely notifications bell. Example on the right.

Note: I took this screenshot from a different Azure subscription where I’d already increased my quota to 40 cores.

This could occur in the following scenario.

  • You have 2x A4 Basic virtual machines already created. 1x is running and 1x is stopped.
  • The stopped VM meant you were able to proceed in creating a third A4 Basic VM.
  • During deployment the Azure portal has now done its sums for all resources, covering both stopped and running VMs.
  • 8x cores on running VM + 8x cores on stopped VM + 8x cores on newly deployed VM. Total 24.
  • This has exceeded your limit by 4x cores.
  • Deployment failed.

In short, this is the difference in behaviour between validation of your VM at creation time vs validation of your VM at deployment time. AKA a feature!

Limit Symptoms Lesson 3

If deploying VMs using a JSON template you will probably be abstracted away from the size and cores consumed by the new VM, because in the default template this is just hardcoded into the relevant JSON attribute.

Upon clicking ‘Create’ on the portal blade you will be presented with an error similar to the example on the right. This is of course a little more obvious compared to lesson 1, and more helpful than lesson 2 in the sense that the deployment hasn’t been started yet. But still, this doesn’t really give you much in terms of a solution unless you are already aware of the default quota.

Apparently my storage account name wasn’t correct when I took this screenshot either. Maybe another blog post is required here covering where the Azure portal is case sensitive and where it isn’t! Moving on.

 

The Solution

As promised, the solution to all the frustration you’ve encountered above.

To increase the number of CPU cores available to your Azure subscription you will need to raise a support ticket with the Azure help desk… Don’t sigh!… I assure you this is not as troublesome as you might think.

Within the Help and Support section of your Azure portal there are a series of predefined menus to do exactly this. Help and Support will be on your dashboard by default, or available via the far left hand root menu.

Within the Help and Support blade click to add a New Support Request. Then follow the prompts selecting the Issue Type as ‘Quota’, your affected Subscription and the Quota Type as ‘Cores per Subscription’.

[Screenshot: the new support request blade with the quota type options.]

Once submitted, a friendly Microsoft human will review and approve the request to make sure it’s reasonable… requesting 1000 extra CPU cores might get rejected! For me, requesting an additional 30 cores took only hours to get approved and made available, not the 2 – 4 business days the expectation-managing auto reply would have you believe.

Of course I can’t promise this will always happen as quickly, so my advice would be: know your current limits and allow time to change them if you need to scale your production systems.

I hope this post saved you the time I lost when creating 17x VMs for a community training session.

Many thanks for reading.


Paul’s Frog Blog

Paul is a Microsoft Data Platform MVP with 10+ years’ experience working with the complete on premises SQL Server stack in a variety of roles and industries. Now, as a Business Intelligence Consultant at Purple Frog Systems, he has turned his keyboard to big data solutions in the Microsoft cloud, specialising in Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics, Event Hubs and IoT. Paul is also a STEM Ambassador for the networking education in schools’ programme, PASS chapter leader for the Microsoft Data Platform Group – Birmingham, a SQL Bits, SQL Relay and SQL Saturday speaker and helper, and currently the top Stack Overflow user for Azure Data Factory, as well as a very active member of the technical community.
Thanks for visiting.
@mrpaulandrew