
Creating Azure Data Factory Custom Activities

When creating an Azure Data Factory (ADF) solution you’ll quickly find that its connectors are currently limited to little more than other Azure services, and that the T within ETL (Extract, Transform, Load) is missing altogether. Where other functionality is required we need to rely on the extensibility of Custom Activities. A Custom Activity allows the use of .NET programming within your ADF pipeline. However, getting such an activity set up can be tricky and requires a fair bit of messing about. In this post I hope to get you started with all the basic plumbing needed to use the ADF Custom Activity component.

Visual Studio

Firstly, we need to get the Azure Data Factory tools for Visual Studio, available via the link below. This makes developing custom activities and ADF pipelines a little bit easier compared to doing all the development work in the Azure portal. But be warned, because this stuff is still fairly new there are some pain points/quirks to overcome, which I’ll point out.

https://visualstudiogallery.msdn.microsoft.com/371a4cf9-0093-40fa-b7dd-be3c74f49005

Once you have this extension available in Visual Studio, create yourself a new solution with two projects: a Data Factory project and a C# Class Library. You can of course use VB if you prefer.

Azure Services

Next, like the Visual Studio section above, this is really a set of prerequisites for making the ADF custom activity work. Assuming you already have an ADF service running in your Azure subscription, you’ll also need:

  • Azure Batch Service (ABS) – this acts as the compute for your C# called by the ADF custom activity. The ABS is a strange service, as you’ll find when you spin one up. Under the hood it’s basically a virtual machine requiring CPU, RAM and an operating system, which you have to choose when deploying it (Windows or Linux available). But none of the graphical interface is available to use in the typical way; there is no RDP access to the Windows server below. Instead you give the service a compute Pool, to which you need to assign CPU cores. The pool in turn has Tasks created in it by the calling services. Sadly, because ADF is just for orchestration, we need this virtual machine style glue and compute layer to handle our compiled C#.
  • Azure Storage Account (ASC) – this is required to house your compiled C# in its binary .DLL form. As you’ll see further down, this actually gets zipped up as well with all its supporting packages. It would be nice if the ABS allowed access to the OS storage for this, but no such luck I’m afraid.

At this point, if you’re doing this for the first time you’ll probably be thinking the same as me… Why on earth do I need all this extra engineering? What are these additional services going to cost? And why can I not simply inline my C# in the ADF JSON pipeline and get it to handle the execution?

Well, I have voiced these very questions to the Microsoft Azure Research team and the Microsoft Tiger Team. The only rational answer is to keep ADF as a dumb orchestrator that simply runs other services. Which would be fine if it didn’t need this extensibility to do such simple things. This then leads into the argument about ADF being designed for data transformation. Should it just be for E and L, not T?

Let’s bottle up these frustrations for another day before this blog post turns into a rant!

C# Class Library

Moving on. Now for those of you that have ever read my posts before you’ll know that I don’t claim to be a C# expert. Well today is no exception! Expect fluffy descriptions in the next bit 🙂

First, in your class project let’s add the NuGet packages and references you’ll need for the library project to work with ADF. Using the Package Manager Console (Visual Studio > Tools > NuGet Package Manager > Package Manager Console), run the following installation lines to add all your required references.

Install-Package Microsoft.Azure.Management.DataFactories
Install-Package Azure.Storage

Next the fun bit. Whatever class name you decide to use, it will need to implement IDotNetActivity, which is the interface used at runtime by ADF. Then within your new class you need to create a method called Execute that returns an IDictionary. It is this method that will be run by the ABS when called from ADF.

Within the Execute method, extended properties and details about the datasets and services on each side of the custom activity pipeline can be accessed. Here is the minimum of what you’ll need to connect the dots between ADF and your C#.

using System;
using System.Collections.Generic;
using System.Linq;

using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

namespace ClassLibrary1
{
    public class Class1 : IDotNetActivity
    {
        //Entry point run by the ABS when the custom activity is called from ADF
        public IDictionary<string, string> Execute(
                IEnumerable<LinkedService> linkedServices,
                IEnumerable<Dataset> datasets,
                Activity activity,
                IActivityLogger logger)
        {
            logger.Write("Start");

            //Get extended properties
            DotNetActivity dotNetActivityPipeline = (DotNetActivity)activity.TypeProperties;

            string sliceStartString = dotNetActivityPipeline.ExtendedProperties["SliceStart"];

            //Get the input and output datasets of the activity
            Dataset inputDataset = datasets.Single(dataset => dataset.Name == activity.Inputs.Single().Name);
            Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name);

            /*
                DO STUFF
            */

            logger.Write("End");

            return new Dictionary<string, string>();
        }
    }
}

How you use the declared datasets will greatly depend on the linked services you have in and out of the pipeline. Note that the SliceStart extended property read above has to be defined in the activity JSON, otherwise the lookup will fail at runtime. You’ll notice that I’ve also called the IActivityLogger using the Write method to make user log entries. I’ll show you where this gets written to later from the Azure portal.

I appreciate that the above code block isn’t actually doing anything and that it’s probably just raised another load of questions. Patience, more blog posts are coming! Depending on what other Azure services you want your C# class to use next we’ll have to think about registering it as an Azure app so the compiled program can authenticate against other components. Sorry, but that’s for another time.

The last and most important thing to do here is add a reference to the C# class library in your ADF project. This is critical for a smooth deployment of the solution and compiled C#.

Data Factory

Within your new or existing ADF project you’ll need to add a couple of things, specifically for the custom activity. I’m going to assume you have some datasets/data tables defined for the pipeline input and output.

Linked services first, corresponding to the above and what you should now have deployed in the Azure portal:

  • Azure Batch Linked Service – I would like to say that, when presented with the JSON template for the ABS, filling in the gaps is pretty intuitive for even the most non-technical peeps amongst us. However, the names and descriptions are wrong within the typeProperties component! Here’s my version below with the corrections and elaborations on the standard Visual Studio template. Please extend your sympathies for the pain it took me to figure out where the values don’t match the attribute tags!
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.LinkedService.json",
  "name": "AzureBatchLinkedService1",
  "properties": {
    "type": "AzureBatch",
    "typeProperties": {
      "accountName": "<Azure Batch account name>",
      //Fine - get it from the portal, under service properties.

      "accessKey": "<Azure Batch account key>",
      //Fine - get it from the portal, under service properties.

      "poolName": "<Azure Batch pool name>",
      //WRONG - this actually needs to be the pool ID
      //that you defined when you deployed the service.
      //Using the Pool Name will error during deployment.

      "batchUri": "<Azure Batch uri>",
      //PARTLY WRONG - this must not be the full URI that you
      //get from the portal. You need to exclude the batch
      //account name. So just something like https://northeurope.batch.azure.com
      //depending on your region.
      //With the full URI you'll get a message that the service can't be found!

      "linkedServiceName": "<Specify associated storage linked service reference here>"
      //Fine - as defined in your Data Factory. Not the storage
      //account name from the portal.
    }
  }
}
  • Azure Storage Linked Service – the JSON template here is ok to trust. It only requires the connection string for your blob store which can be retrieved from the Azure Portal and inserted in full. Nice simple authentication.
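
For reference, here is a rough sketch of what a completed Azure Storage linked service might look like. The name AzureStorageLinkedService1 is only an assumption on my part (whatever you call it just needs to match the linkedServiceName used in the batch linked service), and the angle-bracket placeholders need swapping for your own account details.

```json
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.LinkedService.json",
  "name": "AzureStorageLinkedService1",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<storage account key>"
    }
  }
}
```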

Once we have the linked services in place let’s add the pipeline. It’s worth noting that by pipeline I mean the ADF component that houses our activities. A pipeline is not the entire ADF end to end solution in this context. Many people incorrectly use it as a broad term for all things ADF.

  • Dot Net Activity – here we need to give ADF all the bits it needs to go away and execute our C#, which again is defined in the typeProperties. Below is a JSON snippet of just the typeProperties block that I’ve commented on to go into more detail about each attribute.
"typeProperties": {
  "assemblyName": "<Name of the output DLL to be used by the activity. e.g: MyDotNetActivity.dll>",
  //Once your C# class library has been built the DLL name will be the same as the
  //project name in Visual Studio by default. You can also change this in the project
  //properties if you wish.

  "entryPoint": "<Namespace and name of the class that implements the IDotNetActivity interface e.g: MyDotNetActivityNS.MyDotNetActivity>",
  //This needs to include the namespace as well as the class, which is what the default is
  //alluding to where the dot separation is used. Typically your namespace will be inherited
  //from the project default. You might override this to be the CS filename though so be careful.

  "packageLinkedService": "<Name of the linked service that refers to the blob that contains the zip file>",
  //Just to be clear. Your storage account linked service name.

  "packageFile": "<Location and name of the zip file that was uploaded to the Azure blob storage e.g: customactivitycontainer/MyDotNetActivity.zip>"
  //Here's the ZIP file. If you haven't already you'll need to create a container in your
  //storage account under blobs. Reference that here. The ZIP filename will be the same
  //as the DLL file name. Don't worry about where the ZIP file gets created just yet.
}
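
To show where that typeProperties block sits, here is a hedged sketch of a complete activity entry as it might appear inside the pipeline's activities array. The activity, dataset and linked service names (DotNetActivity1, InputDataset, OutputDataset, AzureStorageLinkedService1), the container name and the policy and scheduler values are all assumptions for illustration, so swap in your own. The extendedProperties section is where the SliceStart value read by the C# class earlier comes from.

```json
{
  "name": "DotNetActivity1",
  "type": "DotNetActivity",
  "linkedServiceName": "AzureBatchLinkedService1",
  "inputs": [ { "name": "InputDataset" } ],
  "outputs": [ { "name": "OutputDataset" } ],
  "typeProperties": {
    "assemblyName": "ClassLibrary1.dll",
    "entryPoint": "ClassLibrary1.Class1",
    "packageLinkedService": "AzureStorageLinkedService1",
    "packageFile": "customactivitycontainer/ClassLibrary1.zip",
    "extendedProperties": {
      "SliceStart": "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))"
    }
  },
  "policy": {
    "concurrency": 1,
    "retry": 3,
    "timeout": "00:30:00"
  },
  "scheduler": {
    "frequency": "Hour",
    "interval": 1
  }
}
```

Note that linkedServiceName on the activity points at the Azure Batch linked service (the compute), while packageLinkedService points at the storage linked service holding the ZIP file.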

By now you should have a solution with the Data Factory project and the C# class library sitting side by side in Solution Explorer. In mine I’ve kept all the default naming conventions for ease of understanding.

Deployment Time

If you have all the glue in place you can now right click on your ADF project and select Publish. This launches a wizard which takes you through the deployment process. Again I’ve made an assumption here that you are logged into Visual Studio with the correct credentials for your Azure subscription. The wizard will guide you through where the ADF project is going to be deployed. It will also validate the JSON content before sending it up and detect whether files in the target ADF service can be deleted.

With the reference in place to the C# class library the deployment wizard will detect the project dependency and zip up the compiled DLLs from your bin folder and upload them into the blob storage linked service referenced in the activity pipeline.

Sadly there is no local testing available for this lot and we just have to develop by trial/deploy/run and error.

Runtime

To help with debugging, if you go to the ADF Monitor & Manage area from the portal you should have your pipeline displayed. Clicking on the custom activity block will reveal the log files in the right hand panel. The first is the default system stack trace and the other is anything written out by the C# logger.Write call(s). These will become your new best friends when trying to figure out what isn’t working.

Of course you don’t need to perform a full publish of the ADF project every time if you’re only developing the C# code. Simply build the solution and upload a new ZIP file to your blob storage account using something like Microsoft Azure Storage Explorer. Then rerun the time slice for the output dataset.

If nothing appears to be happening you may also want to check on your ABS to ensure tasks are being created from ADF. If you haven’t assigned the compute pool any CPU cores it will just sit there and your ADF pipeline activity will time out with no errors and no clues as to what might have gone wrong. Trust me, I’ve been there too.

I hope this post was helpful and gave you a steer as to the requirements for extending your existing ADF solutions with a .NET custom activity.

Many thanks for reading.

Paul’s Frog Blog

Paul is a Microsoft Data Platform MVP with 10+ years’ experience working with the complete on premises SQL Server stack in a variety of roles and industries. Now, as the Business Intelligence Consultant at Purple Frog Systems, he has turned his keyboard to big data solutions in the Microsoft cloud, specialising in Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics, Event Hubs and IoT. Paul is also a STEM Ambassador for the networking education in schools’ programme, PASS chapter leader for the Microsoft Data Platform Group – Birmingham, and a SQL Bits, SQL Relay and SQL Saturday speaker and helper. He is currently the Stack Overflow top user for Azure Data Factory, as well as a very active member of the technical community.
Thanks for visiting.
@mrpaulandrew