When creating an Azure Data Factory (ADF) solution you’ll quickly find that its connectors are currently limited to other Azure services, and that the T in ETL (Extract, Transform, Load) is missing altogether. Where other functionality is required we need to rely on the extensibility of Custom Activities. A Custom Activity allows the use of .Net code within your ADF pipeline. However, getting such an activity set up can be tricky and requires a fair bit of messing about. In this post I hope to get you started with all the basic plumbing needed to use the ADF Custom Activity component.
Visual Studio
Firstly, we need the Azure Data Factory tools for Visual Studio, available via the link below. This makes developing custom activities and ADF pipelines a little easier than doing all the work in the Azure portal. But be warned: because this stuff is still fairly new there are some pain points/quirks to overcome, which I’ll point out.
https://visualstudiogallery.msdn.microsoft.com/371a4cf9-0093-40fa-b7dd-be3c74f49005
Once you have this extension available in Visual Studio, create yourself a new solution with two projects: a Data Factory project and a C# Class Library. You can of course use VB if you prefer.
Azure Services
Next, like the Visual Studio section above, this is really a set of prerequisites for making the ADF custom activity work. Assuming you already have an ADF service running in your Azure subscription, you’ll also need:
- Azure Batch Service (ABS) – this acts as the compute for your C#, called by the ADF custom activity.
The ABS is a strange service, as you’ll find when you spin one up. Under the hood it’s basically a virtual machine requiring CPU, RAM and an operating system (Windows or Linux), which you have to choose when deploying it. But none of the graphical interface is available in the usual way – there’s no RDP access to the Windows server underneath. Instead you give the service a compute pool, to which you assign CPU cores, and the calling services create tasks in that pool. Sadly, because ADF is just for orchestration, we need this virtual-machine-style glue and compute layer to handle our compiled C#.
- Azure Storage Account – this is required to house your compiled C# in its binary .DLL form. As you’ll see further down, this actually gets zipped up along with all its supporting packages. It would be nice if the ABS allowed access to the underlying OS storage for this, but no such luck I’m afraid.
At this point, if you’re doing this for the first time, you’ll probably be thinking the same as me… Why on earth do I need all this extra engineering? What are these additional services going to cost? And why can I not simply inline my C# in the ADF JSON pipeline and have it handle the execution?
Well, I have put these very questions to the Microsoft Azure Research team and the Microsoft Tiger Team. The only rational answer is to keep ADF as a dumb orchestrator that simply runs other services. Which would be fine if it didn’t need this extensibility to do such simple things. This then leads into the argument about whether ADF is really designed for data transformation. Should it just be for the E and L, not the T?
Let’s bottle up these frustrations for another day before this blog post turns into a rant!
C# Class Library
Moving on. Now, for those of you who have read my posts before, you’ll know that I don’t claim to be a C# expert. Well, today is no exception! Expect fluffy descriptions in the next bit 🙂
First, in your class library project, let’s add the NuGet packages and references you’ll need for the library to work with ADF. Using the Package Manager Console (Visual Studio > Tools > NuGet Package Manager > Package Manager Console), run the following installation lines to add all the required references.
Install-Package Microsoft.Azure.Management.DataFactories
Install-Package WindowsAzure.Storage
Next, the fun bit. Whatever class name you decide to use, it will need to implement IDotNetActivity, which is the interface used at runtime by ADF. Within your new class you then need to implement its Execute method, which returns an IDictionary of strings. It is this method that will be run by the ABS when called from ADF.
Within the Execute method, extended properties and details about the datasets and linked services on each side of the custom activity pipeline can be accessed. Here is the minimum you’ll need to connect the dots between ADF and your C#.
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

namespace ClassLibrary1
{
    public class Class1 : IDotNetActivity
    {
        public IDictionary<string, string> Execute(
            IEnumerable<LinkedService> linkedServices,
            IEnumerable<Dataset> datasets,
            Activity activity,
            IActivityLogger logger)
        {
            logger.Write("Start");

            //Get extended properties
            DotNetActivity dotNetActivityPipeline = (DotNetActivity)activity.TypeProperties;
            string sliceStartString = dotNetActivityPipeline.ExtendedProperties["SliceStart"];

            //Get the input and output dataset details
            Dataset inputDataset = datasets.Single(dataset => dataset.Name == activity.Inputs.Single().Name);
            Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name);

            /*
            DO STUFF
            */

            logger.Write("End");
            return new Dictionary<string, string>();
        }
    }
}
How you use the declared datasets will depend greatly on the linked services you have in and out of the pipeline. You’ll notice that I’ve also called IActivityLogger’s Write method to make user log entries; I’ll show you later where these get written to in the Azure portal.
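To give the “DO STUFF” section a little more substance, here’s a minimal sketch of how you might resolve the storage connection string and blob folder path behind the input dataset, assuming both sides of the pipeline are Azure Blob datasets bound to an Azure Storage linked service (AzureBlobDataset and AzureStorageLinkedService come from the Microsoft.Azure.Management.DataFactories.Models namespace already referenced above; if your linked services are something else, cast to the appropriate types instead). This drops straight into the Execute method after the dataset lookups.

//Resolve the storage linked service behind the input dataset to get its connection string.
AzureStorageLinkedService inputStorageService = linkedServices
    .First(ls => ls.Name == inputDataset.Properties.LinkedServiceName)
    .Properties.TypeProperties as AzureStorageLinkedService;
string connectionString = inputStorageService.ConnectionString;

//The blob folder path lives in the dataset's type properties.
AzureBlobDataset inputBlobDataset = inputDataset.Properties.TypeProperties as AzureBlobDataset;
string inputFolderPath = inputBlobDataset.FolderPath;

logger.Write("Reading from {0} for slice {1}", inputFolderPath, sliceStartString);

//From here you'd typically use the storage client (WindowsAzure.Storage) to read blobs
//under the input folder path and write results under the output folder path.

The output dataset works exactly the same way, just swapping inputDataset for outputDataset.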
I appreciate that the above code block isn’t actually doing anything and has probably just raised another load of questions. Patience, more blog posts are coming! Depending on what other Azure services you want your C# class to use, we’ll next have to think about registering it as an Azure Active Directory application so the compiled program can authenticate against other components. Sorry, but that’s for another time.
The last and most important thing to do here is add a reference to the C# class library in your ADF project. This is critical for a smooth deployment of the solution and the compiled C#.
Data Factory
Within your new or existing ADF project you’ll need to add a couple of things, specifically for the custom activity. I’m going to assume you have some datasets/data tables defined for the pipeline input and output.
Linked services first, corresponding to the services above that you should now have deployed in the Azure portal:
- Azure Batch Linked Service – I would like to say that when presented with the JSON template for the ABS, filling in the gaps is pretty intuitive even for the most non-technical peeps amongst us. However, the names and descriptions within the typeProperties component are wrong! Here’s my version below with corrections and elaborations on the standard Visual Studio template. Please extend your sympathies for the pain it took me to figure out where the values don’t match the attribute tags!
{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.LinkedService.json",
    "name": "AzureBatchLinkedService1",
    "properties": {
        "type": "AzureBatch",
        "typeProperties": {
            "accountName": "",
            //Fine - get it from the portal, under service properties.
            "accessKey": "",
            //Fine - get it from the portal, under service properties.
            "poolName": "",
            //WRONG - this actually needs to be the pool ID
            //that you defined when you deployed the service.
            //Using the pool name will error during deployment.
            "batchUri": "",
            //PARTLY WRONG - this does need to be the URI that you
            //get from the portal, but with the batch account name
            //excluded. So just something like https://northeurope.batch.azure.com
            //depending on your region.
            //With the full URI you'll get a message that the service can't be found!
            "linkedServiceName": ""
            //Fine - the storage linked service name as defined in your Data Factory.
            //Not the storage account name from the portal.
        }
    }
}
- Azure Storage Linked Service – the JSON template here is fine to trust. It only requires the connection string for your blob store, which can be retrieved from the Azure portal and inserted in full. Nice simple authentication.
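For completeness, this is roughly what the storage linked service ends up looking like once the connection string is pasted in (the account name and key are obviously placeholders for your own values):

{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.LinkedService.json",
    "name": "AzureStorageLinkedService1",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}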
Once we have the linked services in place, let’s add the pipeline. It’s worth noting that by pipeline I mean the ADF component that houses our activities; in this context a pipeline is not the entire ADF end-to-end solution, although many people incorrectly use it as a broad term for all things ADF.
- Dot Net Activity – here we need to give ADF all the bits it needs to go away and execute our C#, which again are defined in the typeProperties. Below is a JSON snippet of just the typeProperties block, which I’ve commented to go into more detail about each attribute (a fuller pipeline example follows after the snippet).
"typeProperties": {
"assemblyName": "",
//Once your C# class library has been built the DLL name will come from the name as the
//project in Visual Studio by default. You can also change this in the project properties
//if you wish.
"entryPoint": "",
//This needs to include the namespace as well as the class. Which is what the default is
//alluding to where the dot separation is used. Typically your namespace will be inheritated
//from the project default. You might override this to be the CS filename though so be careful.
"packageLinkedService": "",
//Just to the clear. Your storage account linked service name.
"packageFile": ""
//Here's the ZIP file. If you haven't already you'll need to create a container in your
//storage account under blobs. Reference that here. The ZIP filename will be the same
//as the DDL file name. Don't worry about where the ZIP files gets created just yet.
}
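And to put that snippet into context, here’s a sketch of what the whole pipeline definition might look like once the typeProperties are filled in. The dataset names, container, schedule and date range below are just illustrative placeholders based on my solution defaults, so swap in your own. Note the extendedProperties block; that’s where the SliceStart value read by the C# code earlier comes from.

{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
    "name": "PipelineTemplate",
    "properties": {
        "description": "Runs the compiled C# on the Azure Batch Service",
        "activities": [
            {
                "name": "DotNetActivityTemplate",
                "type": "DotNetActivity",
                "inputs": [ { "name": "InputDataset" } ],
                "outputs": [ { "name": "OutputDataset" } ],
                "linkedServiceName": "AzureBatchLinkedService1",
                "typeProperties": {
                    "assemblyName": "ClassLibrary1.dll",
                    "entryPoint": "ClassLibrary1.Class1",
                    "packageLinkedService": "AzureStorageLinkedService1",
                    "packageFile": "customactivitycontainer/ClassLibrary1.zip",
                    "extendedProperties": {
                        "SliceStart": "$$Text.Format('{0:yyyy-MM-dd}', SliceStart)"
                    }
                },
                "policy": {
                    "concurrency": 1,
                    "retry": 3,
                    "timeout": "01:00:00"
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                }
            }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2017-01-02T00:00:00Z"
    }
}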
By now you should have a solution that looks something like the solution explorer panel on the right. In mine I’ve kept all the default naming conventions for ease of understanding.
Deployment Time
If you have all the glue in place you can now right click on your ADF project and select Publish. This launches a wizard which takes you through the deployment process. Again, I’ve assumed here that you are logged into Visual Studio with the correct credentials for your Azure subscription. The wizard will guide you through where the ADF project is going to be deployed, validate the JSON content before sending it up, and detect whether files in the target ADF service can be deleted.
With the reference to the C# class library in place, the deployment wizard will detect the project dependency, zip up the compiled DLLs from your bin folder and upload them to the blob storage linked service referenced in the activity pipeline.
Sadly there is no local testing available for this lot and we just have to develop by trial/deploy/run and error.
Runtime
To help with debugging, if you go to the ADF Monitor & Manage area in the portal you should see your pipeline displayed. Clicking on the custom activity block will reveal the log files in the right hand panel. The first is the default system stack trace and the other is anything written out by the C# logger.Write call(s). These will become your new best friends when trying to figure out what isn’t working.
Of course, you don’t need to perform a full publish of the ADF project every time if you’re only developing the C# code. Simply build the solution and upload a new ZIP file to your blob storage account using something like Microsoft Azure Storage Explorer, then rerun the time slice for the output dataset.
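If you’d rather script that upload than click through Storage Explorer after every build, a rough console-app sketch like the one below does the job, using the WindowsAzure.Storage client referenced earlier. The connection string, container name, blob name and local path are all placeholders; match the container and blob name to whatever you put in the packageFile attribute.

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class UploadPackage
{
    static void Main()
    {
        //Placeholder connection string - the same one used by your storage linked service.
        CloudStorageAccount account = CloudStorageAccount.Parse("<storage connection string>");
        CloudBlobClient client = account.CreateCloudBlobClient();

        //Same container and blob name as referenced in the packageFile attribute.
        CloudBlobContainer container = client.GetContainerReference("customactivitycontainer");
        container.CreateIfNotExists();
        CloudBlockBlob blob = container.GetBlockBlobReference("ClassLibrary1.zip");

        //Overwrite the existing package with the freshly built ZIP.
        using (FileStream zip = File.OpenRead(@"C:\temp\ClassLibrary1.zip"))
        {
            blob.UploadFromStream(zip);
        }
    }
}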
If nothing appears to be happening, you may also want to check your ABS to ensure tasks are being created from ADF. If you haven’t assigned the compute pool any CPU cores, it will just sit there and your ADF pipeline activity will time out with no errors and no clues as to what might have gone wrong. Trust me, I’ve been there too.
I hope this post was helpful and gave you a steer as to the requirements for extending your existing ADF solutions with a .Net custom activity.
Many thanks for reading.
Thanks for this post, it was really helpful!
Hello Paul,
Thanks for a great post.
I want to extract data for two entities/tables. Do I need to create two separate classes, one for each (customer & vendor)?
Let me know
Thanks
Nutan
Thanks for sharing!
Would be great if you could post the code of your pipeline. I am wondering what to put in the input and output datasets (which in my case are nothing).
Hi Isaac, thanks for your comment. For the pipeline I would suggest just using the template for a ‘DotNetActivity’ available in the Visual Studio project. For the datasets, it’s perfectly acceptable not to have a real input or output – they can be fake placeholders used purely for dependency handling. Cheers
I am using VS2017, which has no Data Factory template. Also I am creating the custom activity to move data from an SFTP location to Azure Data Lake Store. In this case, where will we store the activity DLL in .zip format?
Hi Pawan, thanks for your comment. For now you’ll need to use Visual Studio 2015. The tooling for VS2017 is coming soon, but it has been delayed due to other ADF service developments (which is all I can say publicly). In your solution once you have the ADF project with a reference to the class library that is where the DLL will come from. Cheers
Thanks Paul for your suggestions. One question: can we pass the output dataset from one activity to another custom activity?
For my case I need to skip the files already processed from Data Lake Store. I am using a U-SQL script to insert records into a U-SQL managed table from a Data Lake Store folder. So I am trying to get the list of file names which have already been processed by U-SQL and send this list to the .Net custom activity, and the .Net custom activity will move those files to a different location/rename them.
In the custom activity I would suggest querying the ADL store directory directly to get the list you need. The input dataset could just be a placeholder including the path and nothing else.
Thanks
Thanks Paul. At which folder level will we store the .zip file? Also, I am now going to use VS2015, so if I deploy from VS2015, where will it store the .zip binaries?
I have been trying to move data from an on-premises folder to Data Lake storage. Using the wizard was OK, but it is impossible to schedule a job to do this automatically every day.
The complexity of setting up a pipeline and waiting until it completes is “baffling”.
Unfortunately I am not able to create simple tasks such as copying files with this tool and productivity is really poor. In my opinion it is a really complex tool and hard to understand if you compare it with a regular SQL Server Agent job.
I can see the big number of possibilities for scheduling different recurrent jobs, “only if you are able to configure pipelines”.
Do you recommend another tool/technique avoiding ADF? E.g. Azure Automation (here I don’t know if it is possible to upload information to Azure SQL), but I was able to execute Azure Data Lake procedures and different types of task with PowerShell.
Thanks
Hi, thanks for your comments. I do empathise that ADF can seem baffling and frustrating when you start using it for the first time. However, your requirements certainly aren’t impossible. I would recommend you avoid using the wizard. As with most wizards it hides lots of things that require some understanding before you can proceed. Good luck and don’t give up.
1. Compile the project…I get an error that DF (in VS15) cannot find the .zip file.
2. Ok, so I try to publish the pipeline and it compiles the project…and errors.
I cannot get it to make the .zip file to save my life. Which DLLs should be included in the zip file? My hope is that by making the zip file the first time, I can get past this error and move on to bigger and better things…at least that is the hope.
Thanks, David
Hi David, thanks for your comment. It sounds like you just need to reference your class library project in your ADF project. Cheers Paul
How are you authenticating? I am trying to run code to manage an Azure Data Lake Store but cannot figure out how to do so. I have tried a number of different things, but all end up with me a piece away from completing the puzzle…only to find out the puzzle never had that piece. (ex. https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/azure-resource-manager/resource-group-create-service-principal-portal.md#required-permissions…the .Net link is broken) I would GREATLY appreciate help in bridging this last gap.
Hi David, the ADF .Net activity requires its own AAD service principal to authenticate against ADL store. Please check out one of my other blog posts on it: https://www.purplefrog.ascendancydev3.co.uk/2016/12/azure-data-lake-authentication-from-azure-data-factory/ Cheers Paul
Do you have any sample code on how to create a custom connector for ADF? I need to pull some data from a file share and there is no connector available.
Hi Pierre, by custom connector do you mean a custom linked service? Thanks
Yes
Hi Paul,
I implemented your code. The ADF is created, no issues. My question is, do you have any sample code that shows how to consume the dataset in C#? My understanding of the code here is that I am just getting a reference to the datasets. How do I use them? They seem to all be Azure objects. I would like to create a console app to simulate the process. Am I being too ambitious?
Hi Pierre, thanks for the comment. Consuming the dataset will greatly depend on what you want to do with it. There is no quick win here, you need to write the C# to do what you need for any row level transforms that may be required. Yes, a console app would be a great way to test it. I often switch my C# projects between class libraries and console apps in the project properties for development. Cheers Paul
Hi Paul, this was my initial question: do you have any sample code that shows how to consume the Azure dataset in C#, from the part of your code where you say “do stuff”? Anything is appreciated. Thanks.
Hi Paul,
When I create my app it is defaulting to V1. I have VS2015; how do I force it to switch to V2? I ran the NuGet packages from Microsoft but with no success.