As I work with Azure Data Factory (ADF) and help others in the community more and more, I encounter some confusion surrounding how to construct a complete, dependency driven ADF solution: one that chains multiple executions and handles all of your requirements. In this post I hope to address some of that confusion and allude to some emerging best practices for Azure Data Factory usage.
First, a few simple questions:
- Why is there confusion? In my opinion this is because the ADF copy wizard available via the Azure portal doesn’t help you architect a complete solution. It can be handy to reverse engineer certain things, but really the wizard tells you nothing about the choices you make and what the JSON behind it is doing. Like most wizards, it just leads to bad practices!
- Do I need several data factory services for different business functions? No, you don’t have to. Pipelines within a single data factory service can be disconnected for different processes, and having all your linked services in one place is often easier to manage. Plus a single factory offers reusability and means a single set of source code etc.
- Do I need one pipeline per activity? No, you can house many activities in a single pipeline. Pipelines are just logic containers to assist you when managing data orchestration tasks. If you want an SSIS comparison, think of them as sequence containers. In a factory I may group all my on premises gateway uploads into a single pipeline. This means I can pause that stream of uploads on demand, for example when the gateway keys need to be refreshed.
- Is the whole data factory a pipeline? Yes, in concept. But for technical terminology a pipeline is a specific ADF component. The marketing people do love to confuse us!
- Can an activity support multiple inputs and multiple outputs? Generally yes. But there are exceptions depending on the activity type. U-SQL calls to Azure Data Lake can have multiples of both. ADF doesn’t care as long as you know what the called service is doing. On the other hand a copy activity needs to be one to one (so Microsoft can charge more for data movements).
- Does an activity have to have an input dataset? No. For example, you can create a custom activity that executes your code for a defined time slice without an input dataset, just the output.
Moving on, let’s go a little deeper and think about a scenario that I use in my community talks. We have an on premises CSV file. We want to upload it, clean it and aggregate the output. For each stage of this process we need to define a dataset for Azure Data Factory to use.
To be clear, a dataset in this context is not the actual data. It is just a set of JSON instructions that defines where and how our data is stored. For example, its file path, its extension, its structure, its relationship to the executing time slice.
Let’s define each of the datasets we need in ADF to complete the above scenario for just 1 file:
- The on premises version of the file. Linked to information about the data management gateway to be used, with local credentials and file server/path where it can be accessed.
- A raw Azure version of the file. Linked to information about the data lake storage folder to be used for landing the uploaded file.
- A clean version of the file. Linked to information about the output directory of the cleaning process.
- The aggregated output file. Linked to information about the output directory of the query being used to do the aggregation.
All of the information linked to these datasets should come from your ADF linked services.
So, we have 1 file to process, but in ADF we now need 4 datasets defined, one for each stage of the data flow. These datasets don’t need to be complex, something as simple as the following bit of JSON will do.
"structure": [ ],
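To make the four datasets a little more concrete, here is a minimal sketch that builds them as Python dictionaries and prints one as JSON. The property names (type, linkedServiceName, typeProperties, external, availability) follow the ADF version 1 dataset JSON as I understand it, so check them against the current documentation. Every dataset name, linked service name and folder path is a hypothetical placeholder, not taken from my real factory.

```python
import json


def make_dataset(name, dataset_type, linked_service, folder_path, external=False):
    """Build a minimal ADF (version 1 style) dataset definition as a dict."""
    return {
        "name": name,
        "properties": {
            "structure": [],  # left empty: the column list is optional here
            "type": dataset_type,
            "linkedServiceName": linked_service,
            "typeProperties": {"folderPath": folder_path},
            # external=True marks data that ADF itself does not produce
            "external": external,
            # one time slice per day (an assumption for this sketch)
            "availability": {"frequency": "Day", "interval": 1},
        },
    }


# All names, linked services and paths below are hypothetical placeholders.
datasets = [
    make_dataset("OnPremFile", "FileShare", "OnPremFileServer", "uploads", external=True),
    make_dataset("RawFile", "AzureDataLakeStore", "DataLakeStore", "raw"),
    make_dataset("CleanFile", "AzureDataLakeStore", "DataLakeStore", "clean"),
    make_dataset("AggregatedFile", "AzureDataLakeStore", "DataLakeStore", "aggregated"),
]

# Print the first definition in the JSON shape ADF would expect to receive.
print(json.dumps(datasets[0], indent=4))
```

Only the on premises dataset is flagged as external, because it is the only one that arrives from outside the factory. The other three are produced by activities.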
Next, our activities. Now that the datasets are defined, we need ADF to invoke the services that are going to do the work for each stage, as follows:
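|Activity task |Input dataset |Output dataset
|Upload file from local storage to Data Lake storage. |On premises file |Raw Azure file
|Perform transformation/cleaning on raw source file. |Raw Azure file |Clean file
|Aggregate the datasets to produce a reporting output. |Clean file |Aggregated output file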
From the above table we can clearly see the output dataset of the first activity becomes the input of the second. The output dataset of the second activity becomes the input of the third. Apologies if this seems obvious, but I have known it to confuse people.
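The chaining rule can be sketched as a small check: walk the activities in order and confirm each one reads something the previous one wrote. This is only an illustration of the idea, not something ADF asks you to run. The activity and dataset names are hypothetical, and the activity types for cleaning and aggregation are my assumption for the sketch (a custom .NET activity and a U-SQL call).

```python
# Hypothetical activities for the three stages. "inputs" and "outputs" use
# the same list-of-names shape as ADF version 1 activity JSON.
activities = [
    {
        "name": "Upload",
        "type": "Copy",
        "inputs": [{"name": "OnPremFile"}],
        "outputs": [{"name": "RawFile"}],
    },
    {
        "name": "Clean",
        "type": "DotNetActivity",  # assumed type for the cleaning step
        "inputs": [{"name": "RawFile"}],
        "outputs": [{"name": "CleanFile"}],
    },
    {
        "name": "Aggregate",
        "type": "DataLakeAnalyticsU-SQL",  # assumed type for the aggregation
        "inputs": [{"name": "CleanFile"}],
        "outputs": [{"name": "AggregatedFile"}],
    },
]


def chain_breaks(ordered_activities):
    """Return (producer, consumer) name pairs where the chain is broken.

    A pair is broken when the consumer reads none of the producer's outputs.
    """
    breaks = []
    for producer, consumer in zip(ordered_activities, ordered_activities[1:]):
        produced = {dataset["name"] for dataset in producer["outputs"]}
        consumed = {dataset["name"] for dataset in consumer["inputs"]}
        if not produced & consumed:
            breaks.append((producer["name"], consumer["name"]))
    return breaks


# An empty list means every output feeds the next activity's input.
print(chain_breaks(activities))
```

If you drop the cleaning activity from the list, the check reports the gap between the upload and the aggregation, which is exactly the kind of broken dependency that leaves a time slice waiting forever.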
For our ADF pipeline(s) we can now make some decisions about how we want to manage the data flow.
- Add all the activities to a single pipeline, meaning we can stop/start everything for this 1 dataset end to end.
- Add each activity to a different pipeline dependent on its type. This is my starting preference.
- Have the on premises upload in one pipeline and everything else in a second pipeline.
- Maybe separate your pipelines and data flows depending on the type of data. For example, fact/dimension, or Finance and HR.
The point here is that it doesn’t matter to ADF, it’s just down to how you want to control it. When I created the pipelines for my talk demo I went with option 2. Meaning I get the following pretty diagram, arranged to fit the width of my blog 🙂
Here we can clearly see at the top level each dataset flowing into a pipeline and its child activity. If I’d constructed this using option 1 above I would simply see the first dataset and the fourth with 1 pipeline box. I could then drill into the pipeline to see the chained activities within. To repeat, this doesn’t matter to ADF.
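The difference between option 1 and option 2 is only a matter of grouping, which a short sketch can show. The helper names, pipeline names and the two non-copy activity types are hypothetical choices for this example. A real version 1 pipeline definition also needs things like start and end dates, which I have left out to keep the focus on the grouping.

```python
from collections import defaultdict

# Hypothetical activities for the three stages of the scenario.
activities = [
    {"name": "Upload", "type": "Copy"},
    {"name": "Clean", "type": "DotNetActivity"},
    {"name": "Aggregate", "type": "DataLakeAnalyticsU-SQL"},
]


def make_pipeline(name, members, paused=False):
    """Wrap a list of activities in a minimal pipeline container."""
    return {"name": name, "properties": {"activities": members, "isPaused": paused}}


def single_pipeline(all_activities):
    """Option 1: everything end to end in one pipeline."""
    return [make_pipeline("EndToEnd", list(all_activities))]


def pipelines_by_type(all_activities):
    """Option 2: one pipeline per activity type."""
    groups = defaultdict(list)
    for activity in all_activities:
        groups[activity["type"]].append(activity)
    return [make_pipeline(f"{kind}Pipeline", members) for kind, members in groups.items()]


option_1 = single_pipeline(activities)
option_2 = pipelines_by_type(activities)

print([pipeline["name"] for pipeline in option_1])
print([pipeline["name"] for pipeline in option_2])
```

Either way the same three activities run against the same four datasets. The grouping only changes what you can pause together and what you see at the top level of the diagram.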
I hope you found the above useful and a good starting point for constructing your ADF data flows.
As our understanding of Azure Data Factory matures I’m sure some of the following points will need to be rewritten, but for now I’m happy to go first and start laying the groundwork of what I consider to be best practice for ADF usage. Comments very welcome.
- Resist using the wizard, please.
- Keep everything within a single ADF service if you can. Meaning linked services can be reused.
- Disconnect your on premises uploads using a single pipeline. For ease of management.
- Group your activities into natural pipeline containers for the operation type or data category.
- Lay out your ADF diagram carefully. Left to right. It makes understanding it much easier for others.
- Use Visual Studio configuration files to deploy ADF projects between Dev/Test/Live. Ease of source control and development.
- Monitor activity concurrency and timeouts carefully. ADF will kill called service executions if breached.
- Be mindful of activity cost and group inputs/outputs for data compute where possible.
- Use time slices to control your data volumes. E.g. pass the time slice as a parameter to the called compute service.
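As a rough illustration of the last few points, here is a sketch of a U-SQL activity that sets a concurrency and timeout policy and passes the time slice to the script as a parameter. The property names and the $$Text.Format expression follow the ADF version 1 JSON as I understand it, so please verify them before relying on this. The activity name, script path, linked service names and the sliceDate parameter are all hypothetical. The small resolve helper only mimics locally what the expression would produce for one slice.

```python
import json
from datetime import datetime


def usql_activity(name, script_path, input_dataset, output_dataset):
    """Build a minimal U-SQL activity definition (version 1 style) as a dict."""
    return {
        "name": name,
        "type": "DataLakeAnalyticsU-SQL",
        "linkedServiceName": "DataLakeAnalytics",  # hypothetical linked service
        "inputs": [{"name": input_dataset}],
        "outputs": [{"name": output_dataset}],
        "typeProperties": {
            "scriptPath": script_path,
            "scriptLinkedService": "DataLakeStore",  # hypothetical linked service
            "parameters": {
                # hypothetical parameter: the start of the executing time slice
                "sliceDate": "$$Text.Format('{0:yyyy-MM-dd}', SliceStart)",
            },
        },
        # one slice at a time, given up on after an hour, retried once
        "policy": {"concurrency": 1, "timeout": "01:00:00", "retry": 1},
    }


def resolve_slice_date(slice_start):
    """Mimic the value the expression above would yield for one slice."""
    return slice_start.strftime("%Y-%m-%d")


activity = usql_activity("Aggregate", "scripts/aggregate.usql", "CleanFile", "AggregatedFile")
print(json.dumps(activity["policy"]))
print(resolve_slice_date(datetime(2017, 1, 31)))
```

Because the script only receives one slice worth of data at a time, the volume each execution has to handle stays predictable, which also makes the timeout easier to set sensibly.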
What next? Well, I’m currently working on this beast…
- 127x datasets.
- 71x activities.
- 9x pipelines.
… and I’ve got about another third left to build!
Many thanks for reading.