Let's try to keep this post short and sweet. Diving right in, imagine a scenario where we have an Azure Data Factory (ADF) pipeline that includes activities to perform U-SQL jobs in Azure Data Lake (ADL) Analytics. We want to control the U-SQL by passing it the ADF time slice value, hopefully a fairly common use case. This isn't yet that intuitive when constructing the ADF JSON activity, so I hope this post will save you some debugging time.
For my example I’ve created a stored procedure in my ADL Analytics database that accepts a parameter @TimeSliceStart as a string value in the format yyyyMMdd.
[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);
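For reference, here is a minimal sketch of what such a procedure might look like; the source table, column and output path are purely illustrative:

CREATE PROCEDURE IF NOT EXISTS [ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart string)
AS
BEGIN
    // Filter a hypothetical table on the slice value passed in from ADF.
    @result =
        SELECT *
        FROM [ExampleDatabase].[dbo].[SomeTable]
        WHERE [LoadDateKey] == @TimeSliceStart;

    OUTPUT @result
    TO "/output/DoSomeStuff.csv"
    USING Outputters.Csv();
END;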
This doesn't have to be a stored procedure. ADF is also happy if you give it U-SQL files, or even if you inline the entire script. Regardless, the ADF parameter handling is the same.
In my ADF JSON activity I then have the following:
{
    "name": "DataLakeJob1",
    "properties": {
        "description": "Run USQL with timeslice param",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "script": "[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);",
                    "degreeOfParallelism": 5,
                    "priority": 1,
                    "parameters": {
                        "TimeSliceStart": "$$Text.Format('{0:yyyyMMdd}', Time.AddMinutes(SliceStart, 0))"
                    }
                }
                // etc ...
            }
        ]
        // etc ...
    }
}
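As an aside, the Time.AddMinutes wrapper in the expression above is where you can offset the slice value if needed. For example, to pass the previous day's date instead (the same expression syntax, just an illustrative offset):

"TimeSliceStart": "$$Text.Format('{0:yyyyMMdd}', Time.AddMinutes(SliceStart, -1440))"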
Notice that within the type properties we have a parameters attribute, whose children are the actual variables we want to pass to the U-SQL script.
Here are the important things to understand about this ADF parameters attribute.
- The name of the parameter must match the name of the variable expected by U-SQL exactly.
- As you would expect, the data types of the expected variable and the JSON parameter must match.
- It is perfectly acceptable to have multiple parameters in the ADF JSON, and they can be written in any order.
So how does this work?…
What ADF does when calling ADL is take the parameters listed in the JSON and write out a series of U-SQL 'DECLARE @Variable' lines. These get added to the top of the actual U-SQL script before it is handed to ADL as a job to run. You can see this if you go into the ADL Analytics blades in the Azure portal, select the job created by the ADF activity, then choose Duplicate Script. This reveals the actual U-SQL used in the job.
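For the example above, the duplicated script looks something like the sketch below; the actual date value will of course be whatever the slice start was at runtime.

DECLARE @TimeSliceStart string = "20170201"; // generated by ADF from the JSON parameter
[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);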
Just knowing what ADF does when converting the JSON parameters to U-SQL declared variables is the main takeaway here.
That’s it! I promised short and sweet.
Many thanks for reading.
One remark:
In order to use the parameters during development of the script, you can declare them as EXTERNAL. In your script you could add a line such as:
DECLARE EXTERNAL @TimeSliceStart string = "20161101";
Then during development you can test the dataset and the use of the parameter in the script. When Data Factory runs, it will add its own DECLARE @TimeSliceStart string line, which overrides the DECLARE EXTERNAL.
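Put together, the submitted job then looks something like this sketch (values illustrative):

DECLARE @TimeSliceStart string = "20170201"; // injected by ADF at the top of the script
DECLARE EXTERNAL @TimeSliceStart string = "20161101"; // development default, now skipped
[ExampleDatabase].[dbo].[usp_DoSomeStuff](@TimeSliceStart);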
And one question:
You say that the data types of the parameters should match, but have you found a way to pass parameters as integers? I tried this but got a Data Factory error saying that the value of the parameter had to be a string. I solved this by passing in a string and parsing it into another variable that was an integer in U-SQL.
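As a sketch of that workaround, with illustrative names:

DECLARE EXTERNAL @RowLimitString string = "100"; // passed from ADF as a string
DECLARE @RowLimit int = int.Parse(@RowLimitString); // converted for use in the script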
Thanks for this - I was tearing my hair out digging through the ADLA .NET SDK trying to figure out how Data Factory did it. Constantly running into "To be added." in the documentation doesn't help.
Hi Samara, thanks for your comment 🙂 Always good to know that people are finding the content useful.
Hi Paul,
Can the parameter value be substituted from within a config? I tried it and it doesn't substitute.
I use .properties.activities[0].typeProperties.parameters.paramName.
Never mind 🙂 Silly me. I made a typo in the parameter's name.
Thanks for your nice blog. I have tried to create a U-SQL script which includes referenced assemblies and C# code behind, and then execute it from ADF as you have described, but I keep getting the following error.
E_CSC_USER_DDLENTITYDOESNOTEXIST
Message: Procedure 'XXXXX.dbo.scriptxx' does not exist.
Where should the U-SQL script and other files be located? Or do you have any other ideas?
Thanks
Jakob
Hi Jakob, thanks for your comment. Yes, when calling U-SQL scripts from ADF that require C# methods you'll need to build them separately and add the DLLs as referenced assemblies. This is because ADF can only pass the U-SQL to ADL, and not its built code behind file. I think this is a problem with the service, but it can be worked around. I'll probably write a separate blog post about this, given you've mentioned it. Thanks again.
Hi Paul,
Thanks for your fast feedback. I did suspect the issue was with the code behind, but still no success. Did you do anything special when uploading the initial procedure usp_DoSomeStuff? I did all my uploading and development using VS2015.
Thanks
Jakob
Hi Jakob, nothing special. But if C# is required I compile it in Visual Studio as a class library in a separate project, but in the same solution as my Data Factory and Data Lake projects. Then I upload the DLL into blob storage and create the assembly in Data Lake Analytics. Finally, the procedure just needs a reference to the assembly in order to access the class methods.
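As a rough sketch of those last two steps, assuming the DLL has already been uploaded somewhere ADLA can read (all names and paths here are illustrative):

USE DATABASE [ExampleDatabase];

// Register the compiled class library as a database assembly.
CREATE ASSEMBLY IF NOT EXISTS [MyHelpers]
FROM @"/Assemblies/MyHelpers.dll";

The script or procedure then just references it with:

REFERENCE ASSEMBLY [ExampleDatabase].[MyHelpers];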
Hi Paul,
I got it. All I needed was to create my script as a procedure, using CREATE PROCEDURE. Now it works.
Thanks again
Jakob
I can't pass a character as a parameter, as it gives the error "All strings must be enclosed in double quotes (")".
Hi Kityy, thanks for your comment. Yes, you will need to be careful with the JSON injection into the USQL. I’d suggest checking the ADL job to see what ADF has passed for execution.
Hi Paul,
we need to constantly perform calculations based on the input data we are receiving (refresh intervals of ~5-10 minutes). The jobs themselves aren't extremely complicated (just a couple of aggregates and calculations of timespans) but they need to run pretty frequently as they are used for monitoring industrial production.
Do you think Data Factory along with U-SQL-Jobs is the right tool for the job?
Analytics-wise, U-SQL offers all we need, but I'm a bit concerned about the scheduling part.
Data Factory has a limitation of a minimum frequency of 15 minutes for running jobs (according to https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits#data-factory-limits).
The only other scheduling option I'm aware of is Azure Scheduler along with custom code to execute the jobs.
Do you have any suggestions for us?
Thank you very much!
Hi Markus, thanks for your comments. You are correct about the current frequency limit in Azure Data Factory. However, if you require something more frequent then it's certainly possible to have multiple activities within a single pipeline in ADF, each calling the same U-SQL with time offsets to achieve the required throughput. I'd be happy to discuss and advise on the ADF design. Feel free to email sales@purplefrogsystems.com. Many thanks
How can I avoid already-processed files in U-SQL?
Hi Pawan, thanks for the question. I would suggest you pass the ADF time slice to your U-SQL script, then include the value as part of your query's WHERE clause. Check out this post as a starting point: https://www.purplefrog.ascendancydev3.co.uk/2017/02/passing-parameters-to-u-sql-from-azure-data-factory/
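For example, here is a sketch of that filtering approach using a date-partitioned file set; the paths, columns and file format are illustrative assumptions:

// Development default; overridden by the DECLARE that ADF injects.
DECLARE EXTERNAL @TimeSliceStart string = "20170201";
DECLARE @sliceDate DateTime = DateTime.ParseExact(@TimeSliceStart, "yyyyMMdd", System.Globalization.CultureInfo.InvariantCulture);

// The {FileDate:...} tokens expose each file's path date as a virtual column.
@input =
    EXTRACT SomeValue string,
            FileDate DateTime
    FROM "/input/{FileDate:yyyy}/{FileDate:MM}/{FileDate:dd}/data.csv"
    USING Extractors.Csv();

// Only rows (and files) belonging to the current slice are processed.
@newRows =
    SELECT SomeValue
    FROM @input
    WHERE FileDate == @sliceDate;

OUTPUT @newRows TO "/output/slice.csv" USING Outputters.Csv();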
Hello Paul,
Can you please advise how we can move the file after processing it through U-SQL? I am thinking of calling C# code in U-SQL after all processing to move the file to a processed folder.
Please advise
Hi Gaurav
Thanks for your comments. Depending on your requirements I’d suggest using Azure Data Factory.
Cheers
Paul
Thanks for your response.
Does that mean I use ADF to call the U-SQL script? And after processing, how can I move the file?
Please suggest.
You need two ADF activities: the first to execute the U-SQL, then the second to perform the data movement.
Thanks
Hi Paul,
I am also having the same sort of concern. I am using ADF to call a U-SQL script for processing files. After processing, how can I move the file from ADLS to Blob Storage depending on some custom logic?
Thanks,
Suchismita
Hi Suchismita
Thanks for your comment. As mentioned, I suggest using Azure Data Factory. This can handle the execution of the U-SQL and then, with a downstream activity, the movement of the dataset to other storage services.
Kind regards
Paul
If I have two activities, one to execute the U-SQL script and a second to move the same file after processing, it could be possible that the second activity moves the file before the first activity has processed it with U-SQL.
We are using ADF V1; how can we ensure that the first activity with the U-SQL completes before the second activity moves the file?
Thanks,
Gaurav sharma
Hi Gaurav, please refer to my email sent on 28th Sept. Many thanks
Hi Paul,
This is very useful for me. I have one question: when a U-SQL activity is executed from ADF, ADLA creates a job with a default name (e.g. ADF-9huu06-abcd) as shown above. I would like to know if we can specify a custom name, so that it would be easier to identify the job. I hope my question is clear. Many thanks.
Hi Prabhakar, thanks for the comment. Yes, that makes sense. As I understand it, Microsoft are already aware of this request and it is in the backlog of changes that can be voted for via the feedback website. Cheers
Thank you a lot Paul.
Hello, thanks for this good article!
Do you know if it's possible to do the same with the .NET SDK?
Thanks.
Hi Valentin, thanks for your comment. Great question. I don’t know off hand. Try it! Cheers Paul