Unlike T-SQL, U-SQL does not currently support MERGE statements. Our friend that we have come to know and love since its introduction in SQL Server 2008. Not only that, but U-SQL also doesn’t currently support UPDATE statements either… I know… Open mouth emoji required! This immediately leads to the problem of change detection in our data and how, for example, we should handle the ingestion of a daily rolling 28-day TSV extract, requiring a complete year to date output. Well in this post we will solve that very problem.
Now before we move on it’s worth pointing out that U-SQL is now our/my language of choice for working with big data in Azure using the Data Lake Analytics service. It’s not yet the way of things for our on premises SQL Server databases, so relax. T-SQL or R are still our out-of-the-box tools there (SQL Server 2016). Also if you want to take a step back from this slightly deeper U-SQL topic and find out What is U-SQL first, I can recommend my purple amphibious colleagues blog, link below.
Assuming you’re comfortable with the U-SQL basics let’s move on. For the below examples I’m working with Azure Data Lake (ADL) Analytics, deployed in my Azure MSDN subscription. Although we can do everything here in the local Visual Studio emulator, without the cloud service (very cool and handy for development). I also have the Visual Studio Data Lake tools for the service available and installed. Specifically for this topic I have created a ‘U-SQL Sample Application’ project to get us started. This is simply for ease of explanation and so you can get most of the setup code for what I’m doing here without any great difficulty. Visual Studio Data Lake tools download link below if needed.
Once we have this solution available including its Ambulance and Search Log samples please find where your Visual Studio Cloud Explorer panel (Ctrl + \, Ctrl + X) is hiding as we’ll use this to access the local Data Lake Analytics database on your machine.
To get things rolling open the U-SQL file from the sample project called ‘SearchLog-4-CreatingTable’ and execute AKA ‘Submit’ this to run locally against your ADL Analytics instance. This gives us a database and target table to work with for the merge operation. It also inserts some tab separated sample data into the table. If we don’t insert this initial dataset you’ll find joining to an empty table will prove troublesome.
Now U-SQL is all about extracting and outputting at scale. There isn’t any syntax sugar to merge datasets. But do we really need the sugar? No. So, we are going to use a database table as our archive or holding area to ensure we get the desired ‘upsert’ behaviour. Then write our MERGE statement long hand using a series of conventional joins. Not syntax sugar. Just good old fashioned joining of datasets to get the old, the new and the changed.
Recap of the scenario; we have a daily rolling 28-day input, requiring a full year to date output.
Next open the U-SQL file from the sample project called ‘SearchLog-1-First_U-SQL_Script’. This is a reasonable template to adapt as it contains the EXTRACT and OUTPUT code blocks already.
For the MERGE we next need a set of three SELECT statements joining both the EXTRACT (new/changed data) and table (old data) together. These are as follows, in mostly English type syntax first 🙂
- For the UPDATE, we’ll do an INNER JOIN. Table to file. Taking fields from the EXTRACT only.
- For the INSERT, we’ll do a LEFT OUTER JOIN. File to table. Taking fields from the EXTRACT where NULL in the table.
- To retain old data, we’ll do a RIGHT OUTER JOIN. File to table. Taking fields from the table where NULL in the file.
Each of the three SELECT statements can then have UNION ALL conditions between them to form a complete dataset including any changed values, new values and old values loaded by a previous file. This is the code you’ll want to add for the example in your open file between the extract and output code blocks. Please don’t just copy and paste without understanding it.
@AllData = --update current SELECT e1.[UserId], e1.[Start], e1.[Region], e1.[Query], e1.[Duration], e1.[Urls], e1.[ClickedUrls] FROM [SearchLogDemo].[dbo].[SearchLog] AS t1 INNER JOIN @searchlog AS e1 ON t1.[UserId] == e1.[UserId] UNION ALL --insert new SELECT e2.[UserId], e2.[Start], e2.[Region], e2.[Query], e2.[Duration], e2.[Urls], e2.[ClickedUrls] FROM @searchlog AS e2 LEFT OUTER JOIN [SearchLogDemo].[dbo].[SearchLog] AS t2 ON t2.[UserId] == e2.[UserId] WHERE t2.[UserId] IS NULL UNION ALL --keep existing SELECT t3.[UserId], t3.[Start], t3.[Region], t3.[Query], t3.[Duration], t3.[Urls], t3.[ClickedUrls] FROM @searchlog AS e3 RIGHT OUTER JOIN [SearchLogDemo].[dbo].[SearchLog] AS t3 ON t3.[UserId] == e3.[UserId] WHERE e3.[UserId] IS NULL;
This union of data can then OUTPUT to our usable destination doing what U-SQL does well before resetting our ADL Analytics table for the next load. By reset, I mean TRUNCATE the table and INSERT everything from @AllData back into it. This preserves our history/our old data and allows the MERGE behaviour to work again and again using only SELECT statements.
Replacing the OUTPUT variable from @searchlog, you’ll then want to add the following code below the three SELECT statements.
OUTPUT @AllData TO "/output/SearchLogAllData.csv" USING Outputters.Csv(); TRUNCATE TABLE [SearchLogDemo].[dbo].[SearchLog]; INSERT INTO [SearchLogDemo].[dbo].[SearchLog] ( [UserId], [Start], [Region], [Query], [Duration], [Urls], [ClickedUrls] ) SELECT [UserId], [Start], [Region], [Query], [Duration], [Urls], [ClickedUrls] FROM @AllData;
If all goes well you can edit the ‘SearchLog.tsv’ file removing and changing data and keep rerunning the U-SQL script performing the MERGE behaviour. Please test away. Don’t just believe me that it works. As a bonus you get this pretty job diagram too…
The only caveat here is that we can’t deal with deletion detection from the source file… Unless we do something a little more complex for the current loading period. Lets save that for a later blog post.
A couple of follow up general tips.
- Have a USE condition at the top of your scripts to ensure you hit the correct database. Just like T-SQL.
- If your struggling for fields to join on as you don’t have a primary key. You could use UNION instead of UNION ALL. But this of course takes more effort to work out the distinct values. Just like T-SQL.
- Be careful with C# data types and case sensitivity. U-SQL is not as casual as T-SQL with such things.
That’s it. U-SQL merge behaviour achieved. I guess the bigger lesson here for the many T-SQL people out there is; don’t forget the basics, its still a structured query language. Syntax sugar is sweet, but not essential.
Hope this was helpful.
Many thanks for reading