Unlike T-SQL, U-SQL does not currently support MERGE statements. Our friend that we have come to know and love since its introduction in SQL Server 2008. Not only that, but U-SQL also doesn’t currently support UPDATE statements either… I know… Open mouth emoji required! This immediately leads to the problem of change detection in our data and how, for example, we should handle the ingestion of a daily rolling 28-day TSV extract, requiring a complete year to date output. Well in this post we will solve that very problem.
Now before we move on it’s worth pointing out that U-SQL is now our/my language of choice for working with big data in Azure using the Data Lake Analytics service. It’s not yet the way of things for our on premises SQL Server databases, so relax. T-SQL or R are still our out-of-the-box tools there (SQL Server 2016). Also if you want to take a step back from this slightly deeper U-SQL topic and find out What is U-SQL first, I can recommend my purple amphibious colleagues blog, link below.
Assuming you’re comfortable with the U-SQL basics let’s move on. For the below examples I’m working with Azure Data Lake (ADL) Analytics, deployed in my Azure MSDN subscription. Although we can do everything here in the local Visual Studio emulator, without the cloud service (very cool and handy for development). I also have the Visual Studio Data Lake tools for the service available and installed. Specifically for this topic I have created a ‘U-SQL Sample Application’ project to get us started. This is simply for ease of explanation and so you can get most of the setup code for what I’m doing here without any great difficulty. Visual Studio Data Lake tools download link below if needed.
Once we have this solution available including its Ambulance and Search Log samples please find where your Visual Studio Cloud Explorer panel (Ctrl + \, Ctrl + X) is hiding as we’ll use this to access the local Data Lake Analytics database on your machine.
To get things rolling open the U-SQL file from the sample project called ‘SearchLog-4-CreatingTable’ and execute AKA ‘Submit’ this to run locally against your ADL Analytics instance. This gives us a database and target table to work with for the merge operation. It also inserts some tab separated sample data into the table. If we don’t insert this initial dataset you’ll find joining to an empty table will prove troublesome.
Now U-SQL is all about extracting and outputting at scale. There isn’t any syntax sugar to merge datasets. But do we really need the sugar? No. So, we are going to use a database table as our archive or holding area to ensure we get the desired ‘upsert’ behaviour. Then write our MERGE statement long hand using a series of conventional joins. Not syntax sugar. Just good old fashioned joining of datasets to get the old, the new and the changed.
Recap of the scenario; we have a daily rolling 28-day input, requiring a full year to date output.
Next open the U-SQL file from the sample project called ‘SearchLog-1-First_U-SQL_Script’. This is a reasonable template to adapt as it contains the EXTRACT and OUTPUT code blocks already.
For the MERGE we next need a set of three SELECT statements joining both the EXTRACT (new/changed data) and table (old data) together. These are as follows, in mostly English type syntax first 🙂
- For the UPDATE, we’ll do an INNER JOIN. Table to file. Taking fields from the EXTRACT only.
- For the INSERT, we’ll do a LEFT OUTER JOIN. File to table. Taking fields from the EXTRACT where NULL in the table.
- To retain old data, we’ll do a RIGHT OUTER JOIN. File to table. Taking fields from the table where NULL in the file.
Each of the three SELECT statements can then have UNION ALL conditions between them to form a complete dataset including any changed values, new values and old values loaded by a previous file. This is the code you’ll want to add for the example in your open file between the extract and output code blocks. Please don’t just copy and paste without understanding it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
@AllData = --update current SELECT e1.[UserId], e1.[START], e1.[Region], e1.[Query], e1.[Duration], e1.[Urls], e1.[ClickedUrls] FROM [SearchLogDemo].[dbo].[SearchLog] AS t1 INNER JOIN @searchlog AS e1 ON t1.[UserId] == e1.[UserId] UNION ALL --insert new SELECT e2.[UserId], e2.[START], e2.[Region], e2.[Query], e2.[Duration], e2.[Urls], e2.[ClickedUrls] FROM @searchlog AS e2 LEFT OUTER JOIN [SearchLogDemo].[dbo].[SearchLog] AS t2 ON t2.[UserId] == e2.[UserId] WHERE t2.[UserId] IS NULL UNION ALL --keep existing SELECT t3.[UserId], t3.[START], t3.[Region], t3.[Query], t3.[Duration], t3.[Urls], t3.[ClickedUrls] FROM @searchlog AS e3 RIGHT OUTER JOIN [SearchLogDemo].[dbo].[SearchLog] AS t3 ON t3.[UserId] == e3.[UserId] WHERE e3.[UserId] IS NULL;
This union of data can then OUTPUT to our usable destination doing what U-SQL does well before resetting our ADL Analytics table for the next load. By reset, I mean TRUNCATE the table and INSERT everything from @AllData back into it. This preserves our history/our old data and allows the MERGE behaviour to work again and again using only SELECT statements.
Replacing the OUTPUT variable from @searchlog, you’ll then want to add the following code below the three SELECT statements.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
OUTPUT @AllData TO "/output/SearchLogAllData.csv" USING Outputters.Csv(); TRUNCATE TABLE [SearchLogDemo].[dbo].[SearchLog]; INSERT INTO [SearchLogDemo].[dbo].[SearchLog] ( [UserId], [START], [Region], [Query], [Duration], [Urls], [ClickedUrls] ) SELECT [UserId], [START], [Region], [Query], [Duration], [Urls], [ClickedUrls] FROM @AllData;
If all goes well you can edit the ‘SearchLog.tsv’ file removing and changing data and keep rerunning the U-SQL script performing the MERGE behaviour. Please test away. Don’t just believe me that it works. As a bonus you get this pretty job diagram too…
The only caveat here is that we can’t deal with deletion detection from the source file… Unless we do something a little more complex for the current loading period. Lets save that for a later blog post.
A couple of follow up general tips.
- Have a USE condition at the top of your scripts to ensure you hit the correct database. Just like T-SQL.
- If your struggling for fields to join on as you don’t have a primary key. You could use UNION instead of UNION ALL. But this of course takes more effort to work out the distinct values. Just like T-SQL.
- Be careful with C# data types and case sensitivity. U-SQL is not as casual as T-SQL with such things.
That’s it. U-SQL merge behaviour achieved. I guess the bigger lesson here for the many T-SQL people out there is; don’t forget the basics, its still a structured query language. Syntax sugar is sweet, but not essential.
Hope this was helpful.
Many thanks for reading
Warning, this is going to be a dry post. Beer or water required before you continue reading.
For those that don’t already know Azure Stream Analytics is Microsoft’s cloud based service for the handling and manipulation of data feeds, predominantly in real-time. However the service has many applications with the ability to ingest a wide variety of data sources. One such use case for the analysis of data using the service comes when performing aggregations on the stream received by the job as an input. Like conventional querying the aggregation would be done against some grouping of the data set. Stream Analytics is no exception, however we need to understand that where we would previously alter our query to aggregate things by different groups of attributes (the query changes). We now fix our query when starting the job and stream data past it (the data changes). It is this shift in mind-set from a fixed dataset to a data stream that means we need to consider different grouping conditions for the attributes presented.
In Stream Analytics because, as the name suggests, data is streamed to/through the service time becomes the key for the grouping we impose for our aggregations. It is this time period for the grouping that is referred to as the window. To better understand what these windows look like lets compare some query syntax that we might use to produce a count on a dataset by day vs how we would aggregate this in a data stream by a window of day.
If only everything could be explained by comparing a SQL like syntax for two different technologies 🙂
Assuming SQL is part of your brain parser we can predict from the above queries what results we are expecting to see. However for the data stream if nothing is received by our input within the defined window the count will be zero. I repeat the data changes, not the query as it’s streamed through the service. Thinking about the opposite of that for the above dataset query the results will only grow to include many day values as more data is inserted into the table. Where as the results in data stream will always return only one value as the data streams through the service and the window moves with the defined time period in the grouping.
Hopefully we now understand what a window is in our Stream Analytics job. Both in English and SQL!
If you still aren’t sure try this 1 liner from Microsoft’s MSDN page.
A window contains event data along a timeline and enables you to perform various operations against the events within that window.
Next then, window types. This window of time for our stream of data can then take 3x different forms depending on what we need to report on; sliding, tumbling and hoping. Lets explore each in more detail with some potential real world examples to give there purpose some context. For the examples I’m going to use the health services as a basis for my scenarios.
As a name suggests this first type of Stream Analytics windows slides with time. It has a defined size or duration and once set will move forwards aggregating any values in its scope.
1 2 3 4 5 6 7 8 9 10 11 12
SELECT DateAdd(hour,-1,System.TimeStamp) AS 'WinStartTime', System.TimeStamp AS 'WinEndTime', AVG([WaitTime]) AS 'SenorValue' INTO [TargetAlerts] FROM [SystemInput] TIMESTAMP BY [CreatedDateTime] GROUP BY SlidingWindow(Duration(hour, 1)) --window criteria
We are running an A&E department and we want to know the average wait time for all patients currently in the waiting room with a window duration of 1 hour. We may then want alerts if the current window value exceeds a set tolerance. What else happened in that window to clause that? Or we could just provide this information to waiting patients on a TV screen with a note that the sample is from a fixed size window. Call centres often provide there average wait time to answer the phone without any context of the window in which the aggregation was taken resulting in mislead expectations.
A tumbling window is a little harder to explain because how can time tumble. Well it doesn’t and nothing about tumbling down something should be implied. Instead think of the day as 24 separate windows all 1 hour in size or duration. Values stream into our job in the current 1 hour window. Then as we tick into the next hour the aggregate tumbles over and resets for the new window.
1 2 3 4 5 6 7 8 9 10 11 12
SELECT DateAdd(day,-1,System.TimeStamp) AS 'WinStartTime', System.TimeStamp AS 'WinEndTime', COUNT([ReferralId]) AS 'SenorValue' INTO [DashboardOutput] FROM [SystemInput] TIMESTAMP BY [CreatedDateTime] GROUP BY TumblingWindow(Duration(day, 1), Offset(millisecond, -1)) --window criteria
We are monitoring referrals being entered onto our hospitals computer system with a tumbling window of 1 day. Each day we count the number of referrals entered and display the information on a real-time dashboard. Triggers could then be put in place if the referrals entered fall below a given tolerance. Or if they exceed expected amounts because data isn’t being entered uniformly throughout the week.
Hoping windows get a little easier to understand assuming we are comfortable with tumbling windows. The reason being is that a hop is the same as a tumble, but with an overlap in time. Sticking with our 24 hour day we could still have 24 hoping windows, but they now have a size or duration of 1.5hours meaning a 30 minute overlap.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
SELECT DateAdd(hour,-1,System.TimeStamp) AS 'WinStartTime', System.TimeStamp AS 'WinEndTime', [WardName], COUNT([PatientId]) AS 'SenorValue' INTO [DashboardOutput] FROM [SystemInput] TIMESTAMP BY [CreatedDateTime] GROUP BY [WardName], HoppingWindow(Duration(hour, 1), Hop(minute, 30), Offset(millisecond, -1)) --window criteria
We are monitoring the number of patients on our wards with a simple count. A patient is then moved between wards but the move involves escorting the patient by wheel chair to the other side of the hospital. We want to allow enough time for the physical move to occur in our window aggregation so we allow a 30 minute overlap in the tumble. This means the patient may be double counted for a brief time. But that might be considered a better position than them not appearing on any ward. Eg. They virtual disappeared.
I hope you found this post useful. I did say it was going to be dry. There are a few pictures on the MSDN pages which might help with understanding but for me they were a little static and needed some animation or colour to be properly representative of each window type.
Many thanks for reading.
When Battleships Met SQL Server, like Bailys meeting Coffee, but better.
Do you have an equal love of the classic two player board game Battleship and also SQL Server database development? If the answer is yes (which it obviously should be) then this is the blog post for you!
The next question that I’m more than happy to answer for you is; how can I combine these 2x loves?… Well my friend, please continue reading to find out.
So, let’s create ourselves a SQL Server database to play a game of Battleships using tables as our grid and ship layout, then let’s use stored procedures as our weapons in which to fire on your opponent (executing statements against their tables). The other unique advantage of playing SQL BattleShips is that both players don’t need to be in the same room verbally querying each other’s grid references and board. In other words, why use English to query this data from a person when you can use SQL to query a database table.
If you haven’t already done so, please download the zip file from the Downloads page in the database backups section labelled BattleShips.zip. It contains a SQL Server 2014 database backup file of the database called BattleShips. Please restore the database to a suitable SQL instance (and please check the assumption section below). I say a suitable SQL instance because apparently my previous employer didn’t think we needed it on every production server in the organisation!
A Few Assumptions
- You or the person creating the new game, which doesn’t have to be one of the players has sysadmin access to the SQL Server instance where the Battleships database has been restored
Why? The new game procedure alters various database objects and setups up SQL logins for each player.
- The SQL instance has database mail setup and is working.
Why? For ease of automation and sending each player details of what to do next. Including login details.
- The SQL instance is setup for mixed mode authentication.
Why? Each player will have a SQL authenticated login created on the SQL instance in order to be able to connect to the BattleShips database and play with their respective objects. The database users then have a set of permissions defined which prevents each player for seeing each other’s tables (ship board).
- You and your opponent are aware of how to edit/query database table data, plus views and execute stored procedures with required parameters in SQL Server Management Studio (SSMS).
Why? Because this is SQL BattleShips and it can’t be played using English queries.
Creating A New Game
To create a new game, execute the following procedure adding an email address for each player as required.
1 2 3 4 5 6
USE [BattleShips] GO EXEC [dbo].[NewGame] @Player1EmailAddress = Nfirstname.lastname@example.org', @Player2EmailAddress = Nemail@example.com'
Using database mail players will then receive details instructions on what to do and where to connect to using a set of SQL credentials provided. For information the body of this email is sorted in the [dbo].[EmailText].
To isolate players from each other as part of the game SQL authenticated user accounts are created for each player which only grants the necessary permissions to the respective player’s databases objects. This is handled using database schema’s and execute only permissions to certain procedures. Players will not have access to any other database on the SQL instance.
If you are a sysadmin you can of course very easily cheat, but this is not in the spirit of the game and it is expected that each player only uses their SQL account provided in the email sent out.
How To Play
Following the instructions provided by email, once you’ve connected to the BattleShips database in SSMS, edit your ‘board’ table. [playerX].[board]. Add all your available ships and close the table editor once complete.
When you are happy with the strategic planning (layout) execute the following stored procedure. This will validate your board layout and provide feedback if a ship is the incorrect length or has been missed. If everything is correct your status will be set to ready.
Once both players have completed this operation can battle commence! Query the table players if you are unsure of the other player’s status.
SELECT * FROM [dbo].[Players]
They may also be ready to play or still setting up their board. If you’re playing in the same room feel free to speak to each other, it’s not against the rules, just not in the sprite of SQL BattleShips.
When both players are ready start taking shots at each other’s ship grids in turn, with player 1 going first. This can be done by executing the following procedure. You’ll of course want to update the gird reference parameter value once you’ve eliminated A01.
EXEC [player1].[TakeShot] @GridReference = 'A01'
Following execution feedback will be returned about the success of your shot and the states of both your board and a grid of which coordinates you have already tried. If you use the same grid reference twice you’ll be warned so your shot isn’t wasted.
After a winner has been established by an opponent sinking all of their rival’s ships (or updating all of their ships with an X for hit). The players table will be updated with the winner’s status and the loser will no longer be able to take any more shot.
Happy hunting my friends and please give me your feedback on any epic battles, and maybe the coding!
Many thanks for reading.