Real-time Data

Azure Stream Analytics Windowing Queries

stream-analyticsWarning, this is going to be a dry post. Beer or water required before you continue reading.

For those that don’t already know Azure Stream Analytics is Microsoft’s cloud based service for the handling and manipulation of data feeds, predominantly in real-time. However the service has many applications with the ability to ingest a wide variety of data sources.  One such use case for the analysis of data using the service comes when performing aggregations on the stream received by the job as an input. Like conventional querying the aggregation would be done against some grouping of the data set. Stream Analytics is no exception, however we need to understand that where we would previously alter our query to aggregate things by different groups of attributes (the query changes). We now fix our query when starting the job and stream data past it (the data changes). It is this shift in mind-set from a fixed dataset to a data stream that means we need to consider different grouping conditions for the attributes presented.

In Stream Analytics because, as the name suggests, data is streamed to/through the service time becomes the key for the grouping we impose for our aggregations. It is this time period for the grouping that is referred to as the window. To better understand what these windows look like lets compare some query syntax that we might use to produce a count on a dataset by day vs how we would aggregate this in a data stream by a window of day.


Dataset Data Stream
dataset-query data-stream-query2

If only everything could be explained by comparing a SQL like syntax for two different technologies 🙂

Assuming SQL is part of your brain parser we can predict from the above queries what results we are expecting to see. However for the data stream if nothing is received by our input within the defined window the count will be zero. I repeat the data changes, not the query as it’s streamed through the service. Thinking about the opposite of that for the above dataset query the results will only grow to include many day values as more data is inserted into the table. Where as the results in data stream will always return only one value as the data streams through the service and the window moves with the defined time period in the grouping.

Hopefully we now understand what a window is in our Stream Analytics job. Both in English and SQL!

If you still aren’t sure try this 1 liner from Microsoft’s MSDN page.

A window contains event data along a timeline and enables you to perform various operations against the events within that window.

Reference: https://msdn.microsoft.com/en-us/library/azure/dn835019.aspx

Next then, window types. This window of time for our stream of data can then take 3x different forms depending on what we need to report on; sliding, tumbling and hoping. Lets explore each in more detail with some potential real world examples to give there purpose some context. For the examples I’m going to use the health services as a basis for my scenarios.

Sliding

As a name suggests this first type of Stream Analytics windows slides with time. It has a defined size or duration and once set will move forwards aggregating any values in its scope.

Query example:

1
2
3
4
5
6
7
8
9
10
11
12
SELECT 
	DateAdd(hour,-1,System.TimeStamp) AS 'WinStartTime',
	System.TimeStamp AS 'WinEndTime',
	AVG([WaitTime]) AS 'SenorValue'
INTO
	[TargetAlerts]
FROM 
	[SystemInput]
TIMESTAMP BY 
	[CreatedDateTime]
GROUP BY
	SlidingWindow(Duration(hour, 1)) --window criteria

Scenario:

We are running an A&E department and we want to know the average wait time for all patients currently in the waiting room with a window duration of 1 hour. We may then want alerts if the current window value exceeds a set tolerance. What else happened in that window to clause that? Or we could just provide this information to waiting patients on a TV screen with a note that the sample is from a fixed size window. Call centres often provide there average wait time to answer the phone without any context of the window in which the aggregation was taken resulting in mislead expectations.

Tumbling

A tumbling window is a little harder to explain because how can time tumble. Well it doesn’t and nothing about tumbling down something should be implied. Instead think of the day as 24 separate windows all 1 hour in size or duration. Values stream into our job in the current 1 hour window. Then as we tick into the next hour the aggregate tumbles over and resets for the new window.

Query example:

1
2
3
4
5
6
7
8
9
10
11
12
SELECT 
	DateAdd(day,-1,System.TimeStamp) AS 'WinStartTime',
	System.TimeStamp AS 'WinEndTime',
	COUNT([ReferralId]) AS 'SenorValue'
INTO
	[DashboardOutput]
FROM 
	[SystemInput]
TIMESTAMP BY 
	[CreatedDateTime]
GROUP BY
	TumblingWindow(Duration(day, 1), Offset(millisecond, -1)) --window criteria

Scenario:

We are monitoring referrals being entered onto our hospitals computer system with a tumbling window of 1 day. Each day we count the number of referrals entered and display the information on a real-time dashboard. Triggers could then be put in place if the referrals entered fall below a given tolerance. Or if they exceed expected amounts because data isn’t being entered uniformly throughout the week.

Hoping

Hoping windows get a little easier to understand assuming we are comfortable with tumbling windows. The reason being is that a hop is the same as a tumble, but with an overlap in time. Sticking with our 24 hour day we could still have 24 hoping windows, but they now have a size or duration of 1.5hours meaning a 30 minute overlap.

Query example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT 
	DateAdd(hour,-1,System.TimeStamp) AS 'WinStartTime',
	System.TimeStamp AS 'WinEndTime',
	[WardName],
	COUNT([PatientId]) AS 'SenorValue'
INTO
	[DashboardOutput]
FROM 
	[SystemInput]
TIMESTAMP BY 
	[CreatedDateTime]
GROUP BY
	[WardName],
	HoppingWindow(Duration(hour, 1), Hop(minute, 30), Offset(millisecond, -1)) --window criteria

Scenario:

We are monitoring the number of patients on our wards with a simple count. A patient is then moved between wards but the move involves escorting the patient by wheel chair to the other side of the hospital. We want to allow enough time for the physical move to occur in our window aggregation so we allow a 30 minute overlap in the tumble. This means the patient may be double counted for a brief time. But that might be considered a better position than them not appearing on any ward. Eg. They virtual disappeared.

 

I hope you found this post useful. I did say it was going to be dry. There are a few pictures on the MSDN pages which might help with understanding but for me they were a little static and needed some animation or colour to be properly representative of each window type.

Many thanks for reading.

Paul’s Frog Blog

Paul is a Microsoft Data Platform MVP with 10+ years’ experience working with the complete on premises SQL Server stack in a variety of roles and industries. Now as the Business Intelligence Consultant at Purple Frog Systems has turned his keyboard to big data solutions in the Microsoft cloud. Specialising in Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics, Event Hubs and IoT. Paul is also a STEM Ambassador for the networking education in schools’ programme, PASS chapter leader for the Microsoft Data Platform Group – Birmingham, SQL Bits, SQL Relay, SQL Saturday speaker and helper. Currently the Stack Overflow top user for Azure Data Factory. As well as very active member of the technical community.
Thanks for visiting.
@mrpaulandrew