0845 643 64 63

Monthly Archives: March 2022

AutoML; How to use it and why

Introduction

Some of you may have noticed that the Purple Frog team were at the largest community-led SQL Server and Microsoft Data Platform conference in Europe last week; SQL Bits. With all of us learning a lot, I thought I would share some of the knowledge I picked up in the Machine Learning space with you this week. I went to a session by Gavita Regunath of Advancing Analytics on the topic of AutoML, and why not just beginners in the ML space should be using it, but also seasoned pros as a way to get a jump start on model creation and prototyping. In this blog, I will walk you through how to set up and run an AutoML experiment in the Machine Learning Studio of Azure Synapse (although there are many other AutoML options out there) and then briefly summarize the benefits for a beginner and the more experienced.

What is AutoML?

It is what it says on the tin; it’s automatic machine learning. In essence you give the tool the dataset you are using, the kind of question you are asking of that dataset (for Azure Synapse this is classification, regression or time series forecasting) and then the parameter you are interested in. You then run the tool, which will do any pre-processing it deems fit (imputation, scaling etc) and run the data through as many models as it can in the time frame you have given it. When complete it will show you the best model it found with the metrics it used as well as the hyperparameters it used.

How to use AutoML

We shall be working on the presumption that you are already a user of Azure Synapse, so the first step will be to set up a Azure Machine Learning Studio resource. To do this, navigate to the home screen on your Azure Synapse subscription and click ‘Create a resource’.

Then search for Machine Learning and then click the above shown icon which will bring you to the above screen, where you will then want to click create and fill in the following fields. A lot of the fields will already have the necessary components if you are already using Azure Synapse. If not, creating them is straight forward with the hyperlinks below the boxes. Once these are filled in you then need to click ‘Review + create’ and then wait till the Machine Learning resource has been created.

Next, navigate to your resource and launch the studio. We are almost at a point where we can start doing some AutoML! However, we need to create a compute instance to run our AutoML.

You will be met by the home screen of the Machine Learning Studio, and the first thing we need to do is navigate to the manage ribbon and select compute; then new.

For the next step you will be met with the below screen, and I would recommend the following settings (the compute name can be whatever you want as long as its unique). Additionally, you would only need a GPU virtual machine type if you were intending on performing some deep learning, which we will not cover in this blog post.

We then wait till that’s created and you go back to the studio home screen. You then want to select ‘Start now’ on the ‘Automated ML’ square. Then select ‘New Automated ML run’

Next, you will see the below screen. For a first run you can use one of the built in datasets. To do this, click ‘Create dataset’, then ‘from open datasets’.

I will be showing you how to solve a classification problem, and the Diabetes dataset will be the most conducive for this (although peculiarly it doesn’t state if you have diabetes or not, so we will be predicting sex); so select this dataset, press next, give it a name and click create.

You will then be taken back to the screen where we chose to create a dataset, and where we can continue to set up the AutoML experiment. Put a tick next to the dataset you have just set up, and then click next to go to the ‘configure run’ screen.

You will then have the above to be filled out. The target column is what we are trying to predict and in our experiment we are using a compute instance, of which you then need to select the one you created earlier. Upon clicking next you simply want to just select that you will be dealing with a classification problem and on the additional configuration settings change the training job time to 0.5 hours (to save time and resource). On the next screen keep everything as default for this run.

Click finish and the experiment will start. Wait until the experiment completes, then you will be met with the below screen with an output of the best model found.

We see that a voting ensemble algorithm was settled upon with an AUC of 0.7655, which is quite good. If we click on voting ensemble we can get more details (of which I will go through a couple of key points). Firstly, we can click ‘view all other metrics’ to get a holistic view of model performance.

Under the explanations tab you can see the feature importance (I have limited it to the top 4 features).

Finally, in the model tab you can get the details of the model in terms of pre-processing and hyperparameter tuning. In regard to our voting ensemble algorithm that was settled on, we have model details for each individual model which makes up the ensemble.

Why should beginners use AutoML

As I’ve shown above, the actual process of creating a model using AutoML requires no knowledge of machine learning to get what can be a respectable and viable model, which you can also deploy from the same environment you created it in. The ability to read into what hyper parameter settings have been used as well as pre-processing allows you to analyze what the tool has done and learn from this.

Why should experienced people use AutoML

In short, it can save you a lot of time and give you a head start in your model creation process. For example, in this blog I have shown that in the space of half an hour the AutoML tool tried 51 different model, hyperparameter and pre processing combinations to arrive at its best model. This is something even the most seasoned data scientist would take a lot longer to do. In addition to this, as it provides you with info on the final model, you can take this away and further refine it to your specific needs.

Summary

In closing, I hope I have given you a solid guide in how to use the AutoML capabilities that Azure Synapse provides. I also hope I have convinced you that AutoML in general is a tool that can and should be used by all regardless of ability and knowledge of ML.

Combining Queries from Multiple Sources in Power BI using Merge and Append

It is always good practice to do as much data preparation as close to the sources as you can before importing or connecting them to your Power BI reports, but what if there are circumstances where this isn’t possible?

I had an issue recently where a third-party application had been updated and both the new and legacy versions were being used side-by-side. Logging data from both versions was being written to two separate Azure SQL databases.

The customer needed a Power BI report showing both old and new logging data sets as a single source. If both databases were SQL Server databases, I could have written a view with a cross-database join, imported that into Power BI and thought no more about it. However, the two sources being Azure SQL Databases, with no easy way to join the tables, caused an issue.

This is where the Merge and Append functions in Power Query come in.

The first step to solve my issue was to create a view in each of the Azure databases that were identical in structure (you’ll see why later on) and import these into Power BI.

Now these data sources (referred to as ‘Queries’ by Power BI) have both been imported into the data model, we have two options with regards to how to combine them, ‘Append’ and ‘Merge’.

Merge
Although I didn’t use the ‘Merge’ function, I have included some information about it here as it is still relevant. ‘Merge’ is useful when you have columns from one source that you would like to add to another, the simplest way to think about it is that it works in the same way a JOIN works in SQL, in fact, when you enter the ‘Merge’ wizard there is a ‘Join Kind’ option:

This is how the ‘Merge’ function works:

Append
To solve my issue (explained above), I used the ‘Append’ function. While still combining 2 sources it concatenates one query with another, it’s SQL equivalent would be a UNION ALL. ‘Append’ gives you the option to combine as many tables as you wish, regardless of structure. If a column exists in one query but not another, the column will be filled with NULLS where applicable, this is why it was important to create the two identical views at the very beginning.

This is how the ‘Append’ function works:

When I appended the two sources, I chose the option to create ‘as New’ so I can hide the original 2 queries and I have a nice new table (as seen below) which I can rename, tidy up and make a bit more user friendly, ready for the report consumer to use.

As previously mentioned, data preparation should be done as close to the source as possible, but in situations where this is difficult or simply isn’t possible, it’s important to know about these Power BI functions and how you can use them to your advantage.


More information can be found here:
Append queries – Power Query | Microsoft Docs
Merge queries overview – Power Query | Microsoft Docs

Power BI Sentinel
The Frog Blog

Team Purple Frog specialise in designing and implementing Microsoft Data Analytics solutions, including Data Warehouses, Cubes, SQL Server, SSIS, ADF, SSAS, Power BI, MDX, DAX, Machine Learning and more.

This is a collection of thoughts, ramblings and ideas that we think would be useful to share.

Authors:

Alex Whittles
(MVP)
Jeet Kainth
Jon Fletcher
Nick Edwards
Joe Billingham
Lewis Prince
Reiss McSporran
Microsoft Gold Partner

Data Platform MVP

Power BI Sentinel
Frog Blog Out
twitter
rssicon