AutoML; How to use it and why
Some of you may have noticed that the Purple Frog team were at the largest community-led SQL Server and Microsoft Data Platform conference in Europe last week; SQL Bits. With all of us learning a lot, I thought I would share some of the knowledge I picked up in the Machine Learning space with you this week. I went to a session by Gavita Regunath of Advancing Analytics on the topic of AutoML, and why not just beginners in the ML space should be using it, but also seasoned pros as a way to get a jump start on model creation and prototyping. In this blog, I will walk you through how to set up and run an AutoML experiment in the Machine Learning Studio of Azure Synapse (although there are many other AutoML options out there) and then briefly summarize the benefits for a beginner and the more experienced.
What is AutoML?
It is what it says on the tin; it’s automatic machine learning. In essence you give the tool the dataset you are using, the kind of question you are asking of that dataset (for Azure Synapse this is classification, regression or time series forecasting) and then the parameter you are interested in. You then run the tool, which will do any pre-processing it deems fit (imputation, scaling etc) and run the data through as many models as it can in the time frame you have given it. When complete it will show you the best model it found with the metrics it used as well as the hyperparameters it used.
How to use AutoML
We shall be working on the presumption that you are already a user of Azure Synapse, so the first step will be to set up a Azure Machine Learning Studio resource. To do this, navigate to the home screen on your Azure Synapse subscription and click ‘Create a resource’.
Then search for Machine Learning and then click the above shown icon which will bring you to the above screen, where you will then want to click create and fill in the following fields. A lot of the fields will already have the necessary components if you are already using Azure Synapse. If not, creating them is straight forward with the hyperlinks below the boxes. Once these are filled in you then need to click ‘Review + create’ and then wait till the Machine Learning resource has been created.
Next, navigate to your resource and launch the studio. We are almost at a point where we can start doing some AutoML! However, we need to create a compute instance to run our AutoML.
You will be met by the home screen of the Machine Learning Studio, and the first thing we need to do is navigate to the manage ribbon and select compute; then new.
For the next step you will be met with the below screen, and I would recommend the following settings (the compute name can be whatever you want as long as its unique). Additionally, you would only need a GPU virtual machine type if you were intending on performing some deep learning, which we will not cover in this blog post.
We then wait till that’s created and you go back to the studio home screen. You then want to select ‘Start now’ on the ‘Automated ML’ square. Then select ‘New Automated ML run’
Next, you will see the below screen. For a first run you can use one of the built in datasets. To do this, click ‘Create dataset’, then ‘from open datasets’.
I will be showing you how to solve a classification problem, and the Diabetes dataset will be the most conducive for this (although peculiarly it doesn’t state if you have diabetes or not, so we will be predicting sex); so select this dataset, press next, give it a name and click create.
You will then be taken back to the screen where we chose to create a dataset, and where we can continue to set up the AutoML experiment. Put a tick next to the dataset you have just set up, and then click next to go to the ‘configure run’ screen.
You will then have the above to be filled out. The target column is what we are trying to predict and in our experiment we are using a compute instance, of which you then need to select the one you created earlier. Upon clicking next you simply want to just select that you will be dealing with a classification problem and on the additional configuration settings change the training job time to 0.5 hours (to save time and resource). On the next screen keep everything as default for this run.
Click finish and the experiment will start. Wait until the experiment completes, then you will be met with the below screen with an output of the best model found.
We see that a voting ensemble algorithm was settled upon with an AUC of 0.7655, which is quite good. If we click on voting ensemble we can get more details (of which I will go through a couple of key points). Firstly, we can click ‘view all other metrics’ to get a holistic view of model performance.
Under the explanations tab you can see the feature importance (I have limited it to the top 4 features).
Finally, in the model tab you can get the details of the model in terms of pre-processing and hyperparameter tuning. In regard to our voting ensemble algorithm that was settled on, we have model details for each individual model which makes up the ensemble.
Why should beginners use AutoML
As I’ve shown above, the actual process of creating a model using AutoML requires no knowledge of machine learning to get what can be a respectable and viable model, which you can also deploy from the same environment you created it in. The ability to read into what hyper parameter settings have been used as well as pre-processing allows you to analyze what the tool has done and learn from this.
Why should experienced people use AutoML
In short, it can save you a lot of time and give you a head start in your model creation process. For example, in this blog I have shown that in the space of half an hour the AutoML tool tried 51 different model, hyperparameter and pre processing combinations to arrive at its best model. This is something even the most seasoned data scientist would take a lot longer to do. In addition to this, as it provides you with info on the final model, you can take this away and further refine it to your specific needs.
In closing, I hope I have given you a solid guide in how to use the AutoML capabilities that Azure Synapse provides. I also hope I have convinced you that AutoML in general is a tool that can and should be used by all regardless of ability and knowledge of ML.