
Deploy Azure AutoML Model And Consume In Excel

It has come back to my turn to write a blog post. If you remember, my previous one covered why you should use Azure-based AutoML and how to do so. If you followed that, you will be left with a model that you have scored and whose performance you know, but no way to deploy and use it. In this post I will outline the steps needed to do that (which involve a major shortcut, as we are using an AutoML model), and then show you the VBA required to consume the model in Microsoft Excel.

Create an endpoint

The first step is to navigate to your selected model in Azure Machine Learning Studio, click the dropdown on the Deploy tab and select 'Deploy to web service'.

You will then be met by a screen which is where we really benefit from using an AutoML model! If this were not an AutoML model, we would have two other fields here asking for an entry script file and a Conda dependencies file, the point of these being to let whatever is calling the model know how to use it and what packages are needed to run it. Luckily, AutoML has already done this for you, so all you need to do is give the deployment a name (no spaces or special characters allowed) and, for ease of use, choose an Azure Container Instance for the compute type. Then click deploy!

Then navigate to the ribbon on the left of your screen and select Endpoints to see your endpoint. Open it and you will see a deployment state of 'Transitioning' for a while; once this shows 'Healthy' you are ready to go. When ready, you will see a few tabs at the top of the page (under the deployment's name). The main ones are 'Test', where you can manually enter values in JSON format to get a result, and 'Consume', where you can see the REST endpoint you have created as well as useful code snippets showing how to call your model from C#, Python and R; but not Excel, which will be more familiar to your average data consumer.
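
For reference, the Python snippet on the 'Consume' tab boils down to a plain HTTP POST of a JSON payload to the REST endpoint. The sketch below is only indicative: the endpoint URL and feature values are placeholders, and the exact payload shape should be checked against your own 'Test' tab.

import json
import urllib.request

# Placeholder endpoint and feature values - replace with your own
scoring_uri = "http://<your-container-instance>.uksouth.azurecontainer.io/score"
payload = {"Inputs": {"data": [[1.0, 2.0, 3.0, 4.0]]}, "GlobalParameters": 1.0}

# POST the JSON payload to the scoring endpoint and print the raw response
request = urllib.request.Request(
    scoring_uri,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))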

How to consume in Excel

Before we write any VBA we need to make a table in a spreadsheet for the data to be entered into and the results to be output to. If you go back to the 'Test' tab, it's useful to look at the JSON to see what order the features will be read in, as they must match the order shown in the JSON.

You can then create a table which mirrors this.

You then need to write some VBA to link these elements up, so first make a VBA module. To do this, open the VBA editor, which is accessed via the Developer tab by clicking 'Visual Basic'.

You then need to click the VBA project with the name of your Excel workbook, then click the Insert tab and select 'Module'. This is where you will write your VBA, which we will do in two sections. The first is a function which performs the HTTP request to the endpoint: we want it to take our features and the REST endpoint, make an HTTP request with them and store the response. The second is a sub that creates the JSON we saw in the 'Test' tab earlier, as this is how the endpoint needs the input data packaged, and then passes it to the function to send to your model.

Function RunScore(Features, RestEndpoint)

    If RestEndpoint <> "" Then
        Url = RestEndpoint
        ' Create the HTTP object and POST the JSON payload to the scoring endpoint
        Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
        objHTTP.Open "POST", Url, False
        objHTTP.setRequestHeader "Content-type", "application/json"
        objHTTP.send Features
        ' Return the raw JSON response text
        RunScore = objHTTP.ResponseText
    Else
        MsgBox "No web URL provided"
    End If

End Function
Sub Score()

    On Error GoTo ErrorHandler

    ' Count how many rows of feature data there are, starting at B2
    Range("B2").Select
    NumRows = Range(Selection, Selection.End(xlDown)).Rows.Count

    For r = 2 To NumRows + 1

        ' Build the JSON payload in the format shown on the 'Test' tab
        body = "{" & Chr(34) & "Inputs" & Chr(34) & ":{" & Chr(34) & "data" & Chr(34) & ": [[" & Range("B" & r) & "," & Range("C" & r) & "," & Range("D" & r) & "," & Range("E" & r) & "]]}," & Chr(34) & "GlobalParameters" & Chr(34) & ": 1.0}"

        ' Your REST endpoint from the 'Consume' tab
        WEBURL = "http://999def99-9e9d-999e-b99a-aa1c12820857.uksouth.azurecontainer.io/score"

        ' Call the scoring function with the payload and endpoint
        ret = RunScore(body, WEBURL)

        ' The prediction comes back wrapped in brackets, so extract and tidy it
        openPos = InStr(ret, "[")
        closePos = InStr(ret, "]")
        midbit = Mid(ret, openPos + 1, closePos - openPos - 1)
        midbit = Round(midbit, 2)

        Range("F" & r) = midbit

    Next r

Exit Sub
ErrorHandler:
      MsgBox (Err.Description)
End Sub

Ignoring anything that mentions the error handler (this is just a wrapper that tells you what has gone wrong, if something does go wrong), we select the first cell where there should be data and then find out how many rows of data there are. We then loop through the rows, creating the JSON for each row and storing it in body. It is essential that you use the string exactly as it is above, only editing the cell locations to match the length and order of your data; the features must be in the order shown in the JSON we looked at earlier. After storing the REST endpoint in a variable, we call the function we made earlier. Its output needs a little manipulation, as it comes back wrapped in brackets, and the extracted value is then placed into the result column we created in the spreadsheet earlier.

We now need a button in the spreadsheet to allow us to run the code 'at the click of a button', if you'll excuse the pun. This can be done via the Developer tab in Excel, where you need to click Insert and select a button. The button is placed via a click-and-drag action in the spreadsheet, exactly like inserting shapes in Excel. Upon releasing your mouse button, a menu will pop up listing the macros you have in the workbook; select 'Score' (or however you have named it), which will assign the macro to the button.

It is now ready to go. You can insert the features you want values for (on a row-by-row basis) and, once you have entered them in the table, simply click the button we created; the model will be run and you will see the results output in your table!

AutoML; How to use it and why

Introduction

Some of you may have noticed that the Purple Frog team were at the largest community-led SQL Server and Microsoft Data Platform conference in Europe last week: SQLBits. With all of us learning a lot, I thought I would share some of the knowledge I picked up in the Machine Learning space with you this week. I went to a session by Gavita Regunath of Advancing Analytics on the topic of AutoML, and why not just beginners in the ML space, but also seasoned pros, should be using it as a way to get a jump start on model creation and prototyping. In this blog, I will walk you through how to set up and run an AutoML experiment in the Machine Learning Studio of Azure Synapse (although there are many other AutoML options out there) and then briefly summarize the benefits for beginners and the more experienced.

What is AutoML?

It does what it says on the tin; it's automatic machine learning. In essence, you give the tool the dataset you are using, the kind of question you are asking of that dataset (for Azure Synapse this is classification, regression or time series forecasting) and the target column you are interested in. You then run the tool, which will do any pre-processing it deems fit (imputation, scaling etc.) and run the data through as many models as it can in the time frame you have given it. When complete, it will show you the best model it found, along with the metrics and hyperparameters it used.

How to use AutoML

We shall be working on the presumption that you are already a user of Azure Synapse, so the first step is to set up an Azure Machine Learning resource. To do this, navigate to the home screen of your Azure subscription and click 'Create a resource'.

Then search for Machine Learning, click the icon, and you will be taken to the creation screen, where you will want to click 'Create' and fill in the required fields. A lot of the fields will already have the necessary components if you are already using Azure Synapse; if not, creating them is straightforward using the hyperlinks below the boxes. Once these are filled in, click 'Review + create' and wait until the Machine Learning resource has been created.

Next, navigate to your resource and launch the studio. We are almost at a point where we can start doing some AutoML! However, we need to create a compute instance to run our AutoML.

You will be met by the home screen of the Machine Learning Studio, and the first thing we need to do is navigate to the manage ribbon and select compute; then new.

For the next step you will be met with the compute creation screen, and I would recommend the following settings (the compute name can be whatever you want as long as it's unique). Additionally, you would only need a GPU virtual machine type if you were intending to perform some deep learning, which we will not cover in this blog post.

We then wait until that's created and go back to the studio home screen. You then want to select 'Start now' on the 'Automated ML' tile, then select 'New Automated ML run'.

Next, you will see the dataset selection screen. For a first run you can use one of the built-in datasets. To do this, click 'Create dataset', then 'From open datasets'.

I will be showing you how to solve a classification problem, and the Diabetes dataset will be the most conducive for this (although peculiarly it doesn’t state if you have diabetes or not, so we will be predicting sex); so select this dataset, press next, give it a name and click create.

You will then be taken back to the screen where we chose to create a dataset, and where we can continue to set up the AutoML experiment. Put a tick next to the dataset you have just set up, and then click next to go to the ‘configure run’ screen.

You will then have a configuration form to fill out. The target column is what we are trying to predict, and in our experiment we are using a compute instance, so select the one you created earlier. Upon clicking next, simply select that you are dealing with a classification problem and, in the additional configuration settings, change the training job time to 0.5 hours (to save time and resource). On the next screen keep everything as default for this run.
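
As a side note, if you prefer code to clicking through the UI, the same run can be configured with the Azure Machine Learning Python SDK. The sketch below is only indicative: the class and parameter names are from the v1 azureml SDK and may differ between versions, and the workspace config, dataset name, target column and compute name are placeholders.

from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

# Placeholder names - use your own workspace config, registered dataset and compute instance
ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name="diabetes-dataset")

automl_config = AutoMLConfig(
    task="classification",             # the problem type chosen in the UI
    training_data=dataset,
    label_column_name="SEX",           # the target column we are predicting
    primary_metric="AUC_weighted",
    compute_target="my-compute-instance",
    experiment_timeout_hours=0.5,      # matches the 0.5 hour training job time
)

run = Experiment(ws, "automl-diabetes").submit(automl_config, show_output=True)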

Click finish and the experiment will start. When the experiment completes, you will be shown an output of the best model found.

We see that a voting ensemble algorithm was settled upon with an AUC of 0.7655, which is quite good. If we click on voting ensemble we can get more details (of which I will go through a couple of key points). Firstly, we can click ‘view all other metrics’ to get a holistic view of model performance.

Under the explanations tab you can see the feature importance (I have limited it to the top 4 features).

Finally, in the model tab you can get the details of the model in terms of pre-processing and hyperparameter tuning. Since a voting ensemble algorithm was settled on, we get model details for each individual model that makes up the ensemble.

Why should beginners use AutoML?

As I've shown above, the actual process of creating a model using AutoML requires no knowledge of machine learning to get what can be a respectable and viable model, which you can also deploy from the same environment you created it in. The ability to read which hyperparameter settings and pre-processing were used lets you analyze what the tool has done and learn from it.

Why should experienced people use AutoML?

In short, it can save you a lot of time and give you a head start in your model creation process. For example, in this blog I have shown that in the space of half an hour the AutoML tool tried 51 different model, hyperparameter and pre-processing combinations to arrive at its best model. This is something even the most seasoned data scientist would take a lot longer to do. In addition, as it provides you with details of the final model, you can take this away and further refine it to your specific needs.

Summary

In closing, I hope I have given you a solid guide on how to use the AutoML capabilities that Azure Synapse provides. I also hope I have convinced you that AutoML in general is a tool that can and should be used by everyone, regardless of ability and knowledge of ML.

Pandas; Why Use It And How To Do So!

Introduction

Hi and welcome to what will be my first frog blog! I’m Lewis Prince, a new addition to the Purple Frog team who has come on board as a Machine Learning Developer. My skill set resides mainly in Data Science and Statistics, and using Python and R to apply these. Therefore my blogs will be primarily on hints and tips on performing Data Science and Statistics through the medium of Python (and possibly R).

I thought I would start my blog with what I feel is a very important library for anyone who wants to perform data manipulation and/or analysis with large datasets in Python: Pandas! In this blog post I will firstly go over what Pandas is and why you should be using it, as well as introducing and showing you how to use some of the most useful commands and features of the library.

For reference, I will show you how to create a pandas data frame, add data to it (append), index it, get unique values, drop duplicate rows and values, filter the data frame, ascertain its shape and size and, finally, how to change data types of and within your data frame.

What is Pandas and why should you use it?

By definition, Pandas is described as an open source library for Python which can be used for data analysis, cleaning, exploration and manipulation, especially when working with numerical tables and time series. Its main power comes from the fact that it is able to perform all the 'major steps' needed to process and analyse data:

  • Load
  • Manipulate
  • Prepare
  • Model
  • Analyse

It does this by supporting the import of various file types such as CSV, XLSX, XML, JSON and ZIP to name a few, as well as allowing you to perform data manipulations such as group by, join, merge and melt, and data cleaning operations such as replacements or imputation. This combination is where the power of pandas lies: in the ease with which you can perform these operations on a Pandas data frame compared to performing them in 'base' Python. I hope to show this in the later sections of this blog post, where I go through the functionality and commands of Pandas which I believe are worth knowing.
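
To give a flavour of how little code those operations take, here is a small sketch (the frames and column names are made up purely for illustration):

import pandas as pd

# Made-up data purely for illustration
sales = pd.DataFrame({'Country': ['England', 'England', 'Scotland'], 'Amount': [4, 3.5, 15]})
regions = pd.DataFrame({'Country': ['England', 'Scotland'], 'Region': ['South', 'North']})

# A group by and a join, each in a single line
print(sales.groupby('Country')['Amount'].sum())
print(sales.merge(regions, on='Country', how='left'))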

The cherry on top of the cake that is Pandas is the streamlined data representation that Pandas allows with its tabular output, which enables users to examine and interpret their data in a more effective manner. An example of this can be seen below through the use of a csv file containing a random dataset on children’s behaviour and whether they got on Santa’s good list:

Non-pandas output
Pandas data frame output

The code below, which output the tables above, shows how much easier it is to import and read files programmatically in pandas (non-pandas code first, then pandas code second):

import csv

# Reading and printing the file with the standard library
with open('Naughty list.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

import pandas as pd

# Reading and printing the same file with pandas
df = pd.read_csv(r'Naughty list.csv')
print(df)

To summarize this section, you should be using pandas as it allows you to do a lot more with less code, and does so with an output which is easy to read and would be familiar to people from many backgrounds.

Important commands and functionality

Now that I've explained why you should be using Pandas, I'll go into what I think are the fundamentals of using it. It should be noted that in all code examples I presume you have already imported Pandas as pd, like below, in whatever Python environment you are using:

import pandas as pd

How to set up a Pandas data frame (and then append)

There are three ways to do this, depending on whether you are importing the data from external sources, using a data source already in your environment, or adding to the pandas data frame via a process (a for loop, for example). The first of these methods can be completed in one line (changing the function and file name for the file and file type you are using):

df = pd.read_csv(r'whatever_your_file_is.csv')

This is then usable and readable using such functions as print() and .head().
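
For example, assuming the df created by the read_csv call above:

print(df)         # prints the whole data frame
print(df.head())  # prints just the first five rows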

If you already have pre-existing data that you wish to convert to a pandas data frame then there is slightly more manipulation needed, as well as the need to create a Python dictionary (if you're not familiar with these then check out this quick guide: https://www.w3schools.com/python/python_dictionaries.asp):

Data = {'Column1' : SourceData[:,0], 'Column2' : SourceData[:,1], 'Column3' : SourceData[:,2]}

PandaData = pd.DataFrame(Data)

In the above we set up a dictionary stating the column names we will be using in the data frame and allocating the relevant data from your pre-existing source to each column header. This dictionary is then passed to the pandas DataFrame constructor to create the data frame.
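
As a concrete, runnable sketch of the same idea (here SourceData is just a made-up NumPy array standing in for whatever data you already have in your environment):

import numpy as np
import pandas as pd

# Stand-in for your pre-existing data
SourceData = np.array([[1, 'cup', 4.0],
                       [2, 'ball', 3.5]], dtype=object)

Data = {'Column1': SourceData[:, 0], 'Column2': SourceData[:, 1], 'Column3': SourceData[:, 2]}
PandaData = pd.DataFrame(Data)
print(PandaData)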

Finally, if you want to set up a blank pandas data frame you can do that using the example below:

PandaData = pd.DataFrame(columns=['Column1', 'Column2', 'Column3', 'Column4'])

This command will set up an empty data frame with the column headings you specify, to which you can then append data row by row or in bulk. I will explain how to do this in the next section.

How to append to a Pandas data frame

So you've got your Pandas data frame set up, but you want to add data to it. The process is the same whether you want to add just one row or many (as long as all the columns you are adding have the same number of rows, and you have a value for each column). Once again we will use a dictionary to collate the data and then append it to the data frame like below:

New_Row = {'Column1':Data,'Column2':Data,'Column3':Data}

PandaData = PandaData.append(New_Row, ignore_index=True)

You should note the 'ignore_index' parameter being set to 'True' in the above code. This ensures that you have a continuous index in your data frame; if it were turned off, then when appending data you could end up with a non-continuous index, as the new rows would carry over the index they had in their source location. There may be some situations where you want that, though, for example if you wanted to keep a record of where in the source data your rows came from.
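
One thing worth flagging: .append() was deprecated and then removed in newer pandas releases (2.0 onwards), where pd.concat does the same job. A sketch of the equivalent call, using made-up values:

import pandas as pd

PandaData = pd.DataFrame(columns=['Column1', 'Column2', 'Column3'])
New_Row = {'Column1': 'England', 'Column2': 'cup', 'Column3': 4.0}  # made-up values

# pd.concat replaces .append() on pandas 2.0+; ignore_index keeps the index continuous
PandaData = pd.concat([PandaData, pd.DataFrame([New_Row])], ignore_index=True)
print(PandaData)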

How to index a Pandas data frame

To bring back whole columns from a Pandas data frame you can do so in three separate ways. The first two are very similar and use the column names you have set up within your data frame; the third is how you would index individual cells and rows, using the more recognizable bracket-style indexing. The first two methods are:

PandaData.Column1

PandaData['Column1']

The above would bring back all the data within the specified column. The third option is pandas' version of bracket indexing, which is performed using the .iloc indexer (integer-based location). It can be used as follows:

PandaData.iloc[0,1]

The above will return the value located in the first row of the second column of the data frame, which highlights two important points about how the indexer works. The first is that the first number in the square brackets denotes the row location and the second number the column location; the second is that index numbering starts from 0, so rows 1, 2 and 3 are denoted as 0, 1 and 2, for example. Further uses of the command can bring you all the data in a column or row:

PandaData.iloc[:,1] will bring everything in the second column

PandaData.iloc[0,:] will bring everything in the first row

You can also combine all three methods to bring out certain elements of a column:

PandaData.Column1.iloc[1]

PandaData['Column1'].iloc[1]

Both of the above will bring you the 2nd element of ‘Column1’.

Getting unique variables and dropping duplicates

.unique() and .drop_duplicates() are two fairly useful functions for data cleaning and exploration. They do similar but distinctly different things, returning you an array of unique values or a data frame of unique rows. The .unique() function takes a one-dimensional set of values, be it a list, an index, a categorical variable or a series, and outputs a numpy array of the unique values (it will not sort them though). For example:

pd.unique(list('asdasda'))

Would return:

array(['a', 's', 'd'])

This function should be used for getting unique values from columns or rows within your pandas data frames, but it can only be used on one-dimensional subsections of data.
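
For example, applied to a single column of a data frame (a made-up frame for illustration):

import pandas as pd

df = pd.DataFrame({'Column1': ['England', 'England', 'Scotland']})

print(pd.unique(df['Column1']))  # array(['England', 'Scotland'], dtype=object)
print(df['Column1'].unique())    # the Series method gives the same result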

If you want to get all unique rows in a data frame then you can use .drop_duplicates(). This function takes an input of a data frame and outputs another data frame with all duplicate rows removed. For this example we will need to add some data into our data frame first:

PandaData = pd.DataFrame({
    'Column1': ['England', 'England', 'England', 'Scotland', 'Ireland'],
    'Column2': ['cup', 'cup', 'ball', 'square', 'bowl'],
    'Column3': [4, 4, 3.5, 15, 5]})

This will give the data frame as follows:

   Column1  Column2 Column3
0  England  cup     4.0
1  England  cup     4.0
2  England  ball    3.5
3  Scotland square  15.0
4  Ireland  bowl    5.0

You can then use the following code to remove the duplicate rows:

PandaData.drop_duplicates()
   Column1  Column2 Column3
0  England  cup     4.0
2  England  ball    3.5
3  Scotland square  15.0
4  Ireland  bowl    5.0

This will give you a list of unique whole rows, so you may still have duplicates within columns.

You can apply this to certain columns within a data frame as well. The output will still be a data frame, but rows will be removed to leave only the first instance of each value within the designated column.

These can then be dropped using the following code:

PandaData.drop_duplicates(subset=['Column1'])

Which gives the following table:

   Column1  Column2 Column3
0  England  cup     4.0
3  Scotland square  15.0
4  Ireland  bowl    5.0
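
As a side note before moving on: the filtering examples in the next section use the table without the duplicate row and with the index reset, which is why Scotland appears there at index 2. One way to get to that state (a minimal sketch) is:

PandaData = PandaData.drop_duplicates().reset_index(drop=True)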

Filtering a data frame

Filtering a dataset to get what you want is somewhat of an essential skill in data exploration and analysis, and something which is simplified with a Pandas data frame. I will firstly show the simplest form of filtering, which uses your basic comparison operators (<, >, == and != for example) to perform the filtering. Let's take the table we used above (without the duplicate row and with the index reset, as sketched above) and filter it to just bring back rows which involve Scotland:

PandaData[PandaData.Column1 == 'Scotland']

Note the use of '==' as opposed to '='. Those familiar with Python will already know the reasoning: '==' denotes a check of equality, whereas '=' is used for assigning values to variables, data frames and so on.

   Column1  Column2 Column3
2  Scotland square  15.0

You can add multiple conditions applying to different columns or the same column, but each condition needs to be in round brackets, with an '&' symbol between the conditions:

PandaData[(PandaData.Column1 == 'England') & (PandaData.Column3 < 3.9)]

This will bring back the below:

   Column1  Column2 Column3
1  England  ball    3.5

If you are filtering on the same column, you can also consolidate multiple conditions into one by using the isin() function and supplying a list of values, as can be seen below:

PandaData[PandaData.Column1.isin(['Scotland','Ireland'])]

This gives the below output:

   Column1  Column2 Column3
2  Scotland square  15.0
3  Ireland  bowl    5.0

Getting the size of a Pandas data frame

Sometimes it is useful to know the size of a data frame you have, whether for validation purposes or for analytic work where your data frame changes size dynamically through your scripts and its size is needed elsewhere in the script. Pandas has two attributes which give you the 'size' and 'shape' of your data frame. Both are used in the same way, and will be used here on the same PandaData data frame (without the duplicate row) that I have been using above:

PandaData.size

OR

PandaData.shape

The outputs of the above will give us '12' for size and '(4, 3)' for shape, which shows the distinction between what the two attributes do. 'size' gives you the total number of values in the data frame (rows * columns), while 'shape' gives the dimensions of the data frame (number of rows, number of columns). A further extension of the shape attribute, which I personally find useful (for example in declaring the limits of a range for a 'for' loop), is indexing its output:

PandaData.shape[0]

This would output the number of rows in your data frame, while using a 1 instead of a 0 gives the number of columns; anything higher than that will cause an 'index out of range' error.
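
As a quick illustration of the for-loop use case mentioned above (using the PandaData frame from earlier):

for i in range(PandaData.shape[0]):  # loop over every row
    print(PandaData.iloc[i, 0])      # print the first column's value for that row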

Changing data types

It is sometimes useful to change the data types of either all the data in your data frame or just certain columns. An example of a situation requiring this occurred to me recently, when I needed to make a unique data key out of two columns containing year and week number. These two columns were integers, and so could not be concatenated as required. In these instances you can make use of the astype() command:

PandaData = PandaData.astype({'Year':'string_', 'WeekNumber':'string_' })

Once again the input takes the form of a dictionary where you need to specify the data type you want, and you also need to save the output to a variable; most commonly you would save it to itself like I have above.
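
As a sketch of the year/week example that motivated this (the column names and values here are made up):

import pandas as pd

PandaData = pd.DataFrame({'Year': [2022, 2022], 'WeekNumber': [7, 8]})

# Cast both integer columns to strings so they can be concatenated into a key
PandaData = PandaData.astype({'Year': str, 'WeekNumber': str})
PandaData['YearWeek'] = PandaData['Year'] + '-' + PandaData['WeekNumber']
print(PandaData)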

For reference, this website shows the different datatypes available quite well: https://bit.ly/3GoELAo

As I stated in the first paragraph you can also apply the function to the whole data frame like so:

PandaData = PandaData.astype('string_')

Summary

I hope that I have shown you why Pandas is a very useful library and one you should be using. Hopefully you find the content of this blog post useful, whether that be in starting out with Pandas or fixing a problem as an existing user. If you have any questions for me, then please put them in the comments section. My next blog post is set to come out on March 17th 2022, so if you have any ideas of what you'd like to see in regard to Python and Data Science knowledge then please put them in the comments as well.
