0845 643 64 63

Big Data

Pandas; Why Use It And How To Do So!


Hi and welcome to what will be my first frog blog! I’m Lewis Prince, a new addition to the Purple Frog team who has come on board as a Machine Learning Developer. My skill set resides mainly in Data Science and Statistics, and using Python and R to apply these. Therefore my blogs will be primarily on hints and tips on performing Data Science and Statistics through the medium of Python (and possibly R).

I thought I would start my blog with what I feel is a very important library for anyone who wants to perform any data manipulation and/or analysis with large datasets in Python; Pandas! In this blog post I will firstly go over what Pandas is and why you should you be using it, as well as introducing and showing you how to use some of the most useful commands and features of the library.

For reference I will show you how to create a pandas data frame, add data to it (append), index it, get unique values, drop duplicate rows and values, filter the data frame, ascertain it’s shape and size and finally how to change data types of and within your data frame.

What is Pandas and why should you use it?

By definition Pandas is described as a open source library for Python which can be used for data analysis, cleaning, exploring and manipulation, especially when manipulating numerical tables and time series. Its main power comes from the fact that it is able to perform all the ‘major steps’ when wanting to process and analyse data:

  • Load
  • Manipulate
  • Prepare
  • Model
  • Analyse

It does these via supporting the import of various file types such as CSV, XLSX, XML, JSON and ZIP to name a few, as well as allowing you to perform data manipulations such as group by, join, merge, melt and data cleaning operations such as replacements or imputation. This combination is where the power of pandas lies; in the ease of which you can perform these operations when having your data in a Pandas data frame when compared to performing them in ‘base’ Python. I hope to show this in the later sections of this blog post where I go through functionality and commands of Pandas which I believe are worth knowing.

The cherry on top of the cake that is Pandas is the streamlined data representation that Pandas allows with its tabular output, which enables users to examine and interpret their data in a more effective manner. An example of this can be seen below through the use of a csv file containing a random dataset on children’s behaviour and whether they got on Santa’s good list:

Non panda data frame
Panda data frame

The code below which outputted the tables above shows how much easier it is to import and read files programmatically in pandas (non panda code first and then panda code second):

Import csv
with open('Naughty list.csv,'r') as file:
     reader = csv.reader(file)
     for row in reader:
Import pandas as pd
df = pd.read_csv (r'Naughty list.csv')
print (df)

To summarize this section, you should be using pandas as it allows you to do a lot more with less code, and does so with an output which is easy to read and would be familiar to people from many backgrounds.

Important commands and functionality

Now I’ve explained why you should be using Pandas, I’ll now go into what I think are the fundamentals in using Pandas. It should be noted that in all code examples I will have presumed that you have already imported Pandas as pd like below in whatever python environment you are using:

Import pandas as pd

How to set up a Pandas data frame (and then append)

There are three ways to do this depending on whether you are importing the data from external sources, using a data source already in your environment or whether you will be adding to the panda data frame via process (a for loop for example). The first of these methods can be completed within one line (changing the function and naming convention for the file and file type you are using):

df = pd.read_csv(r'whatever_your_file_is.csv')

This is then usable and readable using such functions as print() and .head().

If you already have pre-existing data that you wish to convert to a python data frame then there is slightly more manipulation needed as well as the need to create a python dictionary (if you’re not familiar with these then check out this quick guide https://www.w3schools.com/python/python_dictionaries.asp):

Data = {'Column1' : SourceData[:,0], 'Column2' : SourceData[:,1], 'Column3' : SourceData[:,2]}

PandaData = pd.Dataframe(Data)

In the above we have set up a dictionary where we state the column names that we will be using in the data frame and then allocating the relevant data from the pre-existing data you have to these column headers. This is then used with the pandas data frame command to create the panda data frame.

Finally, if you want to set up a blank pandas data frame you can do that using the example below:

PandaData = pd.DataFrame(columns =[' Column1', ' Column2', ' Column3',' Column4'])

This command will set up an empty data frame with the column headings you specify; of which you can then append any data to row by row or in bulk. I will explain how to do this in the next section.

How to append to a Pandas data frame

So you’ve got your Pandas data frame set up, but you want to add data to it. The process is the same whether you want to add just one row or many (as long as all the columns you are adding have the same amount of rows, and you have a row for each column) . Once again we will use a dictionary to collate the data and then append to the data frame like below:

New_Row = {'Column1':Data,'Column2':Data,'Column3':Data}

PandaData = PandaDate.append(New_Row,ignore_index = True

You should note setting the ‘ignore_index’ parameter to ‘True’ in the above code.  This ensures that you have a continuous index in your data frame, if this was turned off then when appending data you could have a non continuous index as the new rows would carry over the index they had from its source location. There may be some situations where you may want to do this though, for example if you wanted to keep a record of where in the source data your data came from.

How to index a Pandas data frame

To bring back whole columns from a Pandas data frame you can do so in three separate ways, the first two are very similar and the third is how you would index individual cells and rows in the more recognizable bracket indexing. The first two methods are very similar and use the column names you have set up within your Pandas data frame:



The above would bring back all data within the specified column. The third option is pandas version of bracket indexing which has to be performed using the .iloc command (integer based location), it can be used as follows:


The above will return you the value located in the first row of the second column of the data frame, which highlights two important points of how the function works. The first point being that the first number in the square brackets denotes the row location and the second number the column location, and the second point is that index numbering starts from 0. So row 1, 2 and 3 would be denoted as 0, 1 and 2 for example. Further uses of the command can bring you all the data in a column or row:

PandaData.iloc[:,1] will bring everything in the second column

PandaData.iloc[0,:] will bring everything in the first row

You can also combine all three methods to bring out certain elements of a column:



Both of the above will bring you the 2nd element of ‘Column1’.

Getting unique variables and dropping duplicates

.Unique() and .drop_duplicates() are two fairly useful functions for data cleaning and exploration. They do similar but distinctly different things in returning you a list or an array of unique values or rows. The .Unique() function takes a one dimensional set of values, be it a list of indexes, categorical variables or a series, and outputs a numpy array of the same type (it will not sort them though). For example:


Would return:


This function should be used for getting unique values from columns/rows within your panda data frames, but can only be used on subsections of data.

If you want to get all unique rows in a data frame then you can use .drop_duplicates(). This function takes an input of a data frame and outputs another data frame with all duplicate rows removed. For this example we will need to add some data into our data frame first:

PandaData = pd.DataFrame({
    'Column1': ['England','England','England', 'Scotland', 'Ireland'],
    'Column2': ['cup','cup', 'ball', 'square', 'bowl'],
    'Column3': [4, 4, 4, 3.5, 15, 5]})

This will give the data frame as follows:

   Column1  Column2 Column3
0  England  cup     4.0
1  England  cup     4.0
2  England  ball    3.5
3  Scotland square  15.0
4  Ireland  bowl    5.0

You can then use the following code to remove duplicate rows

   Column1  Column2 Column3
0  England  cup     4.0
2  England  ball    3.5
3  Scotland square  15.0
4  Ireland  bowl    5.0

This will give you a list of unique whole rows, so you may still have duplicates within columns.

You can apply this to certain rows within a data frame as well. The output of this will still be a data frame, but rows will be removed to leave only the first instance of each variable within the designated column.

This is can then be dropped using the following code:


Which gives the following table:

   Column1  Column2 Column3
0  England  cup     4.0
2  Scotland square  15.0
3  Ireland  bowl    5.0

Filtering a data frame

Filtering a dataset to get what you want is somewhat of an essential skill in data exploration and analysis, and something which is simplified in the Pandas data frame. I will firstly show the simplest form of filtering which uses your basic programmatic conditioning symbols (+, -, <, >, ==, != and LIKE for example) to perform filtering. Let’s take the table we used above (without the duplicate row) and filter that to just bring back rows which involve Scotland:

PandaData[PandaData.Column1 == 'Scotland']

Note the use of ‘==’ as opposed to ‘=’, for those familiar with python you will already know the reasoning for this denoting a check of equality, as ‘=’ is used for saving variables to some form of data frame etc.

   Column1  Column2 Column3
2  Scotland square  15.0

You can add multiple conditions to this applying to different or the same columns, but each condition needs to be in circular brackets with an ‘&’ symbol between each condition if using more than one condition:

PandaData[(PandaData.Column1 == 'England') & (PandaData.Column3 < 3.9)]

This will bring back the below:

   Column1  Column2 Column3
1  England  ball    3.5

If you are filtering on the same column you can also consolidate your conditions into one condition as opposed to multiple via using the isin() function and supplying a list of conditions , which can be seen below:


This gives the below output:

   Column1  Column2 Column3
2  Scotland square  15.0
3  Ireland  bowl    5.0

Getting the size of a Pandas data frame

Sometimes it is useful to know the size of a data frame you have, whether this be for validation purposes or for analytic work where your data frame changes size dynamically through your scripts and size is needed to be known elsewhere in the script. Pandas has its own set of two commands which give you the ‘size’ and ‘shape’ of your data frame. Both are used in the same way though and will be used on the same random PandaData data frame I have been using above:




The outputs of the above commands will give us ’12’ for the size command and ‘(3,4)’ for the shape command. This shows the distinction between what the two commands do. ‘Size’ will give you the total number of values within the data frame (rows * columns), while ‘shape’ will give the dimensions of the data frame (number of rows and number of columns). A further extension of the shape() command, that I personally find useful (for example in declaring limits of a range for a ‘for’ loop), is indexing its output:


This would output the number of rows in your data frame, while using a 1 instead of a 0 will give the number of columns; anything more than that will cause a subset out of range error.

Changing data types

It is sometimes useful to change the data types of either all the data in your data frame or just certain columns. An example of a situation requiring this to be done this occurred to me recently when I need to make a unique data key out of two columns containing year and week number. These two columns were ‘integers’, and so could not be concatenated as required. In these instances you can make use of the as.type() command:

PandaData = PandaData.astype({'Year':'string_', 'WeekNumber':'string_' })

Once again the input takes the form of a dictionary where you need to specify the data type you want, and you also need to save the output to a variable; most commonly you would save it to itself like I have above.

For reference, this website shows the different datatypes available quite well: https://bit.ly/3GoELAo

As I stated in the first paragraph you can also apply the function to the whole data frame like so:

PandaData = PandaData.astype('string_')


I hope that I have shown you why Pandas is a very useful library and one you should be using. Hopefully, you find the content of this blog post useful, whether that be in starting out with Pandas, or a fixing a problem as a common user. If you have any questions for me, then please put them in the comments section. My next blog post is set to come out March 17th 2022, so if you have any ideas of what you’d like to see in regard to Python and Data Science knowledge then please put them in the comments as well.

Power BI Databricks Spark connection error

When querying data from Azure Databricks (spark) into Power BI you may encounter an error: “ODBC:ERROR [HY000] [Microsoft][Hardy] (100) The host and port specified for the connection do not seem to belong to a Spark server. Please check your configuration.

This is usually caused by trying to connect to a ‘Standard’ Databricks instance, but Power BI (and ODBC in general) can only connect to Databricks using a ‘Premium’ pricing tier. You will need to upgrade this to be able to access Databricks/Spark from Power BI.

Another common and similar error when Power BI is querying data from Azure Databricks (spark) is: “ODBC:ERROR [HY000] [Microsoft][DriverSupport] (1170) Unexpected response received from the server. Please ensure the server host and port specified for the connection are correct.

The most likely cause of this error is an invalid server path, as you have to modify the path that Databricks gives you before using it.

In Databricks, open the cluster, and in Advanced Options click on the JDBC/ODBC tab, and copy the JDBC URL. It will look something like this:


All of the sections in red need removing or changing, so it ends up like this:


The username needs to be “token”, and the password needs to be a token, generated from the Databricks portal. Click on the user/silhouette icon on the top right of the screen, User Settings, Generate New Token.

Also when using Databricks, watch out for the type of connection you choose in Power BI, it should be ‘Spark’ not ‘Azure HDInsight Spark’.

Hope this helps!


Sampling data in Data Lake U-SQL for Power BI

Being able to hook Power BI directly into Azure Data Lake Storage (ADLS) is a very powerful tool (and it will be even more so when you can link to ADLS files that are in a different Azure account!! – not yet available as at January 2017). However there is a problem, Data Lake is designed to scale to petabytes of data whereas Power BI has a 10GB limit. Yes this is compressed, so we’d expect around 100GB of raw data, however to load this you need 100GB+ of RAM available on your PC, so it’s hard to actually reach the limit with a single dataset.

There’s obviously a disconnect in scalability here. In some datasets we can just use U-SQL to aggregate the data and pre-summarise by the list of fields that we actually want to analyse, and this is fine for additive transactional data. However if we need a many to many link or the granular details of individual rows of data then there’s an issue, how to we get this data into Power BI?

The answer is sampling, we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, it depends how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain a level of accuracy of data in your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region, you’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.

Option 2: Take a random % sample
Now we’re talking. This option will take, for example 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?

— a) Use ROW_NUMBER() and Modulus
One option would be to include a ROW_NUMBER() windowing function in a U-SQL query that allocates each row a unique number.
We then apply a modulus function to the result, and only take those rows that return a 0
WHERE rn % 100 == 0;

This filters to only 1 in every 100 rows.

This method works in T-SQL, and just as well in U-SQL.


However, there is an easier way. U-SQL contains the ‘SAMPLE’ clause that automates this process. Thanks to Paul (T|B) for spotting this beauty.
SAMPLE Syntax:
SELECT xx FROM xx [SAMPLE [ANY (number of rows) | UNIFORM (percentage of rows)]]

There are two sampling options here, ANY and UNIFORM.

After not being able to find anything on the tinterwebs about them I ran some tests to see what they did and how well do these methods work compared to each other. The following code runs some U-SQL over a simple two column csv file containing an arbitrary id and a name. The 640MB file contains 400 names, each repeated a number of times to build 40m rows. Names were repeated using a normal frequency distribution pattern to make the data more representative of real world data.

To assess the output we can look at the distribution of the sampled data to see how closely it correlates to the distribution of the original dataset.

The U-SQL code looks like this:

DECLARE @INPUT_FILE string = @"/AWPF_Demo/Data/names.csv" ;   //40601850 ROWS
DECLARE @OUTPUT_FILE string = @"/AWPF_Demo/Data/names_results.csv" ;
@rawdata = 
        id INT,
        name string
    USING Extractors.Text();
//--Allocate each row a row number, and take 1 in every 100
@sequenceddata = 
    SELECT *
        , ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM @rawdata;
 @sampleddata1 = 
    SELECT id, name
    FROM @sequenceddata
    WHERE rn % 100 == 0;
 //--Use the SAMPLE ANY clause in U-SQL
 @sampleddata2 = 
    SELECT id, name
    FROM @rawdata
    SAMPLE ANY(406018); //--manually calculated as 1% of the input row count
 //--Use the SAMPLE UNIFORM clause in U-SQL
 @sampleddata3 = 
    SELECT id, name
    FROM @rawdata
    SAMPLE UNIFORM(0.01); //--1%
//--Find the name distribution of the original data, and sampled datasets
@outputbaseline = 
    SELECT name, COUNT(*) AS Qty
    FROM @rawdata
    GROUP BY name;
@outputdata1 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata1
    GROUP BY name;
@outputdata2 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata2
    GROUP BY name;
@outputdata3 =
    SELECT name, COUNT(*) AS Qty
    FROM @sampleddata3
    GROUP BY name;
//--Join all datasets together for analysis
@Output = 
    SELECT b.name
        , b.Qty AS QtyOrig
        , o1.Qty AS QtyMod
        , o2.Qty AS QtyANY
        , o3.Qty AS QtyUNIFORM
    FROM @outputbaseline AS b
        LEFT JOIN @outputdata1 AS o1 ON o1.name==b.name
        LEFT JOIN @outputdata2 AS o2 ON o2.name==b.name
        LEFT JOIN @outputdata3 AS o3 ON o3.name==b.name;
//--Output the data
OUTPUT @Output
USING Outputters.Text(quoting:TRUE);

So what happens when we take the resulting data and plot the sampled distributions against each other?

  • The Blue line is the number of times each name appears in the original dataset (on the right axis).
  • The Orange line is the distribution from the ROW_NUMBER() and Modulus.
  • The Yellow line is using U-SQL’s SAMPLE UNIFORM.
  • The Grey line is using U-SQL’s SAMPLE ANY.

As you can see, the SAMPLE ANY is a terrible option to maintain data accuracy. In effect it looks like it just takes the top x rows from the file and discards the rest, which I explained earlier is a bad idea.
However the ROW_NUMBER/Mod and the SAMPLE UNIFORM approaches are both staggeringly accurate to the original, with variances +/-2% for each name. This isn’t any good for exact numerical calculations (total sales £ for example), but for looking at trends over very large datasets this sampling approach is a good option.

So, should you use ROW_NUMBER/Mod or SAMPLE UNIFORM? Obviously SAMPLE UNIORM is simpler code, but how do they perform compared with each other?

  • The ROW_NUMBER/Mod approach, using the above dataset used a single vertex, with a total compute time of 29s, read 640MB and wrote 5KB.
  • The SAMPLE ANY approach used two vertices, with a combined compute time of 2s, read 34MB and wrote 5KB.
  • The SAMPLE UNIFORM approach used four vertices, with a combined compute time of 26s, read 766MB and wrote 5KB.

So the SAMPLE ANY, although poor for data consistency allows a much faster execution by only reading a small section of the data.
The ROW_NUMBER/Mod and SAMPLE UNIFORM approaches are very comparable in terms of performance, so it wouldn’t surprise me if they were doing something similar under the hood. However out of simplicity I’d recommend the SAMPLE UNIFORM method.


What is U-SQL?

By now you may have heard about U-SQL, the new kid on the query language block. But what is U-SQL? Where is it? What’s it for?

I was lucky enough to be at the 2015 MVP Summit in Redmond, at which one of the sessions was hosted by Michael Rys (@MikeDoesBigData), talking about U-SQL. As it’s creator, there’s no-one better to learn the ropes from. I was pretty much blown away by what it can do and the ease of access, so I’ve spent the last couple of months playing around with it, and doing a few talks at conferences about it.

But there’s a lot of confusion over what it is, so here’s my high level summary of what it is and where it fits, so you can judge whether it’s relevant to you.

What Does U-SQL Look Like?

It’s a hybrid of T-SQL and C#. Think if it as writing a SQL query, but being able to embed the power of C# and the .Net framework.

So you could write a query that looks like this:

@CustomersClean =
     ,((CustomerGender=="M" || CustomerGender=="F") ? CustomerGender : "-") AS CustomerGender_Clean,
     ,MyUSQLApp.Udfs.ValidateDate(CustomerDoB) AS CustomerDoB_Clean
  FROM @Customers;

So the basic SELECT * FROM Table syntax is familiar to anyone used to T-SQL. But with the addition of some C# to clean up the Gender and DoB columns. For the DoB, we’re calling a user defined c# function (ValidateDate) that sits in an assembly behind the query.

How do I  use U-SQL?

Data Lake U-SQL is a part of Azure Data Lake. Data Lake is a core component in the Cortana Analytics Suite, Microsoft’s cloud/Azure data platform. Think of Data Lake Storage as a massively scaleable and high performance staging and permanent storage area for all of your structured, semi-structured and non-structured data, based on the Hadoop File System (HDFS). It can store both tables of data and files (csv, text, images, whatever you want to store).

Once your data is held in Data Lake Storage, you can use Data Lake Analytics to do cool stuff with it. This could include ‘Big Data’ processing and analytics, i.e. running large and complex parallel analytics queries over your data, whether that’s MB, GB, TB or even PB. Or it could be transforming and manipulating your data into different structures.

Either way, Data Lake Analytics uses U-SQL as it’s language. At the moment this is the only place that U-SQL exists, but given its power and flexibility I’m hoping it will also have a bright future outside of Azure Data Lake.

You can write and execute U-SQL directly in the Azure Portal, or for a much richer development environment you can install some plugins into Visual Studio 2013 or 2015. I recommend the latter.

Download the Azure SDK for .Net (2.7.2+) here

Download the Data Lake Tools for Visual Studio 2013 or 2015 here

So it’s a Big Data tool?

Yes, and no.

Yes it has the scalability, performance and functionality for Big Data, and this is what it’s designed to do. When you write a simple U-SQL script, the Data Lake Analytics engine takes the query and parallelises it, scaling it up to as many nodes as you want. So it’s scalable. Very scalable. It’s being used now for complex big data analytics jobs with TB of data, and can scale way beyond that.

But it’s more than this, I also see it as a data preparation and transformation tool. If you want an Azure based BI solution, maybe using Azure SQL DW, or even a smaller scale system using Azure SQL DB, then we don’t yet have a PAAS ETL tool in Azure. But now we do! This provides a very powerful and comprehensive platform for all your data transform needs.

What about Azure Data Factory (ADF)? – No, ADF an ETL tool, it’s an orchestration and coordination tool, and isn’t realistic as a data transformation tool.

What about ELT? – Yes that’s fair, to a point. But in my experience the ELT approach actually needs to be ETLT to work properly. The 2nd ‘T’ is taken care of inside the SQL engine, but a lot of source data has to be pre-processed before it can even be landed in a SQL Server table, U-SQL provides that power to us that T-SQL cannot provide.


I’m not going to go into any more detail here with how to use or write U-SQL, there will be more posts to come on that.

But I hope this provides some clarity on what it is and where it fits in our ever expanding data platform world.

Data Lake went into public preview in October 2015, so it’s ready and waiting for you to have a go!




Power BI Sentinel
The Frog Blog

Team Purple Frog specialise in designing and implementing Microsoft Data Analytics solutions, including Data Warehouses, Cubes, SQL Server, SSIS, ADF, SSAS, Power BI, MDX, DAX, Machine Learning and more.

This is a collection of thoughts, ramblings and ideas that we think would be useful to share.


Alex Whittles
Jeet Kainth
Jon Fletcher
Nick Edwards
Joe Billingham
Lewis Prince
Reiss McSporran
Microsoft Gold Partner

Data Platform MVP

Power BI Sentinel
Frog Blog Out