Big Data

  • What is Microsoft Fabric? (Power BI + Synapse + DW + DataLake + ML)

    At today’s Build conference, Microsoft announced Fabric. What is this? In simple terms, think of taking Synapse Analytics, Data Warehousing, Data Lakes, Data Factory, Spark Notebooks and Machine Learning, and bringing them all together into Power BI. This is underpinned by Microsoft OneLake, a high-performance, scalable data lake storage layer that supports all of the above. OneLake is, as the name implies, one data lake that can be used across…

  • PySpark Problems: Using map() gives the error “TypeError: unsupported operand type(s) for /: ‘Row’ and ‘float’”

    Here’s the first in what will be an ad hoc series of short blog posts in which I write a paragraph on the solution to a problem I have come across while using PySpark. In this post I discuss using the map() function to apply a function to every value in an RDD and then getting the error message: “TypeError: unsupported operand type(s) for /: ‘Row’ and ‘float’”. We see this error because we are…

  • Part 4: Natural Language Processing – Bringing it all together!

    Here’s the final post in this blog series on natural language processing, where we bring everything together: we web scrape Trustpilot for review data, perform natural language processing on it, and display the results in a Power BI dashboard. This demo is entirely practical, so for a refresher on the theory please refer back to my earlier blog posts (Part 1, Part 2, Part 3). To reiterate the goal…

  • Purple Frog at AI and Big Data Expo

    On the 1st and 2nd of December, Purple Frog descended on London Olympia for 2 days of AI, ML and Big Data-based fun at the AI & Big Data Expo. The AI & Big Data Expo is a leading Artificial Intelligence & Big Data Conference & Exhibition that showcases next-generation enterprise technologies and strategies from the world of Artificial Intelligence & Big Data, providing an opportunity to explore and discover the…

  • Part 3: Natural Language Processing – Sentiment Analysis and Opinion Mining

    If you remember, in Part 2 we discussed what Key Word Analysis is and how it can be implemented to gain deeper insight from textual data. But we can go one step deeper and extract feelings and opinions from the same data. We can do this through Sentiment Analysis and Opinion Mining! In this blog I will talk you through what they are and how we can implement them using Microsoft’s Cognitive Services. What is Sentiment Analysis? We should…

  • Part 2: Natural Language Processing – Key Word Analysis

    Here we are with Part 2 of this blog series on web scraping and natural language processing (NLP). In the first part I discussed what web scraping is, why it’s done and how it can be done. In this part I will explain what NLP is at a high level, and then go into detail on an application of NLP called key word analysis (KWA). What is NLP? NLP is a form of artificial intelligence which deals with the interactions between humans…

  • Pandas; Why Use It And How To Do So!

    Introduction: Hi and welcome to what will be my first frog blog! I’m Lewis Prince, a new addition to the Purple Frog team who has come on board as a Machine Learning Developer. My skill set lies mainly in Data Science and Statistics, and in using Python and R to apply them. My blogs will therefore mainly offer hints and tips on performing Data Science and Statistics in Python (and possibly R). I thought I would start…

  • Power BI Databricks Spark connection error

    When querying data from Azure Databricks (Spark) into Power BI you may encounter an error: “ODBC:ERROR [HY000] [Microsoft][Hardy] (100) The host and port specified for the connection do not seem to belong to a Spark server. Please check your configuration.” This is usually caused by trying to connect to a ‘Standard’ Databricks instance, but Power BI (and ODBC in general) can only connect to Databricks using a…

  • Sampling data in Data Lake U-SQL for Power BI

    Being able to hook Power BI directly into Azure Data Lake Storage (ADLS) is very powerful (and it will be even more so when you can link to ADLS files that are in a different Azure account, which is not yet available as of January 2017). However, there is a problem: Data Lake is designed to scale to petabytes of data, whereas Power BI has a 10GB limit. Yes, this is compressed, so we’d expect around 100GB of raw data; however, to load…

  • What is U-SQL?

    By now you may have heard about U-SQL, the new kid on the query language block. But what is U-SQL? Where is it? What’s it for? I was lucky enough to be at the 2015 MVP Summit in Redmond, at which one of the sessions was hosted by Michael Rys (@MikeDoesBigData), talking about U-SQL. As its creator, there’s no one better to learn the ropes from. I was pretty much blown away by what it can do and the ease of access, so I’ve…
