Here’s the final post in this blog series on natural language processing, where we bring everything together: we will web scrape Trust Pilot for review data, perform Natural Language Processing on it, and then display the results in a Power BI dashboard. This demo is purely practical, so for a refresher on the theory please refer back to my earlier blog posts (Part 1, Part 2, Part 3). To reiterate, the goal is to scrape Trust Pilot for reviews of ‘Lidl’, extract key words from those reviews, along with opinions and sentiments for both the reviews and the opinions, and then present it all in a Power BI dashboard so you can interact with the data and digest it in as simple a way as possible.
Web Scraping Trust Pilot
As was said in Part 1, web scraping is a method of extracting raw data from online resources such as websites for your own uses. This gives us access to vast amounts of information and supplemental data which may not otherwise be available.
So, getting straight into it, we need to start coding our ‘scraper’. As I stated in Part 1, common Python packages for this are ‘urllib’ for handling the HTTP connections and ‘BeautifulSoup’ for parsing the data, so we will be using these. At the start of your code you will need to install and import the following:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
from urllib.error import HTTPError
import pandas as pd
import numpy as np
The code to initiate the HTTP request is as follows:
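Something along these lines (the exact Lidl review URL is an assumption here, so check it against the site before running):

# Build the URL for the first page of Lidl reviews (URL assumed for illustration)
url = "https://uk.trustpilot.com/review/www.lidl.co.uk"

# Open the connection and read the raw HTML back from Trust Pilot
html = urlopen(url)

# Pass the HTML to BeautifulSoup so we can query it
soup = BeautifulSoup(html.read(), "html.parser")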
So, we have performed our request, received our HTML and passed it to BeautifulSoup, our HTML parser. We can now look at the HTML to decide our next steps; you will find it in the variable ‘soup’. I’m not going to post the whole HTML here as it is massive, but the goal is to evaluate its structure so we can query it further for the information we want. Upon inspection we find that the reviews are held in JSON format within a section of the HTML, so we need to find the start tag of that section and then iterate from there to the reviews:
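A sketch of what that can look like is below; the id of the script tag and the path through the JSON are assumptions about how the page was structured when I scraped it, so adjust them to whatever you find in your ‘soup’:

# Lists to hold the fields we want from each review
titles, bodies, ratings, dates = [], [], [], []

# The reviews sit in a JSON blob inside a script tag; the tag id and the
# path through the JSON below are assumptions about the page structure
script = soup.find("script", {"id": "__NEXT_DATA__"})
data = json.loads(script.string)
reviews = data["props"]["pageProps"]["reviews"]

# Iterate through the reviews and pull the fields into well described variables
for review in reviews:
    titles.append(review["title"])
    bodies.append(review["text"])
    ratings.append(review["rating"])
    dates.append(review["dates"]["publishedDate"])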
Once we have isolated just the reviews we can iterate through them with a Python for loop, extracting the information we want into well-described variables (as can be seen in the for loop above). Note that at this point we only have the reviews from the first page of Lidl reviews on Trust Pilot, as each page has a different URL. To see this, click ‘2’ for the next page on the website and you will notice the URL change, with a page number now added. To get the rest of the reviews we take our earlier code and place it in a loop that iterates through page numbers. I have set it to go from page 2 to 1000, which is well above what is currently there, to futureproof the code: reviews (and therefore pages) will keep being added, but the code will break if we request a URL that doesn’t exist. That is why we wrap the request in a try block which catches the ‘HTTPError’ raised by an incorrect URL, breaks out of the loop and carries on with the rest of our code:
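Something like the following (again, the URL pattern and JSON path are assumptions to check against the live site):

# Loop through page 2 onwards; break out when a page does not exist
for page in range(2, 1000):
    try:
        url = "https://uk.trustpilot.com/review/www.lidl.co.uk?page=" + str(page)
        html = urlopen(url)
    except HTTPError:
        # The page number does not exist yet, so stop requesting further pages
        break

    # Parse the page and pull out the reviews exactly as before
    soup = BeautifulSoup(html.read(), "html.parser")
    script = soup.find("script", {"id": "__NEXT_DATA__"})
    data = json.loads(script.string)
    reviews = data["props"]["pageProps"]["reviews"]

    for review in reviews:
        titles.append(review["title"])
        bodies.append(review["text"])
        ratings.append(review["rating"])
        dates.append(review["dates"]["publishedDate"])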
The next (and final) steps are to produce a unique identifier for each row of data and combine this with all the variables we have created into a dataframe holding all our data, which we can save as a CSV and use for Natural Language Processing. We need the unique identifier so that when we create dataframes from the Natural Language Processing outputs we can link each result back to the review it originated from, which means we will be able to use all the data together to create some visuals in Power BI later on!
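A sketch of those final steps, with the file name and column names as placeholders you can swap for your own:

# Give each review a unique identifier so the NLP outputs can be joined back later
review_ids = list(range(len(titles)))

# Combine everything into a single dataframe and save it for the NLP steps
lidl_scrape = pd.DataFrame({
    "Review_ID": review_ids,
    "Title": titles,
    "Review": bodies,
    "Rating": ratings,
    "Date": dates,
})
lidl_scrape.to_csv("lidl_scrape.csv", index=False)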
Key Word Extraction
We now have a CSV with all our review data in it, so we can start applying Natural Language Processing techniques to it; we will start with Key Word Extraction. As stated in Part 2, Key Word Analysis takes sections of text and brings back just the words that are key to the meaning of the sentence. I am going to work on the assumption that you have already set up your Language Service endpoint in Azure Cognitive Services (this is straightforward to do from their website) and that you have your key and the endpoint URL to hand. To get started you need to install, and then import in your code, the latest versions of ‘pandas’ and ‘azure-ai-textanalytics’. We then need to authenticate ourselves with the endpoint to satisfy it that we have access, and create a function that calls our Language Service endpoint; Microsoft give an example of this on their website, but we need to adjust it slightly to our needs:
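A sketch of that authentication and key phrase function, based on the ‘azure-ai-textanalytics’ package (swap in your own key and endpoint):

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

key = "<your-language-service-key>"
endpoint = "<your-language-service-endpoint>"

# Authenticate against the Language Service endpoint
def authenticate_client():
    credential = AzureKeyCredential(key)
    return TextAnalyticsClient(endpoint=endpoint, credential=credential)

client = authenticate_client()

# Send a single piece of text to the endpoint and append the key phrases
# (with the review ID they came from) to the outside lists we pass in
def key_phrase_extraction(text, review_id, phrases, phrase_ids):
    try:
        response = client.extract_key_phrases(documents=[text])[0]
        if not response.is_error:
            for phrase in response.key_phrases:
                phrases.append(phrase)
                phrase_ids.append(review_id)
    except Exception as err:
        print("Encountered exception. {}".format(err))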
This does the authentication in the first section; then we have the function which calls our service, provides it with a piece of text to work with, and appends the key words it returns to an outside variable which we will have set up when we call this function. Once this is set up, import your CSV file (I would suggest using pandas.read_csv()) and save it to a variable. Then we can use the following code to go through the reviews and send them to the endpoint one by one. Note that I have combined the title of each review with its contents, because there may be important information in the title that is not contained in the body of the review:
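A sketch of that loop, assuming the column names used in the scraping section:

# Read the scraped reviews back in
lidl_scrape = pd.read_csv("lidl_scrape.csv")

# Outside variables the key phrase function appends to
phrases = []
phrase_ids = []

# Send each review to the endpoint, combining title and body so nothing is missed
for row in lidl_scrape.itertuples():
    combined_text = str(row.Title) + ". " + str(row.Review)
    key_phrase_extraction(combined_text, row.Review_ID, phrases, phrase_ids)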
This will give you a list of phrases as well as a list of corresponding IDs that match each phrase to its original review in the CSV we are using. You just need to put the phrases and IDs into a dataframe (use pandas.DataFrame()) and save it as a CSV (use ‘name of your dataframe’.to_csv()) with appropriate headers to consume in the Power BI dashboard later on.
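For example:

# Put the key phrases and their review IDs into a dataframe and save for Power BI
key_phrases_df = pd.DataFrame({"Review_ID": phrase_ids, "KeyPhrase": phrases})
key_phrases_df.to_csv("lidl_key_phrases.csv", index=False)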
Sentiment Analysis and Opinion Mining
Before starting this section I would suggest reading Part 3 again to understand the concepts we will be using and what exactly we will be generating. The process here is nearly identical to the one we used for Key Word Analysis: we need the same packages installed and imported, and the same key and endpoint, as we will be using the same Language Service endpoint. The big difference is that a lot more data is going to come back, so we need to prepare quite a few more variables to store it in:
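For example, something like the following lists, kept the same length with one entry per opinion found (the exact fields you keep are up to you):

# One entry per opinion (assessment) found in a review
opinion_ids = []            # review the opinion came from
overall_sentiments = []     # overall sentiment of the whole review
positive_scores = []        # review-level confidence scores
neutral_scores = []
negative_scores = []
subjects = []               # the thing being talked about (the target)
subject_sentiments = []     # sentiment towards that subject
assessments = []            # the word(s) used to assess the subject
assessment_sentiments = []  # sentiment of the assessment itself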
Then read the scraped data in again (or skip this if you are writing this in the same Python script). For the functions we can reuse all the code up to and including ‘client = authenticate_client()’ from the Key Word Analysis, as it is the same, but you will need the following function, which does the legwork and saves the data in the variables above:
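A sketch of that function, using the analyze_sentiment call from ‘azure-ai-textanalytics’ with opinion mining switched on; flattening the output into one row per assessment is my own choice rather than anything prescribed:

# Call the endpoint for one review and store the results in the lists above;
# this sketch produces one row per assessment (opinion) found in the review
def sentiment_and_opinions(text, review_id):
    try:
        response = client.analyze_sentiment(documents=[text], show_opinion_mining=True)[0]
        if response.is_error:
            return
        for sentence in response.sentences:
            for opinion in sentence.mined_opinions:
                for assessment in opinion.assessments:
                    opinion_ids.append(review_id)
                    overall_sentiments.append(response.sentiment)
                    positive_scores.append(response.confidence_scores.positive)
                    neutral_scores.append(response.confidence_scores.neutral)
                    negative_scores.append(response.confidence_scores.negative)
                    subjects.append(opinion.target.text)
                    subject_sentiments.append(opinion.target.sentiment)
                    assessments.append(assessment.text)
                    assessment_sentiments.append(assessment.sentiment)
    except Exception as err:
        print("Encountered exception. {}".format(err))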
Then call this function in a loop through your scraped data; a simpler loop this time, as the function does all the work for us in this instance:
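For example:

# Send each review (title plus body again) to the sentiment and opinion function
for row in lidl_scrape.itertuples():
    combined_text = str(row.Title) + ". " + str(row.Review)
    sentiment_and_opinions(combined_text, row.Review_ID)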
When this has run, collect all the variables into a dataframe and save it as a CSV to use in the Power BI dashboard, which we will go through in the next section!
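A sketch of that final step, with the column and file names as placeholders:

# Combine everything into one dataframe, one row per subject/assessment pair,
# and save it as the Opinion_Mine dataset for the Power BI dashboard
opinion_mine = pd.DataFrame({
    "Review_ID": opinion_ids,
    "Overall_Sentiment": overall_sentiments,
    "Positive_Score": positive_scores,
    "Neutral_Score": neutral_scores,
    "Negative_Score": negative_scores,
    "Subject": subjects,
    "Subject_Sentiment": subject_sentiments,
    "Assessment": assessments,
    "Assessment_Sentiment": assessment_sentiments,
})
opinion_mine.to_csv("lidl_opinion_mine.csv", index=False)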
Make the Power BI dashboard
We will assume that you have Power BI already downloaded and that you know the fundamentals. You need to import the three datasets we have created (Scrape, Keywords and Sentiment Analysis/Opinion Mining), and we would suggest the following data model:
You will note the following additions that will need to be made:
- Creation of the ‘Assessment_Map’ table. This is essentially a copy of the ‘Opinion_Mine’ table, without the fields we already have in the main ‘Lidl_Scrape’ table. In addition we have included counts of assessment sentiments, created using DAX like the following:
Mixed Count = CALCULATE(COUNTROWS(Assessment_Map), FILTER(Assessment_Map, Assessment_Map[Assessment_Sentiment] = "mixed"))
- Creation of the ‘Assessment’, ‘Rating’, ‘Subject’ and ‘KeyPhrases’ tables. These are lists of certain fields with duplicates removed (handy tip: convert everything to either upper or lower case before removing duplicates, as Power BI is case sensitive and will see ‘bad’ and ‘Bad’ as different words, for example). We will use these as slicers to filter the data.
- A TrustPilot Review Count field. Within ‘Lidl_Scrape’ we would suggest creating a count of the rows in that table, so that when it is used in visuals you can count the reviews that match a given set of criteria.
From all this we then set up our dashboard as above. There is a lot you could put here, but I feel we have been able to show a lot of insight into the reviews with what we have, and being able to filter by multiple parameters and fields really lets you dig deep into them. The beauty here is that you can do this for any entity with a Trust Pilot account, as the HTML and JSON formatting will be the same to scrape and the fields will be the same at a basic level for every review. In that regard we have built a nearly universal solution for Trust Pilot. I say only nearly because some entities also have a location in their URL, with each location having its own Trust Pilot page, so you would have to cycle through a list of these to get all the reviews for that entity. Make sure you check for this before trying this solution on other entities.