In this multi-part blog series I will go through what web scraping is and what Natural Language Processing (NLP) is as a general term, as well as diving into some of its constituent techniques we are interested in: keyword extraction, sentiment analysis and its derivative, opinion mining.
The last few parts will then go through a coded example of scraping the popular review site Trustpilot for reviews of the supermarket chain Lidl. We will then perform keyword extraction and sentiment analysis on these reviews before finally consuming the results in a Power BI report.
In part 1 I will be discussing what web scraping is, how it is done, and its common applications.
What is Web Scraping?
As you may be aware, the internet is awash with web pages, all of them holding data, making it a massive resource of information. The problem is that these websites exist in isolation, and it can be very time consuming to go through them copying and pasting information (which technically is web scraping, just on a much smaller scale than desired), before you even think about sorting and analysing that data. This is where web scraping comes in: 'scrapers' are pieces of software programmed to visit websites and collect all the relevant information we want in an automated process. This allows us to gain access to vast amounts of data in a small time frame.
How does it work?
It starts with what is known as an HTTP request, which happens in the background every time you visit a website. It is like ringing a doorbell and asking if you can come in: you are requesting entry to the site. Once approved, you have access to the site's content, which will be in HTML or XML format; this markup defines the structure and content of the page. The scraper then 'parses' the code, meaning it breaks it down into its parts, from which you can identify and extract what you are looking for. This is the bare-bones workflow of a scraper, but in truth it can be very tricky and convoluted depending on how the site structures its HTML and XML, as on occasion pages are made intentionally hard to scrape. There is also the grunt work of going through the code yourself to identify where the information you want lives, so you can then program the scraper to fetch it automatically; this is probably the most time-consuming part.
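To make the 'parsing' step concrete, here is a minimal sketch using only Python's standard library. The page content, the tag names and the `class="price"` attribute are all invented for illustration; a real page would be fetched with an HTTP request first, and you would have to inspect its source to find where the data you want actually sits.

```python
from html.parser import HTMLParser

# A tiny example page, as a scraper might receive it after an HTTP request.
PAGE = """
<html><body>
  <h1>Weekly Offers</h1>
  <ul>
    <li class="price">Bread - 0.89</li>
    <li class="price">Milk - 1.15</li>
  </ul>
</body></html>
"""

class PriceExtractor(HTMLParser):
    """Walks through the HTML and collects the text inside <li class="price"> tags."""
    def __init__(self):
        super().__init__()
        self.in_price = False   # are we currently inside a matching <li>?
        self.prices = []        # extracted text goes here

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())

parser = PriceExtractor()
parser.feed(PAGE)
print(parser.prices)  # ['Bread - 0.89', 'Milk - 1.15']
```

The same idea scales up: once you know which tags and attributes mark the data you want, the extraction itself is mechanical.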
While there are many off-the-shelf products that do web scraping for you, I will not be going into them in this post; I will simply make you aware that they are there, but they will not do as bespoke a job as writing the code yourself. To this end I will now go through packages you would commonly use to perform web scraping in Python and R. As I briefly alluded to above, web scraping is a two-stage process of getting access to data and then parsing that data, so we need two separate packages. For R, these are most commonly crul or httr for handling our HTTP connections to get to the data we want to scrape (GitHub links are there for the packages), and for the data parsing it would be best to use rvest (tidyverse page linked). In regard to Python, the two main packages (and the packages I will show in the demos) are urllib for handling the HTTP connections and BeautifulSoup for the data parsing. You will want to code a function around each of these packages to call upon when you write your web scraper; I will show you how to do this when I post the code demo.
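As a preview of that demo, here is a rough sketch of what those two wrapper functions might look like in Python, one around urllib and one around BeautifulSoup. The `review-text` class name is a made-up placeholder; on a real site you would inspect the page source to find the actual tag and class that hold the data you want.

```python
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

def fetch_html(url):
    """Stage 1: send an HTTP GET request and return the page's raw HTML."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # some sites reject requests with no user agent
    with urlopen(req) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_reviews(html):
    """Stage 2: parse the HTML and pull the text out of the elements we care about.
    'review-text' is a hypothetical class name used purely for illustration."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p", class_="review-text")]

# Demonstrate the parsing stage on an inline snippet rather than a live site:
sample = ('<html><body>'
          '<p class="review-text">Great store!</p>'
          '<p class="review-text">Long queues today.</p>'
          '</body></html>')
print(extract_reviews(sample))  # ['Great store!', 'Long queues today.']
```

In a real scraper you would chain the two stages together, `extract_reviews(fetch_html(url))`, and add polite touches such as delays between requests.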
As stated earlier, we web scrape to extract knowledge from the internet; that is the application. So it is better to think about what kind of data we are trying to get and how we will use it when we talk about common applications. I have noted some below:
- Price Monitoring – Companies use web scraping to get product data for both their own and competing products, in order to compare pricing and so on.
- News Monitoring – News sites can be scraped for certain information; this can be useful if you are a company that is frequently in the news, or you deal in stocks or other forms of international trade where world events matter to your business.
- Reviews Analysis – This is the application I will be showing you in this blog series. Companies like to know what people are saying about them and their products, but reading each individual review is time consuming and laborious, which is why scraping these sites and performing analysis on the data is a much preferred option; in fact, it allows for much deeper insights.
In my next blog post I will go into what Natural Language Processing is and what it can do for you.