Here we are with part 2 of this blog series on web scraping and natural language processing (NLP). In the first part I discussed what web scraping is, why it’s done and how it can be done. In this part I will explain what NLP is at a high level, and then go into detail on an application of NLP called key word analysis (KWA).
What is NLP?
NLP is a branch of artificial intelligence which deals with the interactions between humans and computers, especially how to get computers to ‘understand’ large amounts of ‘natural language’ data. A natural language is any language which has developed naturally, coming into being without conscious planning or intent. Examples include the Romance languages such as French, Spanish and Italian. It may seem that this is already quite a niche field of study, but it is quite diverse, with applications and outputs covering both the written and spoken forms of language. Common applications range from speech recognition, text to speech conversion and optical character recognition (recognising text in an image), to sentiment analysis (the emotional context of text) and word segmentation and its derivative, KWA.
What is KWA?
KWA is something we do multiple times each and every day without even realizing it. Every time you receive an email or text message and you skim the title and who sent it, maybe even peruse a few paragraphs, your brain is identifying the key words of the text to derive the key messages and context. This is what a computer is trying to do when we ask it to perform key word analysis: identify the important words and phrases to get the context of the text and extract the key messages. Microsoft give a good example in their documentation for the Language service in Cognitive Services (a suite of tools we will utilize later in this blog series), which uses the sentence “The food was delicious and the staff were wonderful”. Using KWA, the outputs would be “food” and “wonderful staff”.
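To make the idea a little more concrete, here is a deliberately naive sketch of key word extraction in Python. It simply drops common filler words and counts what is left. This is not how Microsoft's service works (that uses trained language models and can return phrases such as “wonderful staff”); the function name `naive_key_words` and the tiny stop-word list are my own assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "was", "and", "were", "a", "an", "is", "are", "of", "to", "in", "it"}


def naive_key_words(text, top_n=5):
    """Return the most frequent words in `text` that are not stop words."""
    # Lower-case the text and split it into simple alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Count only the words that are not in the stop-word list.
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


sentence = "The food was delicious and the staff were wonderful"
print(naive_key_words(sentence))
```

For the sample sentence this leaves the four content words (food, delicious, staff, wonderful). A real KWA service goes further by ranking these and grouping them into phrases.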
How can one perform KWA?
There are many open source packages to perform KWA in Python and R, but equally there are resources which make the process even easier. One of these is contained in Microsoft’s Cognitive Services, which is a suite of tools designed to put AI into the hands of developers and data scientists. The tools are divided into 5 separate groups:
- Speech – Contains resources such as Speech to Text and Speech Translation.
- Language – Contains resources such as Sentiment Analysis and Key Word Analysis.
- Vision – Contains resources such as Computer Vision and Face Recognition.
- Decision – Contains resources such as Anomaly Detection and Content Moderation.
- OpenAI Service – Is an open ended resource for the application of advanced coding and language models for a variety of uses.
For more info on any of these, there are summaries here.
To perform KWA we are only interested in the Language service. Microsoft provide their Language services via the Language Studio. If you follow the link provided, it will ask you to log in using your Azure credentials and also prompt you to provision a Language resource to use the services in the Language Studio. When asked what pricing tier you want, I would recommend the free option (F0), as this gives you more than enough headroom to test the service out in apps and so on for your own interest. See below a screenshot of what this looks like:
We can see that Language Studio is further split up into 5 sections:
- Extract Information – You can see from the screenshot what you can do there.
- Classify Text – Allows you to analyse sentiment and mine opinions, as well as detect the language of a text, for example.
- Understanding questions and Conversational Language – This mainly pertains to things associated with chat bots; such as answering questions and understanding what is being asked in the first place.
- Summarize Text – There is only one service in this section, and it pertains to extracting important or relevant information from documents.
- Translate Text – There is only one service in this section as well, and it does what it says on the tin: it translates text and speech into other languages.
As we are only interested in KWA in this blog post, we want to navigate to Extract Information and click the Extract Key Phrases tile, so we can try out the functionality of this resource.
You will be met by the top screen (I would recommend clicking the view documentation link at the top to see Microsoft's documentation). The first thing to know about the main section of this page is that it is an interface sitting on top of an API which provides Azure’s Extract Key Phrases service. Calling that API is how the service is used in practice (and how I will show you to use it in later blog posts in this series). The page lets you see what you can input to the service and the kind of outputs you can expect. You have three options for input: typing it in yourself, uploading a file or clicking one of the samples. Once you have given it an input, you need to tick that you are aware that you may incur a charge (this only applies if you selected the standard pricing tier) and then click run. I used sample one, which gave the below output:
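As a rough idea of what the interface sends to the API behind the scenes, here is a minimal sketch that builds a request body for key phrase extraction in Python. The field names (`kind`, `parameters`, `analysisInput`, `documents`) reflect my understanding of the Language REST API and should be treated as an assumption to check against Microsoft's documentation; the helper name `build_key_phrase_request` is purely illustrative. Nothing is sent over the network here.

```python
import json


def build_key_phrase_request(texts, language="en"):
    """Build a request body for key phrase extraction (assumed shape)."""
    # Each input text becomes one document with its own string id.
    documents = [
        {"id": str(index), "language": language, "text": text}
        for index, text in enumerate(texts, start=1)
    ]
    return {
        "kind": "KeyPhraseExtraction",  # assumed task name
        "parameters": {"modelVersion": "latest"},
        "analysisInput": {"documents": documents},
    }


body = build_key_phrase_request(
    ["The food was delicious and the staff were wonderful"]
)
print(json.dumps(body, indent=2))
```

In practice this body would be posted to your own Language resource's endpoint along with your key, which I will cover properly in later posts.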
You can see we get a pretty comprehensive list of key words from the sample statement. From a programming and functionality point of view, though, this list isn’t what we are interested in; it is in fact derived from the actual output of the service, which is JSON. If you click the JSON tab at the top of the results you will see this raw output, which is what we really want, as we can ingest it in whatever programming language we choose (we will use Python when I go through this in later posts).
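To show why the JSON is the useful part, here is a small sketch of ingesting it in Python using only the standard library. The sample below is hand-written for illustration and is not real output from the service; its layout (`results`, `documents`, `keyPhrases`) is my assumption of the response shape, so compare it with what you see in the JSON tab. The helper name `key_phrases_by_document` is hypothetical.

```python
import json

# Hand-written sample in the assumed response shape, not real service output.
SAMPLE_RESPONSE = """
{
  "kind": "KeyPhraseExtractionResults",
  "results": {
    "documents": [
      {"id": "1", "keyPhrases": ["food", "wonderful staff"], "warnings": []}
    ],
    "errors": [],
    "modelVersion": "latest"
  }
}
"""


def key_phrases_by_document(raw_json):
    """Map each document id to its list of key phrases."""
    payload = json.loads(raw_json)
    # Fall back to empty values if a level is missing from the payload.
    documents = payload.get("results", {}).get("documents", [])
    return {doc["id"]: doc.get("keyPhrases", []) for doc in documents}


print(key_phrases_by_document(SAMPLE_RESPONSE))
```

Once the key phrases are in a plain dictionary like this, they can be counted, stored or joined back to the scraped text they came from.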
In my next post I will discuss Sentiment Analysis and its derivative, Opinion Mining: what they are, how they do what they do, and how you can use them within the Azure Language Studio.