Looking For Nuance: How Text Analysis Works

Jeroen Reizevoort

Unstructured data in the form of reviews, social media messages, CRM records, call centre notes, and reactions to blog posts and articles contains valuable insights for business. As mere humans, we find it almost impossible to draw conclusions from such large volumes of data. Machines, on the other hand, can do this using text analysis.

How exactly does that work? Online reviews are often a great source of information before we commit to a major purchase. We don’t look only at the number of stars; we also read comments from other customers. Subconsciously, we’re looking for key concepts that are relevant to the purchase, such as quality, service, comfort, and appearance. In other words, we are analysing the unstructured texts written by others without realizing it.

Missing nuances

It’s a different matter for organizations. They can also gain a lot of insight from unstructured texts, but there are simply too many texts to analyse manually. This is why organizations limit themselves to computerized analysis of structured data (such as the number of stars mentioned above) and miss out on any nuance.

Grammar and semantics as a basis

Machines are now capable of bringing structure to unstructured text with the aid of text analysis. They do this in more or less the same way our brain does. Both man and machine use grammar and semantics as their support system. Grammar may not have been the favorite subject of many in elementary school, but basic grammar knowledge helps us to put structure in segments of text. Based on this structure (grammar) and our knowledge of the world and context (semantics), we are capable of finding the essence in unstructured text. This also applies to automated solutions.

In order to analyse text on a large scale, computers need to learn grammar and semantics. We call this computerized text analysis. Text analysis takes unstructured text, extracts the relevant information from it, and converts that information into structured data that is suitable for further analysis.

Structure is first created in sentences on the basis of the following steps:

  • Determine the language. Because languages apply different grammatical rules, it is essential to determine the language automatically.
  • Find the individual words. The text is divided into paragraphs, sentences, and words. This step also covers removing punctuation marks and splitting compounds.
  • Determine the base form each word is derived from: “We thought the rooms were spacious and clean.” Thought -> think; rooms -> room.
  • Analyse the sentences. “[We, SUBJECT] [thought, FINITE VERB] [the rooms, DIRECT OBJECT] [spacious and clean, COMPLEMENT].”
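
The first three steps above can be sketched in a few lines of Python. This is only a toy illustration, not how a production text analysis engine works: the stopword lists, the lemma table, and the function names are all hypothetical, and real systems rely on full dictionaries and trained language models. The grammatical analysis of the last step is left out because it needs a real parser.

```python
import re

# Hypothetical, tiny stopword lists used only to guess the language.
STOPWORDS = {
    "en": {"the", "and", "we", "were", "was", "is", "a"},
    "nl": {"de", "het", "en", "wij", "waren", "was", "is", "een"},
}

# Hypothetical lemma table; real systems use full morphological dictionaries.
LEMMAS = {"thought": "think", "rooms": "room", "were": "be"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def detect_language(text):
    """Pick the language whose stopword list overlaps most with the text."""
    words = set(tokenize(text))
    return max(STOPWORDS, key=lambda lang: len(STOPWORDS[lang] & words))


def lemmatize(tokens):
    """Map each token to its base form when the table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


sentence = "We thought the rooms were spacious and clean."
language = detect_language(sentence)  # step 1: determine the language
tokens = tokenize(sentence)           # step 2: find the individual words
lemmas = lemmatize(tokens)            # step 3: reduce words to base forms
print(language, lemmas)
```

For the example sentence, the sketch recognizes English and reduces “thought” to “think” and “rooms” to “room”, as in the steps above.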

Once the unstructured sentences are given structure, we can perform a sentiment analysis on the text. A hotel chain, for example, can deduce from the sentence, “We thought the rooms were spacious and clean” that this client had a positive opinion about the size and hygiene of the rooms. Had the client written, “We thought the rooms were very spacious and reasonably clean,” then the analysis would show that the client was very positive about the aspect “size of the room” and moderately positive about the aspect “hygiene.”
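
A minimal sketch of this kind of aspect-based scoring might look as follows. It assumes a hypothetical lexicon that maps opinion words to an aspect and a polarity, plus a hypothetical table of modifiers such as “very” and “reasonably”; the weights are made up for illustration and only a modifier directly in front of the opinion word is considered.

```python
import re

# Hypothetical lexicon: opinion word -> (aspect, polarity).
OPINION_WORDS = {
    "spacious": ("size", 1.0),
    "clean": ("hygiene", 1.0),
}

# Hypothetical modifier weights: strengthen or weaken the next opinion word.
MODIFIERS = {"very": 1.5, "reasonably": 0.5}


def score_aspects(text):
    """Return a score per aspect, scaled by a directly preceding modifier."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for i, token in enumerate(tokens):
        if token not in OPINION_WORDS:
            continue
        aspect, polarity = OPINION_WORDS[token]
        previous = tokens[i - 1] if i > 0 else ""
        scores[aspect] = polarity * MODIFIERS.get(previous, 1.0)
    return scores


plain = score_aspects("We thought the rooms were spacious and clean.")
nuanced = score_aspects(
    "We thought the rooms were very spacious and reasonably clean."
)
print(plain)
print(nuanced)
```

In the second sentence the score for size rises above the neutral weight of 1.0 and the score for hygiene drops below it, which mirrors the “very positive” versus “moderately positive” distinction described above.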

Context-dependent

Whether characteristics such as small, slow, etc. should be interpreted as positive or negative depends on the context in which they are used. The characteristic quick is positive for a fast-food restaurant, but likely negative for enthusiasts of more relaxed dining.

Words also have a different meaning depending on the context in which they are used. The expression “breaking the bank,” for example, doesn’t mean we are literally tearing down the bank.

We therefore often use domain-specific word lists for more effective text analysis. The use of these lists makes the solution more intelligent. Human experts add that intelligence (semantics) per domain.
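
As a rough sketch of how such lists could be applied, the snippet below looks up a word in a domain-specific list first and falls back to a general list. The domain names, the word lists, and the `polarity` function are hypothetical and serve only to illustrate the idea; in practice, human experts would maintain much larger lists.

```python
# Hypothetical general-purpose polarity list (1 = positive, -1 = negative).
GENERAL_POLARITY = {"clean": 1, "dirty": -1}

# Hypothetical domain-specific lists maintained by human experts.
DOMAIN_POLARITY = {
    "fast_food": {"quick": 1},
    "fine_dining": {"quick": -1},
}


def polarity(word, domain):
    """Prefer the domain-specific polarity, then the general one, else 0."""
    domain_list = DOMAIN_POLARITY.get(domain, {})
    return domain_list.get(word, GENERAL_POLARITY.get(word, 0))


print(polarity("quick", "fast_food"))    # domain list marks it positive
print(polarity("quick", "fine_dining"))  # domain list marks it negative
print(polarity("clean", "fine_dining"))  # falls back to the general list
```

Keeping the domain lists separate from the general list means a new domain can be added without touching the polarities that hold everywhere.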

Applications

So far, I have positioned text analysis for sentiment analysis, but this is only one possible application. Here are some other examples:

  • matching applicants to vacancies
  • automatically replying to customers’ emails
  • analysis and categorization of scientific articles (the number of scientific publications doubles every 9 years)
  • analysis of medical files (approximately 80% of the content of medical files consists of unstructured text).

There are even experimental websites that analyse your personality on the basis of your LinkedIn profile. It may only be a matter of time before dating sites add this service.

Wealth of information

Text analysis will help us use the wealth of information hidden in the growing mountain of unstructured text. Domain-specific word lists can be used to refine results. In this way, text analysis contributes to machine learning. Soon we will be able to stop worrying about how we work our way through mountains of text to find all nuances.

Good. That means we finally have time to manually read a good book.


About Jeroen Reizevoort

Jeroen has been working in IT for more than 25 years. In this period he has been active as a programmer, information analyst, project manager and, more recently, as pre-sales software architect. In this last role he has focused on integration, business process management, business rules management and mobile. Jeroen is currently working at SAP as pre-sales enterprise architect. He advises organizations on the definition of, and transition to, a modern, flexible architecture that is ready for the digital economy.