Natural Language Processing -
a brief history

23. March, 2022 | Alexander Polzin

Computers are insanely good at storing information in an orderly fashion. But in order to make this information usable for humans, appropriate interfaces are needed. Early in the history of computers, research began on one of the most convenient interfaces: The user asks the computer a question in natural language and the computer returns the answer in natural language.

In the following, examples illustrate different techniques that are or have been used to make the ‘talking computer’ a reality. Speech synthesizers and speech-to-text interfaces are not the subject of this text; they are taken as given.

Before Siri and Alexa, there was Eliza

Probably the most famous early conversational agent/chatbot is Eliza. Developed in the 1960s, Eliza relies on techniques that are still in use today: pattern matching, combinatorics and so-called Eliza scripts, which are essentially ontologies. Ontologies do not aim to structure information hierarchically; their goal is to create a network of related words and topics. Eliza recognized the user’s input via pattern matching, rebuilt it by recombination into a syntactically correct answer, and finally enriched it with additional words from this network. The result was little more than a clever parrot, but it was at least the first program seriously considered for the Turing Test. Ontologies and pattern matching are still used in NLP to this day, especially for performant searches; until recently, Google search was also based on ontologies.
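The following minimal Python sketch illustrates the pattern-matching-and-recombination idea. The rules and the small word network are invented for illustration and are not taken from Weizenbaum’s original Eliza scripts.

```python
import re

# Minimal ELIZA-style rules: match a pattern in the user's input and
# recombine the captured fragment into a templated response.
RULES = [
    (re.compile(r"i need (.*)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"i am (.*)", re.IGNORECASE), "How long have you been {0}?"),
]

# Tiny stand-in for the "network of related words": terms that can be
# swapped in to enrich the answer (invented for this example).
RELATED = {"sad": "unhappy", "tired": "exhausted"}

def respond(user_input: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            fragment = match.group(1)
            # Enrich the recombined fragment with words from the network.
            # (The original ELIZA also reflected pronouns, e.g. "my" -> "your".)
            fragment = " ".join(RELATED.get(w, w) for w in fragment.split())
            return template.format(fragment)
    return "Please tell me more."  # fallback when no pattern matches

print(respond("I am sad about my project"))
# -> How long have you been unhappy about my project?
```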

The AI community’s bitter lesson

In the years that followed, two branches of research developed within AI that sought to teach computers to ‘think’ and ‘speak’ in completely different ways. One branch worked with actual concepts of human thought: real grammar, physical rules, and so on. The other branch relied on general methods, statistics and, above all, the best and fastest computers available.

When a computer defeated a chess grandmaster for the first time in 1997, it was a computer that had no real understanding of chess. It did not even have a clue that it was playing a game. All the computer could do was pattern matching (in this case, the patterns of pieces on the chess board), statistics, and searching through an enormous number of positions extremely quickly to find the best transition from one pattern to the next, all in order to reach the ‘win condition’, checkmate, as fast as possible. At this point, at the latest, the AI community learned the ‘bitter lesson’: history keeps confirming to this day that generic methods, statistics and fast computers beat attempts to teach the computer actual knowledge.
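To make the ‘search through many patterns’ idea concrete, here is a plain minimax sketch in Python. The game interface (legal_moves, apply, score, is_terminal) is a hypothetical abstraction; Deep Blue’s actual evaluation and pruning were far more elaborate.

```python
# Minimal sketch: plain minimax search over an abstract game. Assumes
# `game.is_terminal` is True whenever a state has no legal moves, so the
# recursion always has something to compare.

def minimax(state, depth, maximizing, game):
    """Best achievable score from `state`, looking `depth` moves ahead."""
    if depth == 0 or game.is_terminal(state):
        return game.score(state)  # heuristic value of this board "pattern"
    scores = [
        minimax(game.apply(state, move), depth - 1, not maximizing, game)
        for move in game.legal_moves(state)
    ]
    return max(scores) if maximizing else min(scores)
```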

What does this mean for NLP applications?

On the one hand, decision-tree-based solutions have become established for conversational agents/chatbots. At BIG PICTURE, we use Google’s Dialogflow, among other tools, to implement chatbot projects for customers.

On the other hand: We need a bigger boat!

For a long time, the statistical approach looked promising for NLP problems as well, but the computing power was lacking to achieve results for human language as good as those that had already been achieved in image recognition, for example. Early experiments were nevertheless quite successful: computers were able to produce semantically plausible alliterative verse without knowing what alliteration is in the first place. For a very long time, however, statistical methods could not produce genuinely meaningful text. With the arrival of GPT-3 we have come much closer to this goal, and the underlying technologies are still generic methods and statistics.

But what distinguishes GPT-3 from its predecessors is a huge Microsoft data center and enormous amounts of data: GPT-3 was trained on a large part of the publicly available Internet. This enables it to produce texts that are virtually indistinguishable from texts actually written by humans. At its core, GPT-3 is a text-completion engine. It has tokenized its training data and completes user input by appending the most likely next token, then the most likely token after that, and so on. All of this is based on ‘simple’ statistics. You could say GPT-3 has read large parts of the Internet, knows which phrase most often follows the previous one, and is usually right.
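As a toy illustration of ‘append the most likely next token’, here is a bigram model in Python over an invented mini-corpus. GPT-3 of course uses a transformer trained on web-scale data and more sophisticated sampling, but the completion loop follows the same idea.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this example; split into whitespace "tokens".
corpus = "the cat sat on the mat . the dog sat on the mat .".split()

# Count, for every token, which token follows it and how often.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def complete(prompt: str, steps: int = 5) -> str:
    tokens = prompt.split()
    for _ in range(steps):
        last = tokens[-1]
        if last not in follows:
            break  # no known continuation for this token
        # Greedily pick the most frequent continuation -- 'simple' statistics.
        tokens.append(follows[last].most_common(1)[0][0])
    return " ".join(tokens)

print(complete("the cat"))  # -> the cat sat on the mat .
```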

But what does all this mean for our daily work on chatbots/conversational agents?

Decision trees, like Dialogflow’s, will remain the core of many chatbots/conversational agents for the foreseeable future. They are deterministic and far less error-prone than GPT-3, which makes them ideal for, e.g., voice control in a vehicle or for querying well-defined processes. However, models such as GPT-3 will be used more and more in the future. They are an ideal interface for operating instructions, product catalogues and similar material. The vast majority of such texts are structured hierarchically, meaning the actual content is ‘hidden’ behind a tree structure. That can make a search difficult, or even impossible, if the user does not know how to enter the tree. This is where GPT-3 can help in a useful way.

So in the future we will see many hybrid solutions: a deterministic entry via a decision tree that hands over to a fine-tuned GPT-3 model at dead ends.
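Here is a minimal sketch of such a hybrid, assuming a keyword-based tree (flattened to a lookup table for brevity) and a hypothetical query_finetuned_gpt3 helper standing in for whatever completion API the project actually uses.

```python
# Minimal sketch of a hybrid bot: a deterministic decision tree (flattened
# to a keyword lookup here) handles the known branches, and dead ends are
# handed to a fine-tuned language model.

KNOWN_BRANCHES = {
    "opening hours": "We are open Monday to Friday, 9:00 to 17:00.",
    "price list": "You can find our price list in the products section.",
}

def query_finetuned_gpt3(question: str) -> str:
    # Hypothetical placeholder: a real project would call its fine-tuned
    # completion model here instead of returning a stub string.
    return f"[model-generated answer to: {question}]"

def answer(question: str) -> str:
    # Deterministic entry: check the known branches first.
    for keyword, reply in KNOWN_BRANCHES.items():
        if keyword in question.lower():
            return reply
    # Dead end: fall back to the statistical model.
    return query_finetuned_gpt3(question)

print(answer("What are your opening hours?"))
print(answer("How do I reset the device to factory settings?"))
```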

The statements in this article reflect the views of the author and not those of any company, organization or institution.

Cover picture: Photo by Claudio Schwarz on Unsplash