Transformers - Rise of the New Beasts: A 100% subjective Test of the systems AI21, ALEPH ALPHA, MUSE versus GPT-3

November 28, 2022 | Maximilian Vogel

In the last few months, some promising language models have gone live to challenge GPT-3. I tested a few of these platforms and threw data at them, always with questions in mind like: Can other models compete with GPT-3? Or are they even better?

All platforms bring special features that are sometimes difficult to compare: for example, the ability to generate and understand program code, SQL-like table processing, or coverage of specific languages (the Muse language model is very strong in French, for example). To make a fair comparison, I did some basic tests in the area of general knowledge, as well as the ability to draw logical conclusions. The testing language was English.

Short intro of the platforms:

Muse is a language model from Paris-based startup LightOn, which has received close to $4 million in funding. Muse has a focus on European languages.

Showcase Muse: Responding to restaurant reviews with emojis (User: regular text; Model: bold text)

Aleph Alpha is a startup from Heidelberg, Germany. Aleph Alpha has raised almost $30 million from VCs. In addition to a very good pure language interface, Aleph Alpha has the ability to process multimodal input.

Showcase Aleph Alpha: Response to a combined image-text input. (User: image input and regular text; Model: bold text), (Image credit: Lily Banse on unsplash, https://unsplash.com/@lvnatikk)

AI21 Labs is a Tel Aviv-based startup that last received funding in July 2022. In total, the company has raised $118 million. The platform developed by AI21 was one of the first GPT-3 competitor models.

Showcase: Query databases in natural language. The prompt contains structured information, and the model answers specific questions based on it (User: regular text; Model: bold text). This model is also very good at handling completely unstructured text input.

GPT-3 from OpenAI in San Francisco is the LLM that has probably made the biggest splash in recent years and is a kind of gold standard for language models.

Here is an overview of the participants in the small competition including specific models and test settings.

But what is a large language model anyway?

Large Language Models are systems that usually have a specific architecture (transformer) and have been trained on gigabytes or even terabytes of text from the internet (e.g. Wikipedia). Based on this data, they derive the probabilities of word sequences. In principle, these machines can predict how a sequence of words, a sentence, a story, or a conversation will continue. And that's based purely on their gigantic training data set.
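To make the idea of "probabilities of word sequences" a bit more tangible, here is a deliberately tiny sketch. It is not how any of the tested platforms work internally (they use transformer networks with billions of parameters); it is a simple bigram counter that I am using purely as an illustration. The function names and the mini corpus are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


def continue_text(counts, start, length=5):
    """Greedily append the most probable next word, `length` times at most."""
    out = [start.lower()]
    for _ in range(length):
        probs = next_word_probs(counts, out[-1])
        if not probs:
            break
        out.append(max(probs, key=probs.get))
    return " ".join(out)


# Hypothetical toy corpus; real models see billions of words.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)

print(next_word_probs(model, "the"))         # "cat" gets 0.5, "mat" and "fish" 0.25 each
print(continue_text(model, "on", length=2))  # on the cat
```

A real large language model does the same thing in spirit, predicting what comes next from what it has seen, but it conditions on a long context instead of a single preceding word.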

The test: general knowledge and simple reasoning

I have the models compete against each other by testing their knowledge of the world and their ability to reason. Models that pass can be used for a variety of individual applications: e.g., answering customer questions, analyzing and summarizing texts, and automatically processing emails. This test is specifically designed for Large Language Models: classical rule-based language systems like Alexa drop out once questions reach a certain complexity and can no longer answer them. I asked Alexa, as the most advanced classical language system, the same questions in order to compare the capabilities of the different platform architectures.

1) Facts

These questions each have one or more exactly correct answers. Each model should be able to answer these questions.

2) Soft knowledge / reasonable assessments

The following questions cannot be answered with pure factual knowledge. A conclusion must be drawn or an assessment must be made. Different answers are possible, and several can be correct to some extent.

3) Tough nuts to crack with logical reasoning and contextual understanding

Now it gets very, very difficult. I ask questions that even humans can’t answer easily.

A correct answer to the question would be that there is no exact data on this, or that there are probably about as many married men as married women.

Final countdown: How do the new beasts among the transformers fare against GPT-3?

The result of my subjective test was partly unexpected:

Many thanks to: Kirsten Küppers, Hoa Le van Lessen and Almudena Pereira for inspiration and support with this post!

Stay tuned: My next tests will focus on how the models perform on business-related tasks.

Read the full article on medium.com/@maximilianvogel