The Best AI Model in the World: Google DeepMind’s Gemini Has Surpassed GPT-4



A few hours ago, Google and Google DeepMind announced their much-awaited AI model, Gemini. There’s still not much hands-on feedback on how well it works, but the reported performance is outstanding: it’s better than GPT-4 at pretty much everything.

This article is a quick overview (divided into easy-to-skim sections) of the info we have so far and my first impressions from what I’ve read (haven’t tested the model yet). I’ll go deeper over the coming days as we acquire a better understanding of what Gemini can do, how it’s built (hopefully), how it works, and what it means for the future of AI.

Here’s the outline:

  • Gemini specs, sizes (Ultra, Pro, Nano), and availability.

  • Gemini Ultra is better than GPT-4.

  • Gemini is natively multimodal.

  • My first impressions from the available info.

Gemini is a family of models that comes in three sizes: Ultra, Pro, and Nano. Here’s a summary of the technical report’s section on Gemini sizes and their specifications.

Gemini Ultra is the version that achieves state-of-the-art (SOTA) benchmark results and surpasses GPT-4 across benchmarks (as we’ll see soon). It’s designed to run in data centers, so you won’t be installing this one on your home computer. It’s still undergoing red-teaming and safety review, but it will be available in early 2024 through a new version of Google’s chatbot, Bard Advanced.

Gemini Pro is comparable to GPT-3.5 (not always better, though) and it’s optimized for “cost as well as latency.” If you don’t need the best of the best and cost is a constraint, Pro is probably a better choice than Ultra (just as the free GPT-3.5 version of ChatGPT is, for most tasks, a better deal than paying $20/month for GPT-4). Gemini Pro is already available on Bard (“its biggest upgrade yet”) in 170 countries, in English, though not in the EU or the UK. Google will extend availability to other countries and languages later.

Gemini Nano is the on-device model. Google hasn’t disclosed the parameter counts of Ultra and Pro, but we know Nano comes in two tiers, Nano 1 (1.8B) and Nano 2 (3.25B), for low- and high-memory devices. Gemini Nano is built into Google’s Pixel 8 Pro, which will become an AI-enhanced smartphone all around. This is the beginning of super-Siri mobile assistants. Gemini will also be “available in more of our products and services like Search, Ads, Chrome and Duet AI,” but Google doesn’t specify which size or when.

All of them have a 32K context window, which is notably smaller than the largest ones available, Claude 2.1 (200K) and GPT-4 Turbo (128K). It’s hard to say what context window size is optimal (it depends on the task, obviously), because it’s been reported that models tend to forget a good portion of the context if the window is too large. Gemini models reportedly “make use of their context length effectively,” which is likely an implicit reference to that type of retrieval failure.
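That failure mode is straightforward to probe yourself once you have API access: bury a specific fact deep inside a long filler document and check whether the model can still retrieve it. Below is a minimal sketch of such a test; `query_model` is a hypothetical stand-in for whatever chat or completion API you end up using, and the filler volume is arbitrary.

```python
# Minimal "needle in a haystack" probe for long-context retrieval.
# `query_model` is a hypothetical stand-in for a real chat/completion API call;
# swap in the provider and model of your choice.

def query_model(prompt: str) -> str:
    raise NotImplementedError("Replace with a real API call to your provider.")

def build_haystack(needle: str, filler_sentence: str, n_filler: int, needle_position: float) -> str:
    """Bury `needle` at a relative position (0.0 = start, 1.0 = end) inside repeated filler text."""
    sentences = [filler_sentence] * n_filler
    sentences.insert(int(needle_position * n_filler), needle)
    return " ".join(sentences)

def run_probe(needle_position: float) -> str:
    needle = "The access code for the archive is 7426."
    haystack = build_haystack(
        needle=needle,
        filler_sentence="The quick brown fox jumps over the lazy dog.",
        n_filler=3000,  # roughly tens of thousands of tokens of filler
        needle_position=needle_position,
    )
    prompt = f"{haystack}\n\nQuestion: What is the access code for the archive? Answer with the number only."
    return query_model(prompt)

# Sweeping needle_position from 0.0 to 1.0 shows whether retrieval degrades
# when the relevant fact sits in the middle of a very long context.
```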

As you would expect given the ubiquitous preference for closedness in the AI space today, we know nothing about the training or fine-tuning datasets (except that the dataset comprises “data from web documents, books, and code, and includes image, audio, and video data”), or about the architecture of the models (besides that they “build on top of Transformer decoders” and are “enhanced with improvements in architecture and model optimization”).

It’s rather hilarious to say this, but we will have to wait until Meta releases its next model to know more. An open-source Llama 3, if it compares at all with GPT-4 and Gemini performance-wise, will shed some light on how these models are built and what they’re trained on.

A final note on something that went mostly overlooked: Google DeepMind has also released AlphaCode 2, built on top of Gemini. It solved 1.7× more problems than its predecessor, AlphaCode, and performed better than 85% of competition participants. This is relevant mostly for competitive programming, but worth mentioning here.

At both the scientific and the business level, this is probably the most important news. For the first time in almost a year, an AI model has surpassed GPT-4. Gemini Ultra has achieved SOTA on 30 out of 32 “widely-used academic benchmarks.” From the blog post:

With a score of 90.0%, Gemini Ultra is the first model to outperform human experts on MMLU (massive multitask language understanding), which uses a combination of 57 subjects such as math, physics, history, law, medicine and ethics for testing both world knowledge and problem-solving abilities… Gemini Ultra also achieves a state-of-the-art score of 59.4% on the new MMMU benchmark, which consists of multimodal tasks spanning different domains requiring deliberate reasoning.

Gemini Ultra surpasses GPT-4 in 17 out of 18 of the reported benchmarks, including MMLU (90.0% vs. 86.4%, using a new type of chain-of-thought approach) and the new multimodality benchmark MMMU (59.4% vs. 56.8%). Interestingly, Gemini is not that much better than GPT-4. As I see it, this says more about how hard it is to improve these systems than about, say, Google’s inability to take on OpenAI. The full comparison across those and other text and multimodality benchmarks is laid out in Google’s announcement and the technical report.
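On the MMLU number in particular, part of the gain reportedly comes from that new chain-of-thought prompting setup. As a rough illustration of the general family of techniques (sampling several reasoning chains and aggregating their answers), and emphatically not Gemini’s exact recipe, here’s a minimal sketch in which `sample_reasoning` is a hypothetical stand-in for a model call with a chain-of-thought prompt:

```python
# Minimal sketch of majority-vote ("self-consistency") chain-of-thought.
# It illustrates the general idea of aggregating several sampled reasoning
# chains; it is NOT Gemini's exact evaluation protocol.
from collections import Counter

def sample_reasoning(question: str, temperature: float = 0.7) -> str:
    """Hypothetical model call returning one reasoning chain ending in 'Answer: X'."""
    raise NotImplementedError("Replace with a real sampled model call.")

def extract_answer(chain: str) -> str:
    return chain.rsplit("Answer:", 1)[-1].strip()

def majority_vote_answer(question: str, k: int = 32) -> str:
    answers = [extract_answer(sample_reasoning(question)) for _ in range(k)]
    answer, count = Counter(answers).most_common(1)[0]
    # A production setup might fall back to a single greedy answer when
    # count / k is below some confidence threshold.
    return answer
```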


If you want to know more about Gemini’s capabilities from real-world testing (e.g., reasoning and understanding, solving math and coding problems, etc.), I recommend watching the videos in Google DeepMind’s interactive blog post and the comprehensive demo that CEO Sundar Pichai published on X (both very much worth watching to better understand what the numbers above mean).

I think this is enough PR for Google on Gemini’s performance until we can truly test what it’s capable of. I will leave this excerpt from the technical report’s conclusion here in case you have the wrong impression that Gemini has overcome all the problems that plague modern AI systems; hallucinations and high-level reasoning remain unsolved:

Despite their impressive capabilities, we should note that there are limitations to the use of LLMs. There is a continued need for ongoing research and development on “hallucinations” generated by LLMs to ensure that model outputs are more reliable and verifiable. LLMs also struggle with tasks requiring high-level reasoning abilities like causal understanding, logical deduction, and counterfactual reasoning even though they achieve impressive performance on exam benchmarks.

The word to note here is “natively,” but let’s do a recap on multimodality first. I wrote about the importance of multimodality very recently. A multimodal AI can work with different data types, in contrast to, say, pure language models that only accept text as input and generate text as output. Here’s a brief explanation from my article:

To put in concrete terms what multimodality looks like in AI, let’s say that, on the weakest side of the spectrum, we have vision + language. DALL-E 3 (takes text as input and makes images as output) and GPT-4 (takes text or images as input and makes text) are prominent examples of weak multimodality. The strongest side remains unexplored, but in principle, AI could acquire every sensory modality that humans have (and more), including those that provide action capabilities (e.g., proprioception and a sense of equilibrium for robotics).

Gemini is, as of now, the strongest model on the multimodal spectrum, including text, code, images, audio, and video. From the technical report:

Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs.
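What interleaved input looks like in practice will depend on how Google exposes the models. As a rough sketch, assuming access through Google’s generative AI Python SDK (google-generativeai), a text-plus-image prompt could look something like this; the model name and exact availability are my assumptions, not something confirmed in the announcement:

```python
# Rough sketch of an interleaved text + image prompt, assuming access through
# Google's generative AI Python SDK (google-generativeai). The model name and
# availability are assumptions, not confirmed details from the announcement.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-pro-vision")  # hypothetical multimodal tier
response = model.generate_content([
    "Here is a chart from a quarterly report:",  # text part
    Image.open("revenue_chart.png"),             # image part, interleaved with the text
    "Summarize the trend it shows in two sentences.",
])
print(response.text)
```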

Multimodality is required for a deeper understanding of the world. Some argue that language models develop internal world models as they learn to predict the next word from statistical correlations in text data, but, if that’s true, those world models are very limited. As scientists build models that can parse more modalities of information, their internal representations grow richer; at the extreme, they would match ours.

However, there are two different ways to build multimodal AI, and here’s where Gemini’s natively multimodal design shines through. The first way, which has been explored many times before, consists of stitching together separate modules, each capable of processing a different kind of input or output. This works superficially but doesn’t give the system the means to encode a richer multimodal world model. Here’s what Demis Hassabis, Google DeepMind CEO, writes about this in the blog post:

Until now, the standard approach to creating multimodal models involved training separate components for different modalities and then stitching them together to roughly mimic some of this functionality. These models can sometimes be good at performing certain tasks, like describing images, but struggle with more conceptual and complex reasoning.

The second way, which presumably only Gemini has adopted so far, requires building the AI system to be multimodal from the ground up. Gemini, unlike GPT-4, has been pre-trained on multimodal data from the very beginning and then fine-tuned on more of it. Here’s Hassabis’ take on this new approach:

We designed Gemini to be natively multimodal, pre-trained from the start on different modalities. Then we fine-tuned it with additional multimodal data to further refine its effectiveness. This helps Gemini seamlessly understand and reason about all kinds of inputs from the ground up, far better than existing multimodal models — and its capabilities are state of the art in nearly every domain.

This second approach resembles much more closely how the human brain learns from multisensory contact with our multimodal world. If there’s a path to true general intelligence (or at least human-level intelligence, which is not the same thing), it runs through this kind of by-default multimodality. The video demo is a clear display of the impressive capabilities that native multimodality confers.
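To make the contrast between the two approaches a bit more concrete, here’s a deliberately simplified PyTorch sketch. It does not reflect Gemini’s actual architecture, which hasn’t been disclosed; it only illustrates the difference between stitching a vision encoder onto a text model after the fact and feeding a single decoder-style stack an interleaved multimodal token sequence from the start:

```python
# Simplified, illustrative contrast between the two multimodal designs discussed
# above (PyTorch). Dimensions, module choices, and names are all hypothetical;
# Gemini's actual architecture has not been disclosed.
import torch
import torch.nn as nn

D = 512  # shared hidden size, arbitrary for illustration


class LateFusionModel(nn.Module):
    """Approach 1: separately trained components stitched together with an adapter."""

    def __init__(self, vocab_size: int = 32000):
        super().__init__()
        # Stand-ins for a pretrained image encoder and a pretrained text-only LM
        # (causal masking and real pretrained weights omitted for brevity).
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, D))
        self.adapter = nn.Linear(D, D)  # thin "stitching" layer mapping vision features into the LM
        self.text_embed = nn.Embedding(vocab_size, D)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2
        )

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        img_token = self.adapter(self.vision_encoder(image)).unsqueeze(1)  # one pseudo-token per image
        text_tokens = self.text_embed(token_ids)
        return self.lm(torch.cat([img_token, text_tokens], dim=1))


class NativeMultimodalModel(nn.Module):
    """Approach 2: one stack consumes interleaved multimodal tokens from the start of training."""

    def __init__(self, vocab_size: int = 32000, patch_dim: int = 768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, D)
        self.patch_embed = nn.Linear(patch_dim, D)  # image patches become ordinary sequence tokens
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2
        )

    def forward(self, token_ids: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        sequence = torch.cat([self.text_embed(token_ids), self.patch_embed(image_patches)], dim=1)
        return self.backbone(sequence)


# In the first model, cross-modal structure is learned only through the thin adapter,
# added after the fact; in the second, every layer sees both modalities during pre-training.
```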

The next steps, as I argued in my recent article, are planning and robotics:

It’s a matter of time before AI companies develop systems that can see, listen, talk, create, move, plan, and make reasonable decisions using external information and knowledge to achieve a goal. Google DeepMind’s Gemini and OpenAI’s Q* are presumably steps in this direction (particularly by solving planning with learning and search).
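To give “solving planning with learning and search” a bit more shape, here’s a minimal, purely hypothetical sketch of the general pattern: a model proposes candidate next steps, a learned value function scores partial plans, and a small beam search ties the two together. It illustrates the idea only; neither lab has confirmed building anything like this for Gemini or Q*:

```python
# Minimal, hypothetical sketch of "planning with learning and search":
# a generative model proposes candidate steps, a learned value function scores
# partial plans, and a small beam search keeps the most promising ones.
# `propose_steps` and `estimate_value` are stand-ins, not real APIs.
from typing import List, Tuple


def propose_steps(goal: str, plan_so_far: List[str], k: int) -> List[str]:
    """Hypothetical model call returning k candidate next steps for the plan."""
    raise NotImplementedError("Replace with a sampled model call.")


def estimate_value(goal: str, plan: List[str]) -> float:
    """Hypothetical learned value function scoring how promising a partial plan is."""
    raise NotImplementedError("Replace with a learned scorer.")


def beam_search_plan(goal: str, depth: int = 4, beam_width: int = 3, branching: int = 4) -> List[str]:
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(depth):
        candidates: List[Tuple[float, List[str]]] = []
        for _, plan in beams:
            for step in propose_steps(goal, plan, k=branching):
                new_plan = plan + [step]
                candidates.append((estimate_value(goal, new_plan), new_plan))
        # Keep only the highest-scoring partial plans for the next round.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]
```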

Hassabis confirmed to Casey Newton from Platformer that they are “thinking very heavily about … agent-based systems and planning systems.” In a conversation with Wired’s Will Knight, he repeated a similar vision for combining Gemini with robotics: “To become truly multimodal, you’d want to include touch and tactile feedback … There’s a lot of promise with applying these sort of foundation-type models to robotics, and we’re exploring that heavily.”

Feel free to take those comments as a high-level roadmap of what Google DeepMind plans to do in 2024.

Google has fulfilled its implicit promise: Gemini is better than GPT-4 across pretty much all benchmarks. That, by itself, makes it worth the many, many millions it may have cost; it’s the first time in four years that anyone has taken the lead from OpenAI. In any case, before we hype up Gemini too much, we should wait for Google to announce Bard Advanced in early 2024 to test it against GPT-4 Turbo and decide which one is better. Perhaps the right question to ask now is: Can Gemini improve over time faster than GPT thanks to its (unknown) architecture? But, of course, we don’t know the answer to that.

It’s notable that, if you look closely at the reported benchmark numbers, Gemini beat GPT-4 by only a few percentage points at most (remember that GPT-4 finished training in 2022). I think this is evidence that it’s extremely hard to make models better with current approaches, more than evidence that Google DeepMind’s researchers are “worse” than OpenAI’s; these two companies have the best AI talent in the world, so this is, literally, the peak of what humanity can do in AI right now. Should they start exploring other paradigms? I feel things are changing and we are close to saying goodbye to the hegemony of transformer-based language models.

Perhaps more notable, even if unsurprising, is Google DeepMind’s embrace of closedness (just like OpenAI and Anthropic). They shared nothing of value about the training or fine-tuning datasets and nothing of value about the architecture. This reveals that, in a strict sense, Gemini is much less a scientific project than a business product. That’s not bad per se (it depends on whether you are a researcher or a user), it’s just not what DeepMind used to be about. Just as Microsoft’s capture of OpenAI in 2019 forced it to pivot toward production and product-market-fit strategies, Google is now leveraging DeepMind toward the same end, to the same degree.

Back to science. Planning, agents, and robotics are next. I predict we will see slower progress on those greater challenges in the coming months and years than we’ve seen on language modeling (remember Moravec’s paradox). Hassabis thinks Gemini will display capabilities we haven’t seen before, but I don’t think those will be true breakthroughs (in the grand scheme of things) compared to what OpenAI already has. Still, Hassabis talked with Newton about this, so I remain excited anyway: “I think we will see some new capabilities. This is part of what the Ultra testing is for. We’re kind of in beta — to safety-check it, responsibility-check it, but also to see how else it can be fine-tuned.”

Finally, although Sundar Pichai has branded this release as the beginning of the “Gemini era,” I think the true value for Google is recovering part of the trust it had lost, year after year, as an 800-employee startup repeatedly managed to leave it behind. This is Google’s revenge on all the people who claimed it couldn’t ship at all. This is its revenge on OpenAI and the impeccable marketing around ChatGPT and GPT-4. We will see whether it works as well for Google and, more importantly, how long it lasts.
