July 2, 2024

Gemini's data analysis capabilities are not as good as Google claims

One of the selling points of Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press conferences and demos, Google has repeatedly claimed that, thanks to their "long context," the models can perform tasks that were previously impossible, such as summarizing multiple documents of hundreds of pages or searching through scenes from movie footage.

But new research suggests that, in fact, the models aren't very good at those things.

Two separate studies investigated how well Gemini models from Google and others make sense of enormous amounts of data (think the length of "War and Peace"). Both concluded that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large data sets correctly; in one series of document-based tests, the models gave the correct answer only 40% to 50% of the time.

“While models like Gemini 1.5 Pro can technically process long contexts, we've seen many cases that indicate the models don't actually 'understand' the content,” Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, told TechCrunch.

Gemini's context window falls short

A model’s context, or context window, refers to the input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question like “Who won the 2020 US presidential election?” can serve as context, as can a movie script, a show, or an audio clip. And as context windows grow, so does the size of the documents that fit within them.

Newer versions of Gemini can accept more than 2 million tokens as context. (“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas,” and “tic” in the word “fantastic.”) That’s roughly equivalent to 1.4 million words, two hours of video, or 22 hours of audio — the largest context of any commercially available model.
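To make the arithmetic concrete, here is a minimal sketch of how one might estimate whether a document fits in a long context window. Gemini uses its own tokenizer, so the tiktoken encoding below is only an illustrative stand-in, and the words-per-token ratio is the rough approximation quoted above.

```python
# Rough sketch: estimate whether a document fits in a long context window.
# Assumption: tiktoken's cl100k_base encoding is an illustrative stand-in;
# Gemini's own tokenizer would produce different counts.
import tiktoken

CONTEXT_LIMIT = 2_000_000  # tokens, per Google's stated limit for newer Gemini versions

def fits_in_context(text: str, limit: int = CONTEXT_LIMIT) -> bool:
    encoder = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(encoder.encode(text))
    # ~0.7 words per token, matching the rough "2 million tokens ~ 1.4 million words" figure
    print(f"~{n_tokens:,} tokens (~{int(n_tokens * 0.7):,} words)")
    return n_tokens <= limit

# Hypothetical usage: check a long novel against the window
with open("war_and_peace.txt", encoding="utf-8") as f:
    print(fits_in_context(f.read()))
```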

In a briefing earlier this year, Google showed off several pre-recorded demos aimed at illustrating the potential of Gemini's long-context capabilities. In one of them, Gemini 1.5 Pro searched the transcript of the Apollo 11 Moon landing broadcast (about 402 pages) for quotes containing jokes and then found a scene in the broadcast that looked similar to a pencil sketch.

Google DeepMind vice president of research Oriol Vinyals, who led the briefing, described the model as “magical.”

“[1.5 Pro] performs these types of reasoning tasks on every page, on every word,” he said.

Maybe that was an exaggeration.

In one of the aforementioned studies comparing these capabilities, Karpinska, along with researchers at the Allen Institute for AI and Princeton, asked models to evaluate true or false statements about fiction books written in English. The researchers chose recent works so that the models couldn’t “cheat” by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to understand without reading the books in their entirety.

Faced with a statement like “By using her abilities as Apoth, Nusis can reverse engineer the type of portal opened by the reagent key found in Rona's wooden chest,” Gemini 1.5 Pro and 1.5 Flash, having ingested the relevant book, had to say whether the statement was true or false and explain their reasoning.

Image credits: University of Massachusetts Amherst

Testing the models on a book about 260,000 words (~520 pages) in length, the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flip would answer questions about the book significantly better than Google's latest machine learning model. Averaging across all the benchmark results, neither model managed to achieve better-than-random accuracy in answering questions.

“We have noticed that the models have a harder time verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be resolved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models have difficulty verifying claims about implicit information that is clear to a human reader but is not explicitly expressed in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason about” videos, that is, to search through them and answer questions about their content.

The coauthors created a set of images (e.g., a photo of a birthday cake) along with questions for the model to answer about the objects depicted in the images (e.g., “What cartoon character is on this cake?”). To evaluate the models, they chose one of the images at random and inserted “distractor” images before and after it to create slideshow-like sequences of images.
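A hedged sketch of how such a slideshow-style prompt might be assembled is below; the image file names and the ask_model() helper are hypothetical stand-ins, not the study's actual code.

```python
# Hedged sketch: assemble a "slideshow" prompt with distractor images.
# Assumptions: the file names and ask_model() are hypothetical; the study's
# actual construction details may differ.
import random

def build_slideshow(target_image: str, distractors: list[str]) -> list[str]:
    """Insert the target image at a random position among distractor images."""
    position = random.randrange(len(distractors) + 1)
    return distractors[:position] + [target_image] + distractors[position:]

slides = build_slideshow(
    "birthday_cake.jpg",
    [f"distractor_{i:02d}.jpg" for i in range(24)],  # 24 distractors -> 25 images total
)
question = "What cartoon character is on this cake?"
# answer = ask_model(images=slides, prompt=question)  # hypothetical model call
```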

Flash didn't perform especially well. In a test where the model transcribed six handwritten digits from a “slideshow” of 25 images, Flash got about 50% of the transcriptions correct. Accuracy dropped to around 30% with eight digits.

“On real image question answering tasks, it seems to be particularly difficult for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That small amount of reasoning — recognizing that a number is in a frame and reading it — could be what’s breaking the model.”

Google overpromises with Gemini

Neither study has been peer-reviewed, and neither tested the versions of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts (both tested the 1-million-token versions). And Flash is not intended to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.

However, both add fuel to the fire that Google has been over-promising (and under-delivering) with Gemini from the beginning. None of the models the researchers tested, including OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, performed well. But Google is the only model provider that has given the context window top billing in its advertisements.

“There’s nothing wrong with simply stating ‘our model can accept X amount of tokens’ based on objective technical details,” Saxon said. “But the question is: what useful thing can you do with it?”

Generative AI, more broadly, is coming under increasing scrutiny as businesses (and investors) grow increasingly frustrated with the technology’s limitations.

In a pair of recent surveys by Boston Consulting Group, about half of respondents (all senior executives) said they do not expect generative AI to deliver substantial productivity gains and that they are concerned about the potential for errors and data compromise from generative AI-powered tools. PitchBook recently reported that early-stage generative AI dealmaking has declined for two consecutive quarters, falling 76% from its Q3 2023 peak.

Faced with meeting-summarizing chatbots that conjure fictional details about people, and AI search platforms that amount to plagiarism generators, customers are on the hunt for promising differentiators. Google, which has raced, at times clumsily, to catch up with its generative AI rivals, was desperate to make Gemini's context one of those differentiators.

But it seems that the bet was premature.

“We haven't found a way to actually demonstrate that 'reasoning' or 'understanding' is occurring in long documents, and basically every group that publishes these models is cobbling together their own ad hoc assessments to make these claims,” Karpinska said. “Without knowledge of how long context processing is implemented (and companies do not share these details), it is difficult to say how realistic these claims are.”

Google did not respond to a request for comment.

Both Saxon and Karpinska believe the antidotes to the overblown claims about generative AI are better benchmarks and, along the same lines, a greater emphasis on third-party critique. Saxon notes that one of the most common tests for long context (cited liberally by Google in its marketing materials), “the needle in the haystack,” only measures a model’s ability to retrieve particular information, such as names and numbers, from data sets, not answer complex questions about that information.
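For context, a minimal sketch of what a single needle-in-a-haystack trial looks like is below; the query_model() helper and the needle sentence are invented for illustration, and real harnesses vary the needle's depth and the haystack's length systematically.

```python
# Minimal sketch of a single needle-in-a-haystack trial.
# Assumptions: query_model() is a hypothetical stand-in for a model API call,
# and the needle sentence is invented for illustration.
import random

NEEDLE = "The secret passcode is 73914."
QUESTION = "What is the secret passcode mentioned in the text?"

def needle_in_haystack_trial(query_model, filler_paragraphs: list[str]) -> bool:
    """Bury a known fact in filler text and check whether the model retrieves it."""
    docs = list(filler_paragraphs)
    docs.insert(random.randrange(len(docs) + 1), NEEDLE)  # bury the needle at a random depth
    prompt = "\n\n".join(docs) + "\n\n" + QUESTION
    return "73914" in query_model(prompt)
```

Passing such a retrieval check, as Saxon notes, says little about whether a model can answer complex questions that require synthesizing the surrounding text.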

“All the scientists and most of the engineers who use these models essentially agree that our current benchmark culture is broken,” Saxon said, “so it's important for the public to understand to take these giant reports containing numbers like 'general intelligence across benchmarks' with a massive grain of salt.”
