Users of Google developer tools will get the first glimpse of the cloud provider's next large language model, Gemini 2.0, as an experimental Flash version appears this month in Google AI Studio and the Gemini API.

Gemini 2.0 Flash will have three important new capabilities. The first is additional output modalities, including audio and images. The second is support for native tool usage, which means the model understands how to use tools in a multistage workflow and which tools are beneficial to use when. Finally, the new Gemini model will feature a new multimodal bidirectional API that can output audio or text in immediate response to audio or video prompts.
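
To make the idea of tool use in a multistage workflow concrete, the following is a minimal Python sketch of the general pattern, not Google's implementation or the Gemini API. Everything in it is a hypothetical stand-in: `stub_model` plays the role of the model deciding which tool to call next, and `search_recipes` and `convert_units` are invented tools.

```python
from typing import Callable, Dict, List

# Hypothetical tools; a real application would register its own functions.
def search_recipes(query: str) -> str:
    return f"recipe for {query}"

def convert_units(text: str) -> str:
    return text + " (metric units)"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search_recipes": search_recipes,
    "convert_units": convert_units,
}

def stub_model(history: List[str]) -> dict:
    """Stand-in for the model: picks the next tool or returns a final answer.

    This is an assumption for illustration only. A real model would make
    this decision from the conversation, not from the history length.
    """
    if len(history) == 1:
        return {"tool": "search_recipes", "arg": history[0]}
    if len(history) == 2:
        return {"tool": "convert_units", "arg": history[-1]}
    return {"final": history[-1]}

def run_workflow(prompt: str, max_steps: int = 5) -> str:
    """Loop: ask the model what to do, run the chosen tool, feed the result back."""
    history = [prompt]
    for _ in range(max_steps):
        decision = stub_model(history)
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](decision["arg"])
        history.append(result)
    raise RuntimeError("workflow did not finish within max_steps")

print(run_workflow("pancakes"))
```

In this sketch the application only runs whichever tool is requested and returns the result; the choice of tool and the order of steps stay with the model, which is the part the announcement describes as native.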

The experimental version of Gemini 2.0 Flash is available now in Google AI Studio and through the Gemini API. If Gemini 2.0 follows a roadmap similar to that of Gemini 1.0 and 1.5, a higher-scale Pro version will eventually follow. According to Google officials, the new Gemini model already outperforms Gemini 1.5 Flash and Pro on standard benchmarks for AI accuracy at twice the speed.

One user of Google's Vertex AI Studio, which will get access to Gemini 2.0 Flash via the Gemini API, said he's interested in testing out the new features, especially since they're becoming available in the manageably sized Flash version.

"What stands out is the emphasis on their Flash model, which is efficient and fast," said David Strauss, CTO at WebOps service provider Pantheon. "Most industry announcements focus on frontier models, which are great for showing the limits of AI capability but are inefficient to run at scale."

Google officials declined to disclose pricing for the new model. If Gemini 2.0 follows Gemini 1.5's development, Gemini 2.0 Flash would eventually become available as part of Google's free AI offerings, Strauss said.

Audio and image output from the new Gemini model, whose predecessors supported only text responses, will allow developers to create new AI-driven application interfaces, such as voice-enabled assistants with visual aids. The model will also be able to generate a mix of text and images in response to voice commands. These outputs will be "steerable," meaning developers can build on and refine outputs using conversational natural language.

Google product managers demonstrated the new Gemini model during a press briefing this week, conducting short audio and video conversations with the model to prompt various audio and text responses, including the text of a recipe with embedded AI-generated images.

Multimodal support within a single large language model (LLM) is still relatively rare, said Andy Thurai, an analyst at Constellation Research.

"The same model can understand text, code, audio, etc., and output different modalities based on need" with Gemini 2.0 Flash, he said. "Most other offerings do model switching based on need. While that is not a big deal, as a model gateway can … route requests to an appropriate model, for enterprises that strictly use one vetted model, this can be useful."