What is Google Gemini?
Google Gemini is a family of multimodal artificial intelligence (AI) large language models that have capabilities in language, audio, code and video understanding.
Gemini 1.0 was announced on Dec. 6, 2023, and was built by Alphabet's Google DeepMind business unit, which is focused on advanced AI research and development. Google co-founder Sergey Brin is credited with helping develop the Gemini large language models (LLMs), alongside other Google staff.
At its release, Gemini was the most advanced set of LLMs at Google, superseding the company's Pathways Language Model 2 (PaLM 2), which was released on May 10, 2023. As was the case with PaLM 2, Gemini is integrated into multiple Google technologies to provide generative AI capabilities. Among the most visible user-facing examples of Gemini in action is the Google Bard AI chatbot, which was previously powered by PaLM 2.
Gemini integrates natural language processing capabilities, enabling it to understand and process both input queries and data. It also has image understanding and recognition capabilities that enable parsing of complex visuals, such as charts and figures, without the need for external optical character recognition (OCR).
Gemini also has broad multilingual capabilities, enabling translation tasks, as well as functionality across different languages. For example, Gemini is capable of mathematical reasoning and summarization in multiple languages. It can also generate captions for an image in different languages.
Unlike prior models from Google, Gemini has native multimodality, meaning it's trained end to end on data sets spanning multiple data types. The multimodal nature of Gemini enables cross-modal reasoning abilities. That means Gemini can reason across a sequence of different input data types, including audio, images and text.
For example, the Gemini models can understand handwritten notes, graphs and diagrams to solve complex problems. The Gemini architecture supports directly ingesting text, images, audio waveforms and video frames as interleaved sequences.
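The idea of interleaved multimodal input can be illustrated with a minimal sketch. The part types and dictionary structure below are illustrative assumptions for clarity, not Gemini's actual internal representation:

```python
# Illustrative sketch: an interleaved multimodal prompt as an ordered list
# of typed "parts". The part types and structure here are assumptions for
# illustration, not Gemini's internal format.

def make_text_part(text):
    return {"type": "text", "data": text}

def make_image_part(image_bytes):
    return {"type": "image", "data": image_bytes}

# A single prompt can interleave modalities in order:
prompt = [
    make_text_part("What trend does this chart show?"),
    make_image_part(b"\x89PNG..."),          # raw image bytes (placeholder)
    make_text_part("Answer in one sentence."),
]

# A natively multimodal model consumes the parts as one ordered sequence,
# so cross-modal reasoning can reference earlier parts of any type.
modalities = [part["type"] for part in prompt]
print(modalities)  # ['text', 'image', 'text']
```

The key point is that the modalities are not processed by separate models and stitched together afterward; they arrive as one sequence.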
At launch, Gemini was made up of a series of different model sizes, each designed for a specific set of use cases and deployment environments. The Ultra model is the top end and is designed for highly complex tasks. Ultra was not available at the initial Gemini launch; Google targeted early 2024 for its availability. The Pro model is designed for performance and deployment at scale. A version of Gemini Pro is used to power Google Bard. As of Dec. 13, 2023, Google enabled access to Gemini Pro in Google Cloud Vertex AI and Google AI Studio. For code, a version of the Gemini Pro model powers the Google AlphaCode 2 generative AI coding technology.
The Nano model is targeted at on-device use cases. There are two different versions of Gemini Nano: Nano-1 is a 1.8 billion-parameter model, while Nano-2 is a 3.25 billion-parameter model. Among the places where Nano is being embedded is the Google Pixel 8 Pro smartphone.
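A back-of-the-envelope calculation shows why parameter count matters for on-device deployment. The 4-bit weight quantization assumed below is a common on-device technique, used here purely as an illustrative assumption, not a published Gemini specification:

```python
# Rough estimate of on-device weight memory for the Nano models.
# Assumes 4-bit (0.5 byte per parameter) quantized weights -- an
# illustrative assumption, not a published Gemini specification.

def weight_memory_gb(num_params, bytes_per_param=0.5):
    return num_params * bytes_per_param / 1e9

nano_1 = weight_memory_gb(1.8e9)   # Nano-1: 1.8 billion parameters
nano_2 = weight_memory_gb(3.25e9)  # Nano-2: 3.25 billion parameters
print(f"Nano-1 ~{nano_1:.2f} GB, Nano-2 ~{nano_2:.2f} GB")
# Nano-1 ~0.90 GB, Nano-2 ~1.62 GB
```

Under these assumptions, both models fit comfortably within a modern smartphone's memory budget, which is what makes on-device inference plausible.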
Across all the Gemini models, Google has claimed that it has followed responsible development practices, including extensive evaluation to help limit the risk of bias and potential harms.
What can Gemini do?
The Google Gemini models are capable of many tasks across multiple modalities, including text, image, audio and video understanding. The multimodal nature of Gemini also enables different modalities to be combined to understand and generate an output.
Tasks that Gemini can do include the following:
- Text summarization. Gemini models can summarize content from different types of data.
- Text generation. Gemini can generate text based on a user prompt. Generation can also be driven through a Q&A-style chatbot interface.
- Text translation. The Gemini models have broad multilingual capabilities, enabling translation and understanding of more than 100 languages.
- Image understanding. Gemini can parse complex visuals, such as charts, figures and diagrams, without external OCR tools. It can be used for image captioning and visual Q&A capabilities.
- Audio processing. Gemini has support for speech recognition across more than 100 languages and audio translation tasks.
- Video understanding. Gemini can process and understand video clip frames to answer questions and generate descriptions.
- Multimodal reasoning. A key strength of Gemini is multimodal reasoning, where different types of data can be mixed for a prompt to generate an output.
- Code analysis and generation. Gemini can understand, explain and generate code in popular programming languages, including Python, Java, C++ and Go.
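For developers, tasks such as text summarization are driven through a prompt. As a sketch, the JSON body below follows the general shape of the public Gemini `generateContent` REST API at the time of writing; the exact endpoint and field names should be verified against Google's current documentation:

```python
import json

# Sketch of a summarization request in the general shape of the Gemini
# generateContent REST API. Field names follow Google's public docs at the
# time of writing; verify against current documentation before relying on them.
article_text = "Gemini is a family of multimodal models announced in December 2023."

request_body = {
    "contents": [
        {
            "parts": [
                {"text": "Summarize the following article in two sentences:"},
                {"text": article_text},
            ]
        }
    ]
}

payload = json.dumps(request_body)
# This payload would be POSTed to an endpoint of the form:
# https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key=API_KEY
print(payload[:40])
```

Multimodal tasks follow the same pattern, with image or other media parts interleaved alongside the text parts.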
How does Google Gemini work?
Google Gemini works by first being trained on a massive corpus of data. After training, the model uses neural network techniques to understand content, answer questions, generate text and produce outputs.
Specifically, the Gemini LLMs use a transformer model-based neural network architecture. The Gemini architecture has been enhanced to process lengthy contextual sequences across different data types, including text, audio and video. Google DeepMind made use of efficient attention mechanisms in the transformer decoder to help the models process long contexts, spanning different modalities.
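The core operation of a transformer is scaled dot-product attention, sketched minimally below in pure Python. Production systems like Gemini use optimized "efficient" attention variants to keep long, multimodal sequences tractable; this sketch shows only the underlying math:

```python
import math

# Minimal sketch of scaled dot-product attention, the core operation of a
# transformer decoder. Real systems use optimized attention variants to
# handle long sequences; this shows only the basic computation.

def softmax(xs):
    m = max(xs)                              # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    d = len(queries[0])                      # key/query dimension
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)            # attention weights sum to 1
        # Output is a weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out = attention(q, k, v)
print(out)  # the query matches the first key, so output leans toward v[0]
```

Efficient attention mechanisms reduce the quadratic cost of computing every query-key score, which is what makes the long contexts described above practical.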
Gemini models were trained on diverse multimodal and multilingual data sets of text, images, audio and video with Google DeepMind using advanced data filtering to optimize training. As different Gemini models are deployed in support of specific Google services, there is a process of targeted fine-tuning that can be used to further optimize a model for a use case.
During both the training and inference phases, Gemini benefits from Google's latest tensor processing unit (TPU) v5 chips, custom AI accelerators optimized to efficiently train and deploy large models.
A key challenge for LLMs is the risk of bias and potentially toxic content. According to Google, Gemini underwent extensive safety testing and mitigation around risks such as bias and toxicity to help provide a degree of LLM safety.
To validate performance, the models were tested against academic benchmarks spanning the language, image, audio, video and code domains.
Gemini vs. GPT-3 and GPT-4
| | Gemini | GPT-3 and GPT-4 |
| --- | --- | --- |
| Modality | Multimodal; trained on text, images, audio and video | Originally built as a text-only language model; GPT-4V enables visual input |
| Model variants | Size-based variations, including Ultra, Pro and Nano | Optimizations for size, including GPT-3.5 Turbo and GPT-4 Turbo |
| Context window length | | |
Applications that use Gemini
Gemini was developed by Google as a foundation model and is widely integrated across various Google services. Gemini is also available for developers to use in building their own applications.
Applications that use Gemini include the following:
- Bard. Google's conversational AI service uses a fine-tuned version of Gemini Pro for advanced reasoning and chatbot capabilities.
- AlphaCode 2. Google DeepMind's AlphaCode 2 code generation tool makes use of a customized version of Gemini Pro.
- Google Pixel. The Google-built Pixel 8 Pro smartphones are the first devices engineered to run Gemini Nano on device. Gemini is used to power new features, such as summarization in Recorder and Smart Reply in Gboard for messaging apps.
- Android 14. Pixel 8 Pro is the first but won't be the only Android smartphone to benefit from Gemini. Android developers will be able to build with Gemini Nano through the AICore system capability.
- Vertex AI. Google Cloud's Vertex AI service, which provides foundation models that developers can use to build applications, provides access to Gemini Pro.
- Google AI Studio. Developers can prototype and build apps with Gemini via the Google AI Studio web-based tool.
- Search. Google is experimenting with using Gemini in its Search Generative Experience to reduce latency and improve quality.
Future of Gemini
As part of the initial launch of Gemini on Dec. 6, 2023, Google provided direction on the future of its next-generation LLMs.
The biggest piece of Gemini's future is the Gemini Ultra model, which was not made available at the same time as Gemini Pro and Gemini Nano. At launch, Google said Gemini Ultra would be made available to select customers, developers, partners and experts for early experimentation and feedback before a full rollout to developers and enterprises in early 2024.
Gemini Ultra will also be the foundation for what Google refers to as a Bard Advanced experience, which will be an updated, more powerful and capable version of the Bard chatbot.
The future of Gemini is also about a broader rollout and integrations across the Google portfolio. Gemini will find its way into the Google Chrome browser to help improve the web experience for users. Google has also pledged to integrate Gemini into the Google Ads platform, providing new ways for advertisers to connect with and engage users. The Duet AI assistant is also set to benefit from Gemini in the future.