-
Gemma 3 Technical Report
Authors:
Gemma Team,
Aishwarya Kamath,
Johan Ferret,
Shreya Pathak,
Nino Vieillard,
Ramona Merhej,
Sarah Perrin,
Tatiana Matejovicova,
Alexandre Ramé,
Morgane Rivière,
Louis Rouillard,
Thomas Mesnard,
Geoffrey Cideron,
Jean-bastien Grill,
Sabela Ramos,
Edouard Yvinec,
Michelle Casbon,
Etienne Pot,
Ivo Penchev,
Gaël Liu,
Francesco Visin,
Kathleen Kenealy,
Lucas Beyer,
Xiaohai Zhai,
Anton Tsitsulin, et al. (191 additional authors not shown)
Abstract:
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
Submitted 25 March, 2025;
originally announced March 2025.
-
Gemma 2: Improving Open Language Models at a Practical Size
Authors:
Gemma Team,
Morgane Riviere,
Shreya Pathak,
Pier Giuseppe Sessa,
Cassidy Hardin,
Surya Bhupatiraju,
Léonard Hussenot,
Thomas Mesnard,
Bobak Shahriari,
Alexandre Ramé,
Johan Ferret,
Peter Liu,
Pouya Tafti,
Abe Friesen,
Michelle Casbon,
Sabela Ramos,
Ravin Kumar,
Charline Le Lan,
Sammy Jerome,
Anton Tsitsulin,
Nino Vieillard,
Piotr Stanczyk,
Sertan Girgin,
Nikola Momchev,
Matt Hoffman, et al. (173 additional authors not shown)
Abstract:
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
Submitted 2 October, 2024; v1 submitted 31 July, 2024;
originally announced August 2024.
-
Swift for TensorFlow: A portable, flexible platform for deep learning
Authors:
Brennan Saeta,
Denys Shabalin,
Marc Rasi,
Brad Larson,
Xihui Wu,
Parker Schuh,
Michelle Casbon,
Daniel Zheng,
Saleem Abdulrasool,
Aleksandr Efremov,
Dave Abrahams,
Chris Lattner,
Richard Wei
Abstract:
Swift for TensorFlow is a deep learning platform that scales from mobile devices to clusters of hardware accelerators in data centers. It combines a language-integrated automatic differentiation system and multiple Tensor implementations within a modern ahead-of-time compiled language oriented around mutable value semantics. The resulting platform has been validated through use in over 30 deep learning models and has been employed across data center and mobile applications.
Submitted 25 February, 2021;
originally announced February 2021.
-
Quantifying Temperature-dependent Substrate Loss in GaN-on-Si RF Technology
Authors:
Hareesh Chandrasekar,
Michael J. Uren,
Michael A. Casbon,
Hassan Hirshy,
Abdalla Eblabla,
Khaled Elgaid,
James W. Pomeroy,
Paul J. Tasker,
Martin Kuball
Abstract:
Intrinsic limits to temperature-dependent substrate loss for GaN-on-Si technology, due to the change in resistivity of the substrate with temperature, are evaluated using an experimentally validated device simulation framework. The effect of room-temperature substrate resistivity on temperature-dependent CPW line loss at various operating frequency bands is then presented. CPW lines for GaN-on-high resistivity Si are shown to have a pronounced temperature dependence for temperatures above 150°C and to have lower substrate losses for frequencies above the X-band. On the other hand, GaN-on-low resistivity Si is shown to be more temperature-insensitive and to have lower substrate losses than even HR-Si for lower operating frequencies. The effect of various CPW geometries on substrate loss is also presented to generalize the discussion. These results are expected to act as a benchmark for temperature-dependent substrate loss in GaN-on-Si RF technology.
Submitted 28 January, 2019;
originally announced January 2019.