-
Gemma 3 Technical Report
Authors:
Gemma Team,
Aishwarya Kamath,
Johan Ferret,
Shreya Pathak,
Nino Vieillard,
Ramona Merhej,
Sarah Perrin,
Tatiana Matejovicova,
Alexandre Ramé,
Morgane Rivière,
Louis Rouillard,
Thomas Mesnard,
Geoffrey Cideron,
Jean-bastien Grill,
Sabela Ramos,
Edouard Yvinec,
Michelle Casbon,
Etienne Pot,
Ivo Penchev,
Gaël Liu,
Francesco Visin,
Kathleen Kenealy,
Lucas Beyer,
Xiaohai Zhai,
Anton Tsitsulin, et al. (191 additional authors not shown)
Abstract:
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
Submitted 25 March, 2025;
originally announced March 2025.
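The KV-cache argument in the Gemma 3 abstract can be illustrated with a rough memory estimate. This is a minimal sketch, not the report's own accounting: the function name `kv_cache_bytes` and every number used (layer count, head sizes, window length, local-to-global ratio) are hypothetical values chosen for illustration, none of them taken from the abstract. It assumes a global layer caches keys and values for the whole context, while a local layer caches at most one sliding window.

```python
def kv_cache_bytes(context_len, n_layers, local_per_global, window,
                   n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV-cache size for a stack mixing local and global attention layers.

    All parameters are illustrative assumptions, not values from the report.
    local_per_global = 0 models a stack made only of global layers.
    """
    # Keys plus values stored for one token in one layer.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    # A global layer caches the full context; a local layer caches one window.
    cached_tokens = n_global * context_len + n_local * min(context_len, window)
    return per_token * cached_tokens


context = 128 * 1024  # 128K tokens, the context length named in the abstract
all_global = kv_cache_bytes(context, n_layers=12, local_per_global=0,
                            window=1024, n_kv_heads=4, head_dim=64)
mostly_local = kv_cache_bytes(context, n_layers=12, local_per_global=5,
                              window=1024, n_kv_heads=4, head_dim=64)
print(all_global, mostly_local, all_global / mostly_local)
```

Under these assumed settings the cache of the mostly-local stack is dominated by its few global layers, which is the effect the abstract attributes to raising the local-to-global ratio and keeping the local span short.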
-
Kernel Banzhaf: A Fast and Robust Estimator for Banzhaf Values
Authors:
Yurong Liu,
R. Teal Witter,
Flip Korn,
Tarfah Alrashed,
Dimitris Paparas,
Christopher Musco,
Juliana Freire
Abstract:
Banzhaf values provide a popular, interpretable alternative to the widely-used Shapley values for quantifying the importance of features in machine learning models. Like Shapley values, computing Banzhaf values exactly requires time exponential in the number of features, necessitating the use of efficient estimators. Existing estimators, however, are limited to Monte Carlo sampling methods. In this work, we introduce Kernel Banzhaf, the first regression-based estimator for Banzhaf values. Our approach leverages a novel regression formulation, whose exact solution corresponds to the exact Banzhaf values. Inspired by the success of Kernel SHAP for Shapley values, Kernel Banzhaf efficiently solves a sampled instance of this regression problem. Through empirical evaluations across eight datasets, we find that Kernel Banzhaf significantly outperforms existing Monte Carlo methods in terms of accuracy, sample efficiency, robustness to noise, and feature ranking recovery. Finally, we complement our experimental evaluation with strong theoretical guarantees on Kernel Banzhaf's performance.
Submitted 17 February, 2025; v1 submitted 10 October, 2024;
originally announced October 2024.
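To make the Kernel Banzhaf abstract concrete, the sketch below computes Banzhaf values exactly by enumeration (exponential in the number of features) and with a plain Monte Carlo estimator of the kind the abstract describes as the existing baseline. It does not implement Kernel Banzhaf itself. The function names and the three-player `toy_value` game are hypothetical, introduced only for illustration.

```python
import random
from itertools import combinations


def exact_banzhaf(n, value):
    """Average marginal contribution of each player over all subsets of the others."""
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):  # subsets of the n - 1 other players have sizes 0..n-1
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += value(s | {i}) - value(s)
        result.append(total / 2 ** (n - 1))
    return result


def monte_carlo_banzhaf(n, value, samples, seed=0):
    """Estimate Banzhaf values from uniformly sampled subsets."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(samples):
        s = frozenset(j for j in range(n) if rng.random() < 0.5)
        for i in range(n):
            # Marginal contribution of i with the other players fixed as in s.
            sums[i] += value(s | {i}) - value(s - {i})
    return [t / samples for t in sums]


def toy_value(s):
    """Hypothetical game: additive weights plus a bonus when players 0 and 1 cooperate."""
    weights = [1.0, 2.0, 3.0]
    bonus = 2.0 if {0, 1} <= s else 0.0
    return sum(weights[i] for i in s) + bonus


exact = exact_banzhaf(3, toy_value)
estimate = monte_carlo_banzhaf(3, toy_value, samples=4000, seed=1)
print(exact, estimate)
```

The exact routine evaluates the value function on every subset, which is why estimators are needed once the number of features grows.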
-
Scaling Laws for Downstream Task Performance in Machine Translation
Authors:
Berivan Isik,
Natalia Ponomareva,
Hussein Hazimeh,
Dimitris Paparas,
Sergei Vassilvitskii,
Sanmi Koyejo
Abstract:
Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by: downstream cross-entropy and translation quality metrics such as BLEU and COMET scores. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these, we provide new practical insights for choosing appropriate pretraining data.
Submitted 20 February, 2025; v1 submitted 6 February, 2024;
originally announced February 2024.
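The scaling-law abstract says translation quality can be predicted from pretraining data size with a log-law, but it does not state the functional form. The sketch below therefore assumes the simplest candidate, a score that is linear in the logarithm of the data size, and fits it by ordinary least squares. The form, the function names, and the data are all assumptions: the scores are synthetic, generated from the assumed law, and are not results from the paper.

```python
import math


def fit_log_law(sizes, scores):
    """Least-squares fit of score ~ a + b * ln(size) (an assumed functional form)."""
    xs = [math.log(d) for d in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, size):
    """Predicted score at a given pretraining data size under the fitted law."""
    return a + b * math.log(size)


# Synthetic, noise-free points generated from the assumed law (not measured data).
sizes = [1e6, 1e7, 1e8, 1e9]
scores = [predict(2.0, 3.0, d) for d in sizes]
a, b = fit_log_law(sizes, scores)
print(a, b, predict(a, b, 1e10))
```

A fit like this is only meaningful in the well-aligned regime the abstract describes, where scores improve monotonically with more pretraining data.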
-
Searching, Sorting, and Cake Cutting in Rounds
Authors:
Simina Brânzei,
Dimitris Paparas,
Nicholas Recker
Abstract:
We study searching and sorting in rounds motivated by a fair division question: given a cake cutting problem with $n$ players, compute a fair allocation in at most $k$ rounds of interaction with the players. Rounds interpolate between the simultaneous and the fully adaptive settings, also capturing parallel complexity. We find that proportional cake cutting in rounds is equivalent to sorting with rank queries in rounds. We design a protocol for proportional cake cutting in rounds, while lower bounds for sorting in rounds with rank queries were given by Alon and Azar. Inspired by the rank query model, we then consider two basic search problems: ordered and unordered search.
In unordered search, we get an array $\vec{x}=(x_1, \ldots, x_n)$ and an element $z$ promised to be in $\vec{x}$. We have access to an oracle that receives queries of the form "Is $z$ at location $i$?" and answers "Yes" or "No". The goal is to find the location of $z$ with success probability at least $p$ in at most $k$ rounds of interaction with the oracle.
We show the expected query complexity of randomized algorithms on a worst case input is $np\bigl(\frac{k+1}{2k}\bigr) \pm O(1)$, while that of deterministic algorithms on a worst case input distribution is $np \bigl(1 - \frac{k-1}{2k}p \bigr) \pm O(1)$. These bounds apply even to fully adaptive unordered search, where the ratio between the two complexities converges to $2-p$ as the size of the array grows.
In ordered search, we get a sorted array $\vec{x}=(x_1, \ldots, x_n)$ and an element $z$ promised to be in $\vec{x}$. We have access to an oracle that gets comparison queries. Here we find that the expected query complexity of randomized algorithms on a worst case input and deterministic algorithms on a worst case input distribution is essentially the same: $p k \cdot n^{\frac{1}{k}} \pm O(1+pk)$.
Submitted 19 November, 2023; v1 submitted 1 December, 2020;
originally announced December 2020.
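The ordered-search bound in the abstract above has a simple intuition in the deterministic case with success probability 1: in each round, split the remaining range into about $n^{1/k}$ blocks and ask all block-boundary comparisons at once. The sketch below is a textbook-style illustration of that idea, not the paper's protocol, and it does not cover the randomized or $p < 1$ settings. The function name is hypothetical and the array is assumed to hold distinct sorted values.

```python
import math


def ordered_search_in_rounds(xs, z, k):
    """Find the index of z in sorted xs using at most k rounds of batched comparisons.

    Returns (index, number of comparison queries). Assumes distinct values and z in xs.
    """
    lo, hi = 0, len(xs)  # invariant: z is in xs[lo:hi]
    queries = 0
    for rounds_left in range(k, 0, -1):
        size = hi - lo
        if size == 1:
            break
        # With r rounds left, use about size**(1/r) blocks; in the last round
        # this equals size, so every block holds a single element.
        blocks = min(size, math.ceil(size ** (1.0 / rounds_left)))
        cuts = [lo + (j * size) // blocks for j in range(1, blocks)]
        # All comparisons of a round are issued together (non-adaptively).
        answers = [xs[c] <= z for c in cuts]
        queries += len(cuts)
        for c, at_or_before in zip(cuts, answers):
            if at_or_before:
                lo = c
            else:
                hi = c
                break
    return lo, queries


xs = list(range(0, 2000, 2))  # n = 1000 distinct sorted values
for k in (1, 2, 3):
    worst = max(ordered_search_in_rounds(xs, z, k)[1] for z in xs)
    print(k, worst)
```

The worst-case query counts shrink quickly as rounds are added, in line with the $k \cdot n^{1/k}$ leading term for $p = 1$.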
-
On the Complexity of Simple and Optimal Deterministic Mechanisms for an Additive Buyer
Authors:
Xi Chen,
George Matikas,
Dimitris Paparas,
Mihalis Yannakakis
Abstract:
We show that the Revenue-Optimal Deterministic Mechanism Design problem for a single additive buyer is #P-hard, even when the distributions have support size 2 for each item and, more importantly, even when the optimal solution is guaranteed to be of a very simple kind: the seller picks a price for each individual item and a price for the grand bundle of all the items; the buyer can purchase either the grand bundle at its given price or any subset of items at their total individual prices. The following problems are also #P-hard, as immediate corollaries of the proof:
1. determining if individual item pricing is optimal for a given instance,
2. determining if grand bundle pricing is optimal, and
3. computing the optimal (deterministic) revenue.
On the positive side, we show that when the distributions are i.i.d. with support size 2, the optimal revenue obtainable by any mechanism, even a randomized one, can be achieved by a simple solution of the above kind (individual item pricing with a discounted price for the grand bundle) and furthermore, it can be computed in polynomial time. The problem can be solved in polynomial time too when the number of items is constant.
Submitted 14 July, 2017; v1 submitted 22 February, 2017;
originally announced February 2017.
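The "very simple kind" of mechanism described in the abstract above (a price per item plus a price for the grand bundle) can be evaluated by brute force on small instances. The sketch below is an illustration under stated assumptions, not the paper's construction: the function name is hypothetical, the buyer is additive and utility-maximizing, and ties are broken in the seller's favor.

```python
from itertools import product


def expected_revenue(dists, item_prices, bundle_price):
    """Expected revenue of item prices plus a grand-bundle price for an additive buyer.

    dists[i] is a list of (value, probability) pairs for item i; items are independent.
    Assumption: the buyer maximizes utility and breaks ties toward the higher payment.
    """
    total = 0.0
    for profile in product(*dists):
        prob = 1.0
        for _, p in profile:
            prob *= p
        values = [v for v, _ in profile]
        # Option A: buy every item whose value covers its individual price.
        bought = [i for i, v in enumerate(values) if v >= item_prices[i]]
        items_utility = sum(values[i] - item_prices[i] for i in bought)
        items_payment = sum(item_prices[i] for i in bought)
        # Option B: buy the grand bundle at its own price.
        bundle_utility = sum(values) - bundle_price
        # Higher utility wins; equal utility goes to the higher payment.
        if (bundle_utility, bundle_price) > (items_utility, items_payment):
            payment = bundle_price
        else:
            payment = items_payment
        total += prob * payment
    return total


# Two i.i.d. items with support size 2 (hypothetical values and probabilities).
dists = [[(1, 0.5), (2, 0.5)], [(1, 0.5), (2, 0.5)]]
with_discount = expected_revenue(dists, item_prices=[2, 2], bundle_price=3)
items_only = expected_revenue(dists, item_prices=[2, 2], bundle_price=4)
print(with_discount, items_only)
```

In this toy instance the discounted bundle raises expected revenue over item pricing alone. Evaluating one mechanism this way takes time exponential in the number of items, and the sketch says nothing about how to find the optimal prices, which is the problem the abstract shows to be #P-hard.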
-
The Complexity of Optimal Multidimensional Pricing
Authors:
Xi Chen,
Ilias Diakonikolas,
Dimitris Paparas,
Xiaorui Sun,
Mihalis Yannakakis
Abstract:
We resolve the complexity of revenue-optimal deterministic auctions in the unit-demand single-buyer Bayesian setting, i.e., the optimal item pricing problem, when the buyer's values for the items are independent. We show that the problem of computing a revenue-optimal pricing can be solved in polynomial time for distributions of support size 2, and its decision version is NP-complete for distributions of support size 3. We also show that the problem remains NP-complete for the case of identical distributions.
Submitted 9 November, 2013;
originally announced November 2013.
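For the unit-demand setting in the abstract above, the objective being optimized can be written down directly: the buyer purchases at most one item, the one with the highest non-negative utility. The sketch below only evaluates the expected revenue of a given price vector by enumeration and makes no attempt to find optimal prices. The function name and the tie-breaking rule (toward the higher price) are assumptions made for illustration.

```python
from itertools import product


def unit_demand_revenue(dists, prices):
    """Expected revenue of an item pricing for a single unit-demand buyer.

    dists[i] is a list of (value, probability) pairs for item i; items are independent.
    Assumption: ties in utility are broken toward the higher-priced item.
    """
    total = 0.0
    for profile in product(*dists):
        prob = 1.0
        for _, p in profile:
            prob *= p
        # Best (utility, price) pair over the items for this value profile.
        utility, price = max(
            (v - prices[i], prices[i]) for i, (v, _) in enumerate(profile)
        )
        if utility >= 0:  # the buyer walks away if every item has negative utility
            total += prob * price
    return total


# Two identical items with support size 2 (hypothetical values and probabilities).
dists = [[(1, 0.5), (3, 0.5)], [(1, 0.5), (3, 0.5)]]
high = unit_demand_revenue(dists, prices=[3, 3])
low = unit_demand_revenue(dists, prices=[1, 1])
print(high, low)
```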
-
The Complexity of Non-Monotone Markets
Authors:
Xi Chen,
Dimitris Paparas,
Mihalis Yannakakis
Abstract:
We introduce the notion of non-monotone utilities, which covers a wide variety of utility functions in economic theory. We then prove that it is PPAD-hard to compute an approximate Arrow-Debreu market equilibrium in markets with linear and non-monotone utilities. Building on this result, we settle the long-standing open problem regarding the computation of an approximate Arrow-Debreu market equilibrium in markets with CES utility functions, by proving that it is PPAD-complete when the Constant Elasticity of Substitution parameter ρ is any constant less than -1.
Submitted 20 November, 2012;
originally announced November 2012.
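For reference, the standard textbook form of a CES utility over $m$ goods is recalled below. The weights $\alpha_j$ and the bundle $x$ are notation introduced here, and the paper's own normalization may differ.

```latex
u(x_1, \ldots, x_m) = \Bigl( \sum_{j=1}^{m} \alpha_j \, x_j^{\rho} \Bigr)^{1/\rho},
\qquad \alpha_j \ge 0, \quad \rho < 1, \ \rho \ne 0 .
```

In this parameterization, $\rho \to 1$ approaches linear utilities (perfect substitutes) and $\rho \to -\infty$ approaches Leontief utilities (perfect complements). The hardness result in the abstract concerns constants $\rho < -1$.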