-
Approximating Language Model Training Data from Weights
Authors:
John X. Morris,
Junjie Oscar Yin,
Woojeong Kim,
Vitaly Shmatikov,
Alexander M. Rush
Abstract:
Modern language models often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model to close to the original model's performance, for models trained with both classification and supervised finetuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model's perplexity of 2.0.
Submitted 18 June, 2025;
originally announced June 2025.
-
Compute-Constrained Data Selection
Authors:
Junjie Oscar Yin,
Alexander M. Rush
Abstract:
Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which the costs of both selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly, we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate from both a theoretical and an empirical perspective. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
Submitted 7 April, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
Information criteria for efficient quantum state estimation
Authors:
J. O. S. Yin,
S. J. van Enk
Abstract:
Recently, several more efficient versions of quantum state tomography have been proposed, with the purpose of making tomography feasible even for many-qubit states. The number of state parameters to be estimated is reduced by tentatively introducing certain simplifying assumptions on the form of the quantum state, and subsequently using the data to rigorously verify these assumptions. The simplifying assumptions considered so far were (i) the state can be well approximated to be of low rank, or (ii) the state can be well approximated as a matrix product state. We add one more method in that same spirit: we allow in principle any model for the state, using any (small) number of parameters (which can, e.g., be chosen to have a clear physical meaning), and the data are used to verify the model. The proof that this method is valid cannot be as strict as in the above-mentioned cases, but is based on well-established statistical methods that go under the name of "information criteria." We exploit here, in particular, the Akaike Information Criterion (AIC). We illustrate the method by simulating experiments on (noisy) Dicke states.
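For readers unfamiliar with the Akaike Information Criterion mentioned in this abstract, the following is a minimal sketch of how AIC compares candidate models, using the standard definition AIC = 2k - 2 ln L (k fitted parameters, L the maximized likelihood). The coin-flip data and the two candidate models are hypothetical illustrations and are not taken from the paper, which applies the criterion to quantum state models.

```python
import math


def aic(log_likelihood: float, num_params: int) -> float:
    """Standard Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bernoulli_log_likelihood(successes: int, trials: int, p: float) -> float:
    """Log-likelihood of the observed counts under success probability p.

    The binomial coefficient is omitted: it is identical for every model
    fitted to the same data, so it cancels in any AIC comparison.
    """
    failures = trials - successes
    return successes * math.log(p) + failures * math.log(1.0 - p)


# Hypothetical data: 60 successes in 100 trials.
successes, trials = 60, 100

# Model A (assumed for illustration): p fixed at 0.5, so no fitted parameters.
aic_fixed = aic(bernoulli_log_likelihood(successes, trials, 0.5), num_params=0)

# Model B (assumed for illustration): p fitted by maximum likelihood, one parameter.
p_hat = successes / trials
aic_fitted = aic(bernoulli_log_likelihood(successes, trials, p_hat), num_params=1)

# The model with the lower AIC is preferred.
preferred = "fitted" if aic_fitted < aic_fixed else "fixed"
print(f"AIC fixed={aic_fixed:.3f}, AIC fitted={aic_fitted:.3f}, preferred={preferred}")
```

The 2k term penalizes each additional parameter, so a richer model is preferred only when it raises the maximized log-likelihood by more than one unit per parameter added.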
Submitted 30 March, 2011; v1 submitted 16 March, 2011;
originally announced March 2011.
-
Criteria for reliable entanglement quantification with finite data
Authors:
Jun O. S. Yin,
Steven J. van Enk
Abstract:
We propose one and a half criteria for determining how many measurements are needed to quantify entanglement reliably. We base these criteria on Bayesian analysis of measurement results, and apply our methods to four-qubit entanglement, but generalizations to more qubits are straightforward.
Submitted 10 January, 2011; v1 submitted 11 November, 2010;
originally announced November 2010.
-
Entanglement verification with finite data
Authors:
Robin Blume-Kohout,
Jun O. S. Yin,
S. J. van Enk
Abstract:
Suppose an experimentalist wishes to verify that his apparatus produces entangled quantum states. A finite amount of data cannot conclusively demonstrate entanglement, so drawing conclusions from real-world data requires statistical reasoning. We propose a reliable method to quantify the weight of evidence for (or against) entanglement, based on a likelihood ratio test. Our method is universal in that it can be applied to any sort of measurements. We demonstrate the method by applying it to two simulated experiments on two qubits. The first measures a single entanglement witness, while the second performs a tomographically complete measurement.
Submitted 30 April, 2010;
originally announced May 2010.
-
Entanglement and purity of single- and two-photon states
Authors:
Jun O. S. Yin,
S. J. van Enk
Abstract:
Whereas single- and two-photon wave packets are usually treated as pure states, in practice they will be mixed. We study how entanglement created with mixed photon wave packets is degraded. We find in particular that the entanglement of a delocalized single-photon state of the electro-magnetic field is determined simply by its purity. We also discuss entanglement for two-photon mixed states, as well as the influence of a vacuum component.
Submitted 20 May, 2008; v1 submitted 6 March, 2008;
originally announced March 2008.