Post-edits Are Preferences Too

Authors: Nathaniel Berger, Miriam Exel, Matthias Huck, Stefan Riezler

Abstract: Preference Optimization (PO) techniques are currently among the state-of-the-art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, $s_1 > s_2$; while for post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.

Submitted 21 February, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

Comments: To appear at the Ninth Conference on Machine Translation (WMT24)

arXiv:2406.02267

Prompting Large Language Models with Human Error Markings for Self-Correcting Machine Translation

Authors: Nathaniel Berger, Stefan Riezler, Miriam Exel, Matthias Huck

Abstract: While large language models (LLMs) pre-trained on massive amounts of unpaired language data have reached the state-of-the-art in machine translation (MT) of general domain texts, post-editing (PE) is still required to correct errors and to enhance term translation quality in specialized domains. In this paper we present a pilot study of enhancing translation memories (TM) produced by PE (source segments, machine translations, and reference translations, henceforth called PE-TM) for the needs of correct and consistent term translation in technical domains. We investigate a light-weight two-step scenario where, at inference time, a human translator marks errors in the first translation step, and in a second step a few similar examples are extracted from the PE-TM to prompt an LLM. Our experiment shows that the additional effort of augmenting translations with human error markings guides the LLM to focus on a correction of the marked errors, yielding consistent improvements over automatic PE (APE) and MT from scratch.

Submitted 4 June, 2024; originally announced June 2024.

Comments: To appear at The 25th Annual Conference of the European Association for Machine Translation (EAMT 2024)

arXiv:2307.08416

Enhancing Supervised Learning with Contrastive Markings in Neural Machine Translation Training

Authors: Nathaniel Berger, Miriam Exel, Matthias Huck, Stefan Riezler

Abstract: Supervised learning in Neural Machine Translation (NMT) typically follows a teacher forcing paradigm where reference tokens constitute the conditioning context in the model's prediction, instead of its own previous predictions. In order to alleviate this lack of exploration in the space of translations, we present a simple extension of standard maximum likelihood estimation by a contrastive marking objective. The additional training signals are extracted automatically from reference translations by comparing the system hypothesis against the reference, and used for up/down-weighting correct/incorrect tokens. The proposed new training procedure requires one additional translation pass over the training set per epoch, and does not alter the standard inference setup. We show that training with contrastive markings yields improvements on top of supervised learning, and is especially useful when learning from postedits where contrastive markings indicate human error corrections to the original hypotheses. Code is publicly released.

Submitted 17 July, 2023; originally announced July 2023.

Comments: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, p. 69-78 Tampere, Finland, June 2023

arXiv:2206.11535

doi 10.1109/ISPDC55340.2022.00012

Online Event Selection for Mu3e using GPUs

Authors: Valentin Henkys, Bertil Schmidt, Niklaus Berger

Abstract: In the search for physics beyond the Standard Model the Mu3e experiment tries to observe the lepton flavor violating decay $μ^+ \rightarrow e^+ e^- e^+$. By observing the decay products of $1 \cdot 10^8\,μ$/s it aims to either observe the process, or set a new upper limit on its estimated branching ratio. The high muon rates result in high data rates of $80$ Gbps, dominated by data produced through background processes. We present the Online Event Selection, a three-step algorithm running on the graphics processing units (GPU) of the $12$ Mu3e filter farm computers. By using simple and fast geometric selection criteria, the algorithm first reduces the amount of possible event candidates to below $5\%$ of the initial set. These candidates are then used to reconstruct full particle tracks, correctly reconstructing over $97\%$ of signal tracks. Finally a possible decay vertex is reconstructed using simple geometric considerations instead of a full reconstruction, correctly identifying over $94\%$ of signal events. We also present a full implementation of the algorithm, fulfilling all performance requirements at the targeted muon rate and successfully reducing the data rate by a factor of $200$.

Submitted 23 June, 2022; originally announced June 2022.

Comments: 8 pages, to be published in IEEE ISPDC 2022 conference proceedings

arXiv:2110.12383

Automated Extraction of Sentencing Decisions from Court Cases in the Hebrew Language

Authors: Mohr Wenger, Tom Kalir, Noga Berger, Carmit Chalamish, Renana Keydar, Gabriel Stanovsky

Abstract: We present the task of Automated Punishment Extraction (APE) in sentencing decisions from criminal court cases in Hebrew. Addressing APE will enable the identification of sentencing patterns and constitute an important stepping stone for many follow up legal NLP applications in Hebrew, including the prediction of sentencing decisions. We curate a dataset of sexual assault sentencing decisions and a manually-annotated evaluation dataset, and implement rule-based and supervised models. We find that while supervised models can identify the sentence containing the punishment with good accuracy, rule-based approaches outperform them on the full APE task. We conclude by presenting a first analysis of sentencing patterns in our dataset and analyze common models' errors, indicating avenues for future work, such as distinguishing between probation and actual imprisonment punishment. We will make all our resources available upon request, including data, annotation, and first benchmark models.

Submitted 24 October, 2021; originally announced October 2021.

Comments: Accepted to the Natural Legal Language Processing workshop (NLLP 2021), colocated with EMNLP 2021

arXiv:2109.07926

Don't Search for a Search Method -- Simple Heuristics Suffice for Adversarial Text Attacks

Authors: Nathaniel Berger, Stefan Riezler, Artem Sokolov, Sebastian Ebert

Abstract: Recently more attention has been given to adversarial attacks on neural networks for natural language processing (NLP). A central research topic has been the investigation of search algorithms and search constraints, accompanied by benchmark algorithms and tasks. We implement an algorithm inspired by zeroth order optimization-based attacks and compare with the benchmark results in the TextAttack framework. Surprisingly, we find that optimization-based methods do not yield any improvement in a constrained setup and slightly benefit from approximate gradient information only in unconstrained setups where search spaces are larger. In contrast, simple heuristics exploiting nearest neighbors without querying the target function yield substantial success rates in constrained setups, and nearly full success rate in unconstrained setups, at an order of magnitude fewer queries. We conclude from these results that current TextAttack benchmark tasks are too easy and constraints are too strict, preventing meaningful research on black-box adversarial text attacks.

Submitted 4 October, 2021; v1 submitted 16 September, 2021; originally announced September 2021.

Comments: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP Main Conference)

arXiv:2006.01759

Sparse Perturbations for Improved Convergence in Stochastic Zeroth-Order Optimization

Authors: Mayumi Ohta, Nathaniel Berger, Artem Sokolov, Stefan Riezler

Abstract: Interest in stochastic zeroth-order (SZO) methods has recently been revived in black-box optimization scenarios such as adversarial black-box attacks to deep neural networks. SZO methods only require the ability to evaluate the objective function at random input points, however, their weakness is the dependency of their convergence speed on the dimensionality of the function to be evaluated. We present a sparse SZO optimization method that reduces this factor to the expected dimensionality of the random perturbation during learning. We give a proof that justifies this reduction for sparse SZO optimization for non-convex functions without making any assumptions on sparsity of objective function or gradient. Furthermore, we present experimental results for neural networks on MNIST and CIFAR that show faster convergence in training loss and test accuracy, and a smaller distance of the gradient approximation to the true gradient in sparse SZO compared to dense SZO.

Submitted 29 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

Comments: International Conference on Machine Learning, Optimization, and Data Science (LOD), Siena, Italy

Journal ref: LOD 2020

arXiv:2004.11222

Correct Me If You Can: Learning from Error Corrections and Markings

Authors: Julia Kreutzer, Nathaniel Berger, Stefan Riezler

Abstract: Sequence-to-sequence learning involves a trade-off between signal strength and annotation cost of training data. For example, machine translation data range from costly expert-generated translations that enable supervised learning, to weak quality-judgment feedback that facilitates reinforcement learning. We present the first user study on annotation cost and machine learnability for the less popular annotation mode of error markings. We show that error markings for translations of TED talks from English to German allow precise credit assignment while requiring significantly less human effort than correcting/post-editing, and that error-marked data can be used successfully to fine-tune neural machine translation models.

Submitted 23 April, 2020; originally announced April 2020.

Comments: To appear at EAMT 2020 (Research Track)

arXiv:1811.07767

Injecting and removing malignant features in mammography with CycleGAN: Investigation of an automated adversarial attack using neural networks

Authors: Anton S. Becker, Lukas Jendele, Ondrej Skopek, Nicole Berger, Soleen Ghafoor, Magda Marcon, Ender Konukoglu

Abstract: $\textbf{Purpose}$ To train a cycle-consistent generative adversarial network (CycleGAN) on mammographic data to inject or remove features of malignancy, and to determine whether these AI-mediated attacks can be detected by radiologists. $\textbf{Material and Methods}$ From the two publicly available datasets, BCDR and INbreast, we selected images from cancer patients and healthy controls. An internal dataset served as test data, withheld during training. We ran two experiments training CycleGAN on low and higher resolution images ($256 \times 256$ px and $512 \times 408$ px). Three radiologists read the images and rated the likelihood of malignancy on a scale from 1-5 and the likelihood of the image being manipulated. The readout was evaluated by ROC analysis (Area under the ROC curve = AUC). $\textbf{Results}$ At the lower resolution, only one radiologist exhibited markedly lower detection of cancer (AUC=0.85 vs 0.63, p=0.06), while the other two were unaffected (0.67 vs. 0.69 and 0.75 vs. 0.77, p=0.55). Only one radiologist could discriminate between original and modified images slightly better than guessing/chance (0.66, p=0.008). At the higher resolution, all radiologists showed significantly lower detection rate of cancer in the modified images (0.77-0.84 vs. 0.59-0.69, p=0.008), however, they were now able to reliably detect modified images due to better visibility of artifacts (0.92, 0.92 and 0.97). $\textbf{Conclusion}$ A CycleGAN can implicitly learn malignant features and inject or remove them so that a substantial proportion of small mammographic images would consequently be misdiagnosed. At higher resolutions, however, the method is currently limited and has a clear trade-off between manipulation of images and introduction of artifacts.

Submitted 19 November, 2018; originally announced November 2018.

Comments: To be presented at RSNA 2018

MSC Class: 68T45

arXiv:1108.5673

doi 10.1063/1.3647201

Partial wave analysis at BES III harnessing the power of GPUs

Authors: Niklaus Berger

Abstract: Partial wave analysis is a core tool in hadron spectroscopy. With the high statistics data available at facilities such as the Beijing Spectrometer III, this procedure becomes computationally very expensive. We have successfully implemented a framework for performing partial wave analysis on graphics processors. We discuss the implementation, the parallel computing frameworks employed and the performance achieved, with a focus on the recent transition to the OpenCL framework.

Submitted 29 August, 2011; originally announced August 2011.

Comments: 6 pages, 2 figures, prepared for the proceedings of Computing in High Energy Physics (CHEP) 2010

arXiv:cs/0701198

Fitting the WHOIS Internet data

Authors: R. M. D'Souza, C. Borgs, J. T. Chayes, N. Berger, R. D. Kleinberg

Abstract: We consider the RIPE WHOIS Internet data as characterized by the Cooperative Association for Internet Data Analysis (CAIDA), and show that the Tempered Preferential Attachment model [1] provides an excellent fit to this data. [1] D'Souza, Borgs, Chayes, Berger and Kleinberg, to appear PNAS USA, 2007.

Submitted 30 January, 2007; originally announced January 2007.

Comments: Supplemental information for "Emergence of Tempered Preferential Attachment From Optimization", to appear (open access) PNAS USA, 2007

arXiv:math/0611666

doi 10.1214/07-AIHP126

Anomalous heat-kernel decay for random walk among bounded random conductances

Authors: Noam Berger, Marek Biskup, Christopher E. Hoffman, Gady Kozma

Abstract: We consider the nearest-neighbor simple random walk on $\mathbb{Z}^d$, $d\ge2$, driven by a field of bounded random conductances $ω_{xy}\in[0,1]$. The conductance law is i.i.d. subject to the condition that the probability of $ω_{xy}>0$ exceeds the threshold for bond percolation on $\mathbb{Z}^d$. For environments in which the origin is connected to infinity by bonds with positive conductances, we study the decay of the $2n$-step return probability $P_ω^{2n}(0,0)$. We prove that $P_ω^{2n}(0,0)$ is bounded by a random constant times $n^{-d/2}$ in $d=2,3$, while it is $o(n^{-2})$ in $d\ge5$ and $O(n^{-2}\log n)$ in $d=4$. By producing examples with anomalous heat-kernel decay approaching $1/n^2$ we prove that the $o(n^{-2})$ bound in $d\ge5$ is the best possible. We also construct natural $n$-dependent environments that exhibit the extra $\log n$ factor in $d=4$. See also math.PR/0701248.

Submitted 26 June, 2007; v1 submitted 21 November, 2006; originally announced November 2006.

Comments: 22 pages. Includes a self-contained proof of isoperimetric inequality for supercritical percolation clusters. Version to appear in AIHP + additional corrections

MSC Class: 60G50; 58J35; 80A20

Journal ref: Ann. Inst. H. Poincare Probab. Statist. 274 (2008), no. 2, 374-392

arXiv:cond-mat/0502205

Degree Distribution of Competition-Induced Preferential Attachment Graphs

Authors: N. Berger, C. Borgs, J. T. Chayes, R. M. D'Souza, R. D. Kleinberg

Abstract: We introduce a family of one-dimensional geometric growth models, constructed iteratively by locally optimizing the tradeoffs between two competing metrics, and show that this family is equivalent to a family of preferential attachment random graph models with upper cutoffs. This is the first explanation of how preferential attachment can arise from a more basic underlying mechanism of local competition. We rigorously determine the degree distribution for the family of random graph models, showing that it obeys a power law up to a finite threshold and decays exponentially above this threshold. We also rigorously analyze a generalized version of our graph process, with two natural parameters, one corresponding to the cutoff and the other a "fertility" parameter. We prove that the general model has a power-law degree distribution up to a cutoff, and establish monotonicity of the power as a function of the two parameters. Limiting cases of the general model include the standard preferential attachment model without cutoff and the uniform attachment model.

Submitted 8 February, 2005; v1 submitted 8 February, 2005; originally announced February 2005.

Comments: 24 pages, one figure. To appear in the journal: Combinatorics, Probability and Computing. Note, this is a long version, with complete proofs, of the paper "Competition-Induced Preferential Attachment" (cond-mat/0402268)

arXiv:cond-mat/0402268

Competition-Induced Preferential Attachment

Authors: N. Berger, C. Borgs, J. T. Chayes, R. M. D'Souza, R. D. Kleinberg

Abstract: Models based on preferential attachment have had much success in reproducing the power law degree distributions which seem ubiquitous in both natural and engineered systems. Here, rather than assuming preferential attachment, we give an explanation of how it can arise from a more basic underlying mechanism of competition between opposing forces. We introduce a family of one-dimensional geometric growth models, constructed iteratively by locally optimizing the tradeoffs between two competing metrics. This family admits an equivalent description as a graph process with no reference to the underlying geometry. Moreover, the resulting graph process is shown to be preferential attachment with an upper cutoff. We rigorously determine the degree distribution for the family of random graph models, showing that it obeys a power law up to a finite threshold and decays exponentially above this threshold. We also introduce and rigorously analyze a generalized version of our graph process, with two natural parameters, one corresponding to the cutoff and the other a "fertility" parameter. Limiting cases of this process include the standard Barabasi-Albert preferential attachment model and the uniform attachment model. In the general case, we prove that the process has a power law degree distribution up to a cutoff, and establish monotonicity of the power as a function of the two parameters.

Submitted 10 February, 2004; originally announced February 2004.

Comments: Submitted to Intnl. Colloq. on Automata, Languages and Programming (ICALP 2004)

Journal ref: Proceedings of the 31st International Colloquium on Automata, Languages and Programming, 208-221 (2004).

Showing 1–14 of 14 results for author: Berger, N