Showing 1–15 of 15 results for author: Kargaran, A H

  1. arXiv:2506.01074  [pdf, ps, other]

    cs.CL cs.PL cs.SE

    How Programming Concepts and Neurons Are Shared in Code Language Models

    Authors: Amir Hossein Kargaran, Yihong Liu, François Yvon, Hinrich Schütze

    Abstract: Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of int…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: ACL Findings 2025

  2. arXiv:2505.14824  [pdf, ps, other]

    cs.CL

    Tracing Multilingual Factual Knowledge Acquisition in Pretraining

    Authors: Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, Hinrich Schütze

    Abstract: Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLM…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: preprint

  3. arXiv:2502.17355  [pdf, other]

    cs.CL

    On Relation-Specific Neurons in Large Language Models

    Authors: Yihong Liu, Runsheng Chen, Lea Hirlimann, Ahmad Dawar Hakimi, Mingyang Wang, Amir Hossein Kargaran, Sascha Rothe, François Yvon, Hinrich Schütze

    Abstract: In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself -- independent of any entity. We hypothesize such neurons detect a relation in the input text and guide generation involving such a relatio…

    Submitted 24 February, 2025; originally announced February 2025.

    Comments: preprint

  4. arXiv:2410.23825  [pdf, other]

    cs.CL cs.AI

    GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

    Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze

    Abstract: The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipel…

    Submitted 3 March, 2025; v1 submitted 31 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2024

  5. arXiv:2410.05873  [pdf, ps, other]

    cs.CL cs.AI

    MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

    Authors: Amir Hossein Kargaran, Ali Modarressi, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze

    Abstract: English-centric large language models (LLMs) often show strong multilingual capabilities. However, their multilingual performance remains unclear and is under-evaluated for many other languages. Most benchmarks for multilinguality focus on classic NLP tasks or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric…

    Submitted 1 June, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: ACL Findings 2025

  6. arXiv:2409.17326  [pdf, other]

    cs.CL

    How Transliterations Improve Crosslingual Alignment

    Authors: Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Ayyoob Imani, Orgest Xhelili, Haotian Ye, Chunlan Ma, François Yvon, Hinrich Schütze

    Abstract: Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliter…

    Submitted 15 December, 2024; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: COLING 2025

  7. arXiv:2406.06263  [pdf, other]

    cs.CL

    MaskLID: Code-Switching Language Identification through Iterative Masking

    Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze

    Abstract: We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: ACL 2024

  8. GIRT-Model: Automated Generation of Issue Report Templates

    Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Abbas Heydarnoori

    Abstract: Platforms such as GitHub and GitLab introduce Issue Report Templates (IRTs) to enable more effective issue management and better alignment with developer expectations. However, these templates are not widely adopted in most repositories, and there is currently no tool available to aid developers in generating them. In this work, we introduce GIRT-Model, an assistant language model that automatical…

    Submitted 8 February, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

    Comments: Accepted to be published at the 21st IEEE/ACM International Conference on Mining Software Repositories (MSR 2024)

  9. GlotLID: Language Identification for Low-Resource Languages

    Authors: Amir Hossein Kargaran, Ayyoob Imani, François Yvon, Hinrich Schütze

    Abstract: Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide covera…

    Submitted 2 July, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  10. arXiv:2309.13320  [pdf, other]

    cs.CL

    GlotScript: A Resource and Tool for Low Resource Writing System Identification

    Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze

    Abstract: We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it ret…

    Submitted 27 March, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

    Comments: LREC-COLING 2024

  11. Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

    Authors: Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze

    Abstract: The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages an…

    Submitted 26 May, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  12. arXiv:2303.09236  [pdf, other]

    cs.SE

    GIRT-Data: Sampling GitHub Issue Report Templates

    Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Abbas Heydarnoori, Hinrich Schütze

    Abstract: GitHub's issue reports provide developers with valuable information that is essential to the evolution of a software development project. Contributors can use these reports to perform software engineering tasks like submitting bugs, requesting features, and collaborating on ideas. In the initial versions of issue reports, there was no standard way of using them. As a result, the quality of issue r…

    Submitted 21 March, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

    Comments: Accepted to be published at the 20th IEEE/ACM International Conference on Mining Software Repositories (MSR 2023)

  13. arXiv:2303.04496  [pdf, other]

    cs.CL cs.AI cs.HC

    MenuCraft: Interactive Menu System Design with Large Language Models

    Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Abbas Heydarnoori, Hinrich Schütze

    Abstract: Menu system design for user interfaces is a challenging task involving many design options and various human factors. For example, one crucial factor that designers need to consider is the semantic and systematic relation of menu commands. However, capturing these relations can be challenging due to limited available resources. Large language models can be helpful in this regard, using their pre-t…

    Submitted 6 July, 2024; v1 submitted 8 March, 2023; originally announced March 2023.

  14. arXiv:2004.14826  [pdf, other]

    cs.CR cs.CY cs.LG stat.ML

    Wide-AdGraph: Detecting Ad Trackers with a Wide Dependency Chain Graph

    Authors: Amir Hossein Kargaran, Mohammad Sadegh Akhondzadeh, Mohammad Reza Heidarpour, Mohammad Hossein Manshaei, Kave Salamatian, Masoud Nejad Sattary

    Abstract: Websites use third-party ads and tracking services to deliver targeted ads and collect information about users that visit them. These services put users' privacy at risk, and that is why users' demand for blocking these services is growing. Most of the blocking solutions rely on crowd-sourced filter lists manually maintained by a large community of users. In this work, we seek to simplify the upda…

    Submitted 10 May, 2021; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: 9 pages, 7 figures, To appear in the 13th ACM Web Science Conference 2021 (WebSci '21), June 2021

  15. Analytical Derivation and Comparison of Alarm Similarity Measures

    Authors: Amir Hossein Kargaran, Amir Neshastegaran, Iman Izadi, Ehsan Yazdian

    Abstract: An industrial process includes many devices, variables, and sub-processes that are physically or electronically interconnected. These interconnections imply some level of correlation between different process variables. Since most of the alarms in a process plant are defined on process variables, alarms are also correlated. However, this can be a nuisance to operators, for one fault might trigger…

    Submitted 3 October, 2021; v1 submitted 23 March, 2020; originally announced March 2020.

    Comments: 6 pages, 6 figures. This work has been accepted to the 16th IFAC Symposium on Advanced Control of Chemical Processes as an open access article under the CC-BY-NC-ND license.