-
$φ^{\infty}$: Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models
Authors:
Bugra Kilictas,
Faruk Alpay
Abstract:
We identify a critical vulnerability in autoregressive transformer language models where the em dash token induces recursive semantic drift, leading to clause boundary hallucination and embedding space entanglement. Through formal analysis of token-level perturbations in semantic lattices, we demonstrate that em dash insertion fundamentally alters the model's latent representations, causing compounding errors in long-form generation. We propose a novel solution combining symbolic clause purification via the phi-infinity operator with targeted embedding matrix realignment. Our approach enables total suppression of problematic tokens without requiring model retraining, while preserving semantic coherence through fixed-point convergence guarantees. Experimental validation shows significant improvements in generation consistency and topic maintenance. This work establishes a general framework for identifying and mitigating token-level vulnerabilities in foundation models, with immediate implications for AI safety, model alignment, and robust deployment of large language models in production environments. The methodology extends beyond punctuation to address broader classes of recursive instabilities in neural text generation systems.
Submitted 22 June, 2025;
originally announced June 2025.
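
For orientation, the simplest way to suppress a token at decoding time without retraining is a logit mask. The sketch below is not the phi-infinity operator or the embedding realignment described in the abstract, whose details are not given here; it is a generic illustration of token suppression, and the names `suppress_tokens`, `softmax`, and the toy vocabulary are hypothetical.

```python
import math

EM_DASH = "\u2014"  # the em dash character, written as an escape


def suppress_tokens(logits, vocab, banned=(EM_DASH,)):
    """Return a copy of `logits` with -inf wherever the token text
    contains a banned string (hypothetical helper, not the paper's method)."""
    return [
        -math.inf if any(b in tok for b in banned) else score
        for tok, score in zip(vocab, logits)
    ]


def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary and scores, invented for illustration only.
vocab = ["the", " " + EM_DASH, ",", EM_DASH]
logits = [1.0, 3.0, 0.5, 2.0]
probs = softmax(suppress_tokens(logits, vocab))
```

After masking, every token containing the em dash has probability exactly zero and the remaining probabilities are renormalised over the rest of the vocabulary.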
-
Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks
Authors:
Bugra Kilictas,
Faruk Alpay
Abstract:
ISO 639:2023 unifies the ISO language-code family and introduces contextual metadata, but it lacks a machine-native mechanism for handling dialectal drift and creole mixtures. We propose a formalisation of recursive semantic anchoring, attaching to every language entity $χ$ a family of fixed-point operators $φ_{n,m}$ that model bounded semantic drift via the relation $φ_{n,m}(χ) = χ\oplus Δ(χ)$, where $Δ(χ)$ is a drift vector in a latent semantic manifold. The base anchor $φ_{0,0}$ recovers the canonical ISO 639:2023 identity, whereas $φ_{99,9}$ marks the maximal drift state that triggers a deterministic fallback. Using category theory, we treat the operators $φ_{n,m}$ as morphisms and drift vectors as arrows in a category $\mathrm{DriftLang}$. A functor $Φ: \mathrm{DriftLang} \to \mathrm{AnchorLang}$ maps every drifted object to its unique anchor and proves convergence. We provide an RDF/Turtle schema (\texttt{BaseLanguage}, \texttt{DriftedLanguage}, \texttt{ResolvedAnchor}) and worked examples -- e.g., $φ_{8,4}$ (Standard Mandarin) versus $φ_{8,7}$ (a colloquial variant), and $φ_{1,7}$ for Nigerian Pidgin anchored to English. Experiments with transformer models show higher accuracy in language identification and translation on noisy or code-switched input when the $φ$-indices are used to guide fallback routing. The framework is compatible with ISO/TC 37 and provides an AI-tractable, drift-aware semantic layer for future standards.
Submitted 7 June, 2025;
originally announced June 2025.
-
XiSort: Deterministic Sorting via IEEE-754 Total Ordering and Entropy Minimization
Authors:
Faruk Alpay
Abstract:
We introduce XiSort, a deterministic and reproducible sorting algorithm for floating-point sequences based on IEEE-754 total ordering and entropy minimization. XiSort guarantees bit-for-bit stability across runs and platforms by resolving tie-breaking via information-theoretic and symbolic methods. The algorithm supports both in-memory and external (out-of-core) operation, offering consistent performance on large datasets. We formalize a curved variant of the sorting metric that integrates into the Alpay Algebra framework, treating XiSort as a recursive operator with provable convergence and symbolic idempotence. This model preserves state-space closure while minimizing local disorder, interpretable as symbolic entropy. Empirical benchmarks demonstrate that XiSort achieves competitive throughput (e.g., sorting 10^8 doubles in approximately 12 seconds in-memory, and 100 GB at around 100 MB/s on SSDs), with applications in scientific computing, high-frequency finance, and reproducible numerical workflows. The results position XiSort as a principled tool for stable data alignment, symbolic preprocessing, and cross-platform float ordering.
Keywords: deterministic sorting, IEEE-754, entropy minimization, symbolic algebra, reproducibility, external memory, Alpay Algebra, data pipelines
Submitted 17 May, 2025;
originally announced May 2025.
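
As a rough illustration of the IEEE-754 total ordering that the XiSort abstract builds on, the sketch below maps each double to an integer key whose unsigned order matches the standard's totalOrder predicate. It does not reproduce XiSort's entropy-based tie-breaking or its external-memory mode, which the abstract does not specify; `total_order_key` and `total_order_sort` are hypothetical names.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a double to a 64-bit integer that sorts in IEEE-754 totalOrder.

    Negative values (sign bit set) have all bits inverted so that larger
    magnitudes sort first; non-negative values have the sign bit set so
    they sort after every negative value.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits >> 63:
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | 0x8000000000000000


def total_order_sort(values):
    """Sort floats deterministically, distinguishing -0.0 from +0.0."""
    return sorted(values, key=total_order_key)


ordered = total_order_sort([1.0, -0.0, 0.0, float("-inf"), -1.0])
```

Because the key is a plain integer derived from the bit pattern, the result does not depend on how the platform compares signed zeros or NaNs, which is the property a reproducible float sort needs.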
-
Stable and Convexified Information Bottleneck Optimization via Symbolic Continuation and Entropy-Regularized Trajectories
Authors:
Faruk Alpay
Abstract:
The Information Bottleneck (IB) method frequently suffers from unstable optimization, characterized by abrupt representation shifts near critical points of the IB trade-off parameter, beta. In this paper, I introduce a novel approach to achieve stable and convex IB optimization through symbolic continuation and entropy-regularized trajectories. I analytically prove convexity and uniqueness of the IB solution path when an entropy regularization term is included, and demonstrate how this stabilizes representation learning across a wide range of beta values. Additionally, I provide extensive sensitivity analyses around critical beta values, with statistically robust uncertainty quantification (95% confidence intervals). The open-source implementation, experimental results, and reproducibility framework included in this work offer a clear path for practical deployment and future extension of the proposed method.
Submitted 14 May, 2025;
originally announced May 2025.