No Dataset Needed for Downstream Knowledge Benchmarking: Response Dispersion Inversely Correlates with Accuracy on Domain-specific QA
Authors:
Robert L Simione II
Abstract:
This research seeks to obviate the need for creating QA datasets and grading (chatbot) LLM responses when comparing LLMs' knowledge in specific topic domains. The approach is entirely end-user centric and requires no access to the inner workings of the LLM, so long as it can be prompted and given a random seed to produce different generations for the same prompt. For a given topic domain, the paper defines the "response dispersion" of an LLM by repeatedly asking the LLM the same opinion question about that domain: the response dispersion is the number of singular values needed to explain 95% of the variance in the embedding matrix of the LLM's responses. Response dispersion is found to be inversely correlated with accuracy on relevant QA evaluations (average Spearman rank correlation stronger than -0.59). A use-case analysis shows that, when comparing two LLMs on the same topic domain, comparing their response dispersion is a suitable replacement for comparing their QA accuracy between 74% and 89% of the time; the range depends on the accuracy-difference tolerance an end-user is willing to accept in exchange for the labor saved by using response dispersion instead of QA accuracy. Two response embeddings are studied for building the embedding matrix: one from OpenAI's APIs and one novel embedding, here named reference sentence similarity embeddings, which can be computed locally and performs very nearly as well in calculating response dispersion. In addition, a pre-existing dataset called the IRC-Wiki Trivia dataset, originally developed for trivia games, has been repurposed and curated; the curated version, called IRC-WikiTriviaQA, is made available for the purpose of this research.
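The definition of response dispersion given in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `response_dispersion`, the column-centering step, and the use of squared singular values as "variance explained" are assumptions made here for illustration. The input is assumed to be a matrix with one row per LLM response and one column per embedding dimension.

```python
import numpy as np


def response_dispersion(embeddings, threshold=0.95):
    """Count the singular values needed to explain `threshold` of the variance.

    `embeddings` has one row per response. Centering the columns and measuring
    variance by squared singular values are assumptions of this sketch; the
    abstract does not spell out these details.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)    # assumption: center each dimension
    s = np.linalg.svd(X, compute_uv=False)   # singular values, largest first
    variance = s ** 2
    total = variance.sum()
    if np.isclose(total, 0.0):
        return 0                             # all responses embed identically
    cumulative = np.cumsum(variance) / total
    # Index of the first cumulative share that reaches the threshold, plus one.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(s))


# Hypothetical toy data: responses that vary along a single direction.
tight = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
print(response_dispersion(tight))
```

Under these assumptions, a lower count means the responses vary along fewer directions in embedding space, which the abstract associates with higher QA accuracy.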
Submitted 24 August, 2024;
originally announced August 2024.
Long-Term Progress and Behavior Complexification in Competitive Co-Evolution
Authors:
Luca Simione,
Stefano Nolfi
Abstract:
The use of competitive evolutionary algorithms to generate long-term progress is normally prevented by convergence on limit-cycle dynamics, in which the evolving agents keep progressing against their current competitors by periodically rediscovering previously adopted solutions. This leads to local but not global progress, where global progress means progress against all possible competitors. We propose a new competitive algorithm that produces long-term global progress by identifying and filtering out opportunistic variations, i.e. variations leading to progress against current competitors and retrogression against other competitors. The efficacy of the method is validated on the co-evolution of predator and prey robots, a classic problem used in related research. The accumulation of global progress over many generations leads to effective solutions that involve rather articulated behaviors. The complexity of the behavior displayed by the evolving robots increases across generations, although progress in performance is not always accompanied by behavior complexification.
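The notion of an opportunistic variation described above can be sketched as a simple acceptance test. This is only an illustration under stated assumptions, not the authors' algorithm: the helper names (`mean_score`, `is_opportunistic`, `select`), the use of mean scores, the way the set of "other" competitors is chosen, and the toy scoring function are all hypothetical.

```python
def mean_score(agent, opponents, score):
    """Average score of `agent` against a non-empty list of opponents."""
    return sum(score(agent, o) for o in opponents) / len(opponents)


def is_opportunistic(parent, child, current, others, score):
    """True if the child gains against current competitors but loses
    ground against other competitors, relative to its parent."""
    gains_on_current = mean_score(child, current, score) > mean_score(parent, current, score)
    loses_on_others = mean_score(child, others, score) < mean_score(parent, others, score)
    return gains_on_current and loses_on_others


def select(parent, child, current, others, score):
    """Keep the child only if it is not opportunistic and does at least
    as well as the parent against the current competitors."""
    if is_opportunistic(parent, child, current, others, score):
        return parent
    if mean_score(child, current, score) >= mean_score(parent, current, score):
        return child
    return parent


def toy_score(agent, opponent):
    """Hypothetical scoring: agents and opponents are numbers, closer is better."""
    return -abs(agent - opponent)


# The child 7 moves toward the current competitor 8 but away from the other competitor 2.
print(select(5, 7, [8], [2], toy_score))
```

In this toy setting the variation is rejected because its gain against the current competitor comes at the cost of a loss against the other competitor.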
Submitted 23 May, 2020; v1 submitted 18 September, 2019;
originally announced September 2019.