-
HIV Client Perspectives on Digital Health in Malawi
Authors:
Lisa Orii,
Caryl Feldacker,
Jacqueline Madalitso Huwa,
Agness Thawani,
Evelyn Viola,
Christine Kiruthu-Kamamia,
Odala Sande,
Hannock Tweya,
Richard Anderson
Abstract:
eHealth has strong potential to advance HIV care in low- and middle-income countries. Given the sensitivity of HIV-related information and the risks associated with unintended HIV status disclosure, clients' privacy perceptions towards eHealth applications should be examined to develop client-centered technologies. Through focus group discussions with antiretroviral therapy (ART) clients from Lighthouse Trust, Malawi's public HIV care program, we explored perceptions of data security and privacy, including their understanding of data flow and their concerns about data confidentiality across several layers of data use. Our findings highlight the broad privacy concerns that affect ART clients' day-to-day choices, clients' trust in Malawi's health system, and their acceptance of, and familiarity with, point-of-care technologies used in HIV care. Based on our findings, we provide recommendations for building robust digital health systems in low- and middle-income countries with limited resources, nascent privacy regulations, and political will to take action to protect client data.
Submitted 5 April, 2024;
originally announced April 2024.
-
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Authors:
Avijit Thawani,
Saurabh Ghanekar,
Xiaoyuan Zhu,
Jay Pujara
Abstract:
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the limitations of such a tokenization strategy, particularly for documents not written in English and for representing numbers. On the other extreme, byte/character-level language models are much less restricted but suffer from increased sequence description lengths and a subsequent quadratic expansion in self-attention computation. Recent attempts to compress and limit these context lengths with fixed-size convolutions are helpful but completely ignore the word boundary. This paper considers an alternative 'learn your tokens' scheme which utilizes the word boundary to pool bytes/characters into word representations, which are fed to the primary language model, before again decoding individual characters/bytes per word in parallel. We find that our moderately expressive and moderately fast end-to-end tokenizer outperforms both subword and byte/character models by over 300% on the intrinsic language modeling metric of next-word prediction across datasets. It particularly shines on rare words, outperforming by a factor of 30! We extensively study the language modeling setup for all three categories of tokenizers and theoretically analyze how our end-to-end models can also be a strong trade-off in efficiency and robustness.
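As a rough illustration of the word-boundary pooling idea described in this abstract, the sketch below collapses the characters of each whitespace-delimited word into a single vector, so the primary model would see one position per word rather than one per character. This is a minimal sketch under stated assumptions, not the paper's method: the function names `char_embedding` and `pool_words` are hypothetical, the character features are a deterministic toy stand-in for a learned embedding table, and mean pooling is swapped in for whatever learned pooling the authors use.

```python
from typing import List


def char_embedding(ch: str, dim: int = 4) -> List[float]:
    """Toy stand-in for a learned character embedding (assumption):
    uses the lowest `dim` bits of the character's code point."""
    code = ord(ch)
    return [float((code >> i) & 1) for i in range(dim)]


def pool_words(text: str, dim: int = 4) -> List[List[float]]:
    """Return one mean-pooled vector per whitespace-delimited word."""
    pooled = []
    for word in text.split():
        vectors = [char_embedding(ch, dim) for ch in word]
        # Average each feature column over the characters of the word.
        pooled.append([sum(col) / len(vectors) for col in zip(*vectors)])
    return pooled


if __name__ == "__main__":
    text = "learn your tokens"
    word_vectors = pool_words(text)
    # 17 character positions shrink to 3 word positions.
    print(len(text), "characters ->", len(word_vectors), "word vectors")
```

The point of the sketch is only the change in sequence length: self-attention in the primary model would operate over word positions, while per-word character decoding (not shown here) would run separately.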
Submitted 17 October, 2023;
originally announced October 2023.
-
Estimating Numbers without Regression
Authors:
Avijit Thawani,
Jay Pujara,
Ashwin Kalyan
Abstract:
Despite recent successes in language models, their ability to represent numbers is insufficient. Humans conceptualize numbers based on their magnitudes, effectively projecting them on a number line, whereas subword tokenization fails to explicitly capture magnitude by splitting numbers into arbitrary chunks. To alleviate this shortcoming, alternative approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods change either (1) the notation in which numbers are written (e.g., scientific vs decimal), (2) the vocabulary used to represent numbers, or (3) the entire architecture of the underlying language model, to directly regress to a desired number.
Previous work suggests that architectural change helps achieve state-of-the-art results on number estimation, but we find an insightful ablation: changing the model's vocabulary instead (e.g., introducing a new token for numbers in the range 10-100) is a far better trade-off. In the context of masked number prediction, a carefully designed tokenization scheme is both the simplest to implement and sufficient, i.e., with similar performance to the state-of-the-art approach that requires making significant architectural changes. Finally, we report similar trends on the downstream task of numerical fact estimation (for Fermi Problems) and discuss reasons behind our findings.
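The vocabulary change mentioned in this abstract (a dedicated token for numbers in a range such as 10-100) can be sketched as a preprocessing step that replaces each number with a token for its order of magnitude. This is a minimal sketch under assumptions: the token names (`[NUM_E1]`, `[NUM_LT1]`), the helper names, and the choice of one bucket per power of ten are hypothetical and are not taken from the paper.

```python
import re

# Matches unsigned integers and simple decimals such as "42" or "3.14".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def magnitude_token(value: float) -> str:
    """Map a non-negative number to a hypothetical magnitude-bucket token.

    Values in [10, 100) share the token [NUM_E1], values in [100, 1000)
    share [NUM_E2], and so on.
    """
    if value < 1:
        return "[NUM_LT1]"
    # Count integer digits instead of calling log10 to avoid float rounding.
    exponent = len(str(int(value))) - 1
    return f"[NUM_E{exponent}]"


def tokenize_numbers(text: str) -> str:
    """Replace every number in the text with its magnitude token."""
    return NUMBER_RE.sub(lambda m: magnitude_token(float(m.group())), text)


if __name__ == "__main__":
    print(tokenize_numbers("It weighs 42 kg and costs 1500 dollars"))
```

With such a vocabulary, masked number prediction reduces to predicting one of a small set of bucket tokens, which is why no architectural change is needed in this kind of setup.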
Submitted 9 October, 2023;
originally announced October 2023.
-
Representing Numbers in NLP: a Survey and a Vision
Authors:
Avijit Thawani,
Jay Pujara,
Pedro A. Szekely,
Filip Ilievski
Abstract:
NLP systems rarely give special consideration to numbers found in text. This starkly contrasts with the consensus in neuroscience that, in the brain, numbers are represented differently from words. We arrange recent NLP work on numeracy into a comprehensive taxonomy of tasks and methods. We break down the subjective notion of numeracy into 7 subtasks, arranged along two dimensions: granularity (exact vs approximate) and units (abstract vs grounded). We analyze the myriad representational choices made by 18 previously published number encoders and decoders. We synthesize best practices for representing numbers in text and articulate a vision for holistic numeracy in NLP, comprised of design trade-offs and a unified evaluation.
Submitted 24 March, 2021;
originally announced March 2021.
-
Data-Driven Discovery of Molecular Photoswitches with Multioutput Gaussian Processes
Authors:
Ryan-Rhys Griffiths,
Jake L. Greenfield,
Aditya R. Thawani,
Arian R. Jamasb,
Henry B. Moss,
Anthony Bourached,
Penelope Jones,
William McCorkindale,
Alexander A. Aldrick,
Matthew J. Fuchter,
Alpha A. Lee
Abstract:
Photoswitchable molecules display two or more isomeric forms that may be accessed using light. Separating the electronic absorption bands of these isomers is key to selectively addressing a specific isomer and achieving high photostationary states, whilst overall red-shifting of the absorption bands serves to limit material damage due to UV exposure and increases penetration depth in photopharmacological applications. Engineering these properties into a system through synthetic design, however, remains a challenge. Here, we present a data-driven discovery pipeline for molecular photoswitches underpinned by dataset curation and multitask learning with Gaussian processes. In the prediction of electronic transition wavelengths, we demonstrate that a multioutput Gaussian process (MOGP) trained using labels from four photoswitch transition wavelengths yields the strongest predictive performance relative to single-task models, as well as operationally outperforming time-dependent density functional theory (TD-DFT) in terms of the wall-clock time for prediction. We validate our proposed approach experimentally by screening a library of commercially available photoswitchable molecules. Through this screen, we identified several motifs that displayed separated electronic absorption bands of their isomers, exhibited red-shifted absorptions, and are suited for information transfer and photopharmacological applications. Our curated dataset, code, and all models are made available at https://github.com/Ryan-Rhys/The-Photoswitch-Dataset
Submitted 7 August, 2022; v1 submitted 28 June, 2020;
originally announced August 2020.