-
The 2021 Tokyo Olympics Multilingual News Article Dataset
Authors:
Erik Novak,
Erik Calcina,
Dunja Mladenić,
Marko Grobelnik
Abstract:
In this paper, we introduce a dataset of multilingual news articles covering the 2021 Tokyo Olympics. A total of 10,940 news articles were gathered from 1,918 different publishers, covering 1,350 sub-events of the 2021 Olympics, and published between July 1, 2021, and August 14, 2021. These articles are written in nine languages from different language families and in different scripts. To create the dataset, the raw news articles were first retrieved via a service that collects and analyzes news articles. Then, the articles were grouped using an online clustering algorithm, with each group containing articles reporting on the same sub-event. Finally, the groups were manually annotated and evaluated. The development of this dataset aims to provide a resource for evaluating the performance of multilingual news clustering algorithms, for which limited datasets are available. It can also be used to analyze the dynamics and events of the 2021 Tokyo Olympics from different perspectives. The dataset is available in CSV format and can be accessed from the CLARIN.SI repository.
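The abstract does not say which online clustering algorithm was used, so the following is only a minimal sketch of the general idea: each incoming article joins the most similar existing group, or starts a new group if nothing is similar enough. The function names, the bag-of-words representation, and the `threshold` value are all illustrative assumptions. A real multilingual setup would need cross-lingual representations rather than raw token counts.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def online_cluster(texts, threshold=0.3):
    """Assign each text, in arrival order, to a cluster label.

    Hypothetical greedy scheme (not the authors' method): a text joins
    the cluster whose summed token counts are most similar, provided
    the similarity reaches `threshold`; otherwise it opens a new cluster.
    """
    centroids = []  # one summed Counter per cluster
    labels = []
    for text in texts:
        vec = Counter(text.lower().split())
        best, best_sim = -1, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
        else:
            centroids[best].update(vec)  # fold the text into the cluster
            labels.append(best)
    return labels


articles = [
    "gold medal swimming final",
    "swimming final gold medal won",
    "marathon race results",
    "marathon race results sapporo",
]
print(online_cluster(articles))
```

Because articles are processed one at a time and never reassigned, the result depends on arrival order, which is one reason a manual annotation and evaluation step like the one the abstract describes is useful.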
Submitted 13 February, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
How many continuous measurements are needed to learn a vector?
Authors:
David Krieg,
Erich Novak,
Mario Ullrich
Abstract:
One can recover vectors from $\mathbb{R}^m$ with arbitrary precision, using only $\lceil \log_2(m+1)\rceil +1$ continuous measurements that are chosen adaptively. This surprising result is explained and discussed, and we present applications to infinite-dimensional approximation problems.
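To get a feel for how slowly the bound $\lceil \log_2(m+1)\rceil + 1$ grows with the dimension $m$, the short sketch below evaluates it for a few values. The function name is an illustrative assumption; the code only evaluates the formula stated in the abstract and does not implement the recovery procedure itself.

```python
def measurements_needed(m: int) -> int:
    """Evaluate ceil(log2(m + 1)) + 1 using exact integer arithmetic."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    # For every integer m >= 1, ceil(log2(m + 1)) equals m.bit_length(),
    # which avoids floating-point rounding for large m.
    return m.bit_length() + 1


for m in (1, 3, 7, 1000, 10**6):
    print(m, measurements_needed(m))
```

For example, the formula gives 11 for $m = 1000$ and 21 for $m = 10^6$.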
Submitted 9 December, 2024;
originally announced December 2024.
-
On the power of adaption and randomization
Authors:
David Krieg,
Erich Novak,
Mario Ullrich
Abstract:
We present bounds between different widths of convex subsets of Banach spaces, including Gelfand and Bernstein widths. Using this, and some relations between widths and minimal errors, we obtain bounds on the maximal gain of adaptive and randomized algorithms over non-adaptive, deterministic ones for approximating linear operators on convex sets. Our results also apply to the approximation of embeddings into the space of bounded functions based on function evaluations, i.e., to sampling recovery in the uniform norm. We conclude with a list of open problems.
Submitted 11 June, 2024;
originally announced June 2024.
-
Optimal Algorithms for Numerical Integration: Recent Results and Open Problems
Authors:
Erich Novak
Abstract:
We present recent results on optimal algorithms for numerical integration and several open problems. The paper has six parts:
1. Introduction
2. Lower Bounds
3. Universality
4. General Domains
5. iid Information
6. Concluding Remarks
Submitted 13 July, 2023;
originally announced July 2023.
-
An AI-based Learning Companion Promoting Lifelong Learning Opportunities for All
Authors:
Maria Perez-Ortiz,
Erik Novak,
Sahan Bulathwela,
John Shawe-Taylor
Abstract:
Artificial Intelligence (AI) in Education has great potential for building more personalised curricula, as well as democratising education worldwide and creating a Renaissance of new ways of teaching and learning. We believe this is a crucial moment for setting the foundations of AI in education at the beginning of this Fourth Industrial Revolution. This report aims to synthesize how AI might change (and is already changing) how we learn, as well as what technological features are crucial for these AI systems in education, with the end goal of starting this pressing dialogue of how the future of AI in education should unfold, engaging policy makers, engineers, researchers and, obviously, teachers and learners. This report also presents the advances within the X5GON project, a European H2020 project aimed at building and deploying a cross-modal, cross-lingual, cross-cultural, cross-domain and cross-site personalised learning platform for Open Educational Resources (OER).
Submitted 16 November, 2021;
originally announced December 2021.
-
PEEK: A Large Dataset of Learner Engagement with Educational Videos
Authors:
Sahan Bulathwela,
Maria Perez-Ortiz,
Erik Novak,
Emine Yilmaz,
John Shawe-Taylor
Abstract:
Educational recommenders have received much less attention in comparison to e-commerce and entertainment-related recommenders, even though efficient intelligent tutors have great potential to improve learning gains. One of the main challenges in advancing this research direction is the scarcity of large, publicly available datasets. In this work, we release a large, novel dataset of learners engaging with educational videos in the wild. The dataset, named Personalised Educational Engagement with Knowledge Topics (PEEK), is the first publicly available dataset of this nature. The video lectures have been associated with Wikipedia concepts related to the material of the lecture, thus providing a humanly intuitive taxonomy. We believe that granular learner engagement signals in unison with rich content representations will pave the way to building powerful personalization algorithms that will revolutionise educational and informational recommendation systems. Towards this goal, we 1) construct a novel dataset from a popular video lecture repository, 2) identify a set of benchmark algorithms to model engagement, and 3) run extensive experimentation on the PEEK dataset to demonstrate its value. Our experiments with the dataset show promise in building powerful informational recommender systems. The dataset and the support code are available publicly.
Submitted 13 September, 2021; v1 submitted 3 September, 2021;
originally announced September 2021.
-
Algorithms and Complexity for Functions on General Domains
Authors:
Erich Novak
Abstract:
Error bounds and complexity bounds in numerical analysis and information-based complexity are often proved for functions that are defined on very simple domains, such as a cube, a torus, or a sphere. We study optimal error bounds for the approximation or integration of functions defined on $D_d \subset \mathbb{R}^d$ and only assume that $D_d$ is a bounded Lipschitz domain. Some results are even more general. We study three different concepts to measure the complexity: order of convergence, asymptotic constant, and explicit uniform bounds, i.e., bounds that hold for all $n$ (number of pieces of information) and all (normalized) domains. It is known for many problems that the order of convergence of optimal algorithms does not depend on the domain $D_d \subset \mathbb{R}^d$. We present examples for which the following statements are true:
1) The asymptotic constant also does not depend on the shape of $D_d$ or the imposed boundary values; it depends only on the volume of the domain.
2) There are explicit and uniform lower (or upper, respectively) bounds for the error that are only slightly smaller (or larger, respectively) than the asymptotic error bound.
Submitted 13 January, 2020; v1 submitted 16 August, 2019;
originally announced August 2019.
-
The weight distribution of the self-dual $[128,64]$ polarity design code
Authors:
Masaaki Harada,
Ethan Novak,
Vladimir D. Tonchev
Abstract:
The weight distribution of the binary self-dual $[128,64]$ code being the extended code $C^{*}$ of the code $C$ spanned by the incidence vectors of the blocks of the polarity design in $PG(6,2)$ [11] is computed. It is shown also that $R(3,7)$ and $C^{*}$ have no self-dual $[128,64,d]$ neighbor with $d \in \{ 20, 24 \}$.
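As a small illustration of what a weight distribution is, the sketch below enumerates all codewords of a much smaller binary self-dual code, the extended Hamming $[8,4]$ code, and counts them by Hamming weight. This toy code and the generator matrix chosen for it are stand-ins picked for illustration; they are not taken from the paper, and brute-force enumeration of this kind is not feasible for a $[128,64]$ code, which has $2^{64}$ codewords.

```python
from collections import Counter
from itertools import product

# Rows of a generator matrix [I | J - I] for the extended Hamming [8,4]
# code, stored as 8-bit integers (an illustrative choice of basis).
G = [0b10000111, 0b01001011, 0b00101101, 0b00011110]


def weight_distribution(generators):
    """Count codewords of the binary span of `generators` by Hamming weight."""
    dist = Counter()
    for coeffs in product((0, 1), repeat=len(generators)):
        word = 0
        for c, row in zip(coeffs, generators):
            if c:
                word ^= row  # addition over GF(2) is bitwise XOR
        dist[bin(word).count("1")] += 1
    return dict(sorted(dist.items()))


print(weight_distribution(G))  # {0: 1, 4: 14, 8: 1}
```

The output lists, for each weight, how many of the 16 codewords have that weight: the zero word, 14 words of weight 4, and the all-ones word.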
Submitted 15 February, 2016;
originally announced February 2016.