-
Towards a Foundation Model for Communication Systems
Authors:
Davide Buffelli,
Sowmen Das,
Yu-Wei Lin,
Sattar Vakili,
Chien-Yi Wang,
Masoud Attarifar,
Pritthijit Nath,
Da-shan Shiu
Abstract:
Artificial Intelligence (AI) has demonstrated unprecedented performance across various domains, and its application to communication systems is an active area of research. While current methods focus on task-specific solutions, the broader trend in AI is shifting toward large general models capable of supporting multiple applications. In this work, we take a step toward a foundation model for communication data--a transformer-based, multi-modal model designed to operate directly on communication data. We propose methodologies to address key challenges, including tokenization, positional embedding, multimodality, variable feature sizes, and normalization. Furthermore, we empirically demonstrate that such a model can successfully estimate multiple features, including transmission rank, selected precoder, Doppler spread, and delay profile.
Submitted 20 May, 2025;
originally announced May 2025.
-
Improving Tropical Cyclone Forecasting With Video Diffusion Models
Authors:
Zhibo Ren,
Pritthijit Nath,
Pancham Shukla
Abstract:
Tropical cyclone (TC) forecasting is crucial for disaster preparedness and mitigation. While recent deep learning approaches have shown promise, existing methods often treat TC evolution as a series of independent frame-to-frame predictions, limiting their ability to capture long-term dynamics. We present a novel application of video diffusion models for TC forecasting that explicitly models temporal dependencies through additional temporal layers. Our approach enables the model to generate multiple frames simultaneously, better capturing cyclone evolution patterns. We introduce a two-stage training strategy that significantly improves individual-frame quality and performance in low-data regimes. Experimental results show our method outperforms the previous approach of Nath et al. by 19.3% in MAE, 16.2% in PSNR, and 36.1% in SSIM. Most notably, we extend the reliable forecasting horizon from 36 to 50 hours. Through comprehensive evaluation using both traditional metrics and Fréchet Video Distance (FVD), we demonstrate that our approach produces more temporally coherent forecasts while maintaining competitive single-frame quality. Code accessible at https://github.com/Ren-creater/forecast-video-diffmodels.
Submitted 12 May, 2025; v1 submitted 27 January, 2025;
originally announced January 2025.
-
Neuromorphic Retina: An FPGA-based Emulator
Authors:
Prince Philip,
Pallab Kumar Nath,
Kapil Jainwal,
Andre van Schaik,
Chetan Singh Thakur
Abstract:
Implementing accurate models of the retina is a challenging task, particularly in the context of creating visual prosthetics and devices. Although diverse artificial models of the retina exist, a more realistic model is still needed. In this work, we emulate a neuromorphic retina model on an FPGA. The key feature of this model is its powerful adaptation to luminance and contrast, which allows it to accurately emulate the sensitivity of the biological retina to changes in light levels. Phasic and tonic cells are realizable in the retina in the simplest way possible. Our FPGA implementation of the proposed biologically inspired digital retina, incorporating a receptive field with a center-surround structure, is reconfigurable and can support 128×128 pixel images at a frame rate of 200 fps. It consumes 1720 slices, approximately 3.7k Look-Up Tables (LUTs), and Flip-Flops (FFs) on the FPGA. This implementation provides a high-performance, low-power, and small-area solution and could be a significant step forward in the development of biologically plausible retinal prostheses with enhanced information processing capabilities.
Submitted 15 January, 2025;
originally announced January 2025.
-
Estimating Atmospheric Variables from Digital Typhoon Satellite Images via Conditional Denoising Diffusion Models
Authors:
Zhangyue Ling,
Pritthijit Nath,
César Quilodrán-Casas
Abstract:
This study explores the application of diffusion models to typhoons, predicting multiple ERA5 meteorological variables simultaneously from Digital Typhoon satellite images. The study focuses on Taiwan, an area highly vulnerable to typhoons. A comparison of the Conditional Denoising Diffusion Probability Model (CDDPM) with Convolutional Neural Networks (CNN) and Squeeze-and-Excitation Networks (SENet) suggests that the CDDPM performs best in generating accurate and realistic meteorological data. Specifically, CDDPM achieved a PSNR of 32.807, which is approximately 7.9% higher than CNN and 5.5% higher than SENet. Furthermore, CDDPM recorded an RMSE of 0.032, an 11.1% improvement over CNN and an 8.6% improvement over SENet. Key applications of this research include imputing missing meteorological data and generating additional high-quality meteorological data from satellite images. It is hoped that the results of this analysis will enable more robust and detailed forecasting, reducing the impact of severe weather events on vulnerable regions. Code accessible at https://github.com/TammyLing/Typhoon-forecasting.
Submitted 17 October, 2024; v1 submitted 12 September, 2024;
originally announced September 2024.
-
RAIN: Reinforcement Algorithms for Improving Numerical Weather and Climate Models
Authors:
Pritthijit Nath,
Henry Moss,
Emily Shuckburgh,
Mark Webb
Abstract:
This study explores integrating reinforcement learning (RL) with idealised climate models to address key parameterisation challenges in climate science. Current climate models rely on complex mathematical parameterisations to represent sub-grid scale processes, which can introduce substantial uncertainties. RL offers capabilities to enhance these parameterisation schemes, including direct interaction, handling sparse or delayed feedback, continuous online learning, and long-term optimisation. We evaluate the performance of eight RL algorithms on two idealised environments: one for temperature bias correction, another for radiative-convective equilibrium (RCE) imitating real-world computational constraints. Results show that different RL approaches excel in different climate scenarios, with exploration algorithms performing better in bias correction and exploitation algorithms proving more effective for RCE. These findings support the potential of RL-based parameterisation schemes to be integrated into global climate models, improving accuracy and efficiency in capturing complex climate dynamics. Overall, this work represents an important first step towards leveraging RL to enhance climate model accuracy, critical for improving climate understanding and predictions. Code accessible at https://github.com/p3jitnath/climate-rl.
Submitted 16 April, 2025; v1 submitted 28 August, 2024;
originally announced August 2024.
-
Bangladesh Agricultural Knowledge Graph: Enabling Semantic Integration and Data-driven Analysis--Full Version
Authors:
Rudra Pratap Deb Nath,
Tithi Rani Das,
Tonmoy Chandro Das,
S. M. Shafkat Raihan
Abstract:
In Bangladesh, agriculture is a crucial driver for addressing Sustainable Development Goals 1 (No Poverty) and 2 (Zero Hunger), playing a fundamental role in the economy and people's livelihoods. To enhance the sustainability and resilience of the agriculture industry through data-driven insights, the Bangladesh Bureau of Statistics and other organizations consistently collect and publish agricultural data on the Web. Nevertheless, the current datasets face several challenges: 1) they are presented in an unsustainable, static, read-only, and aggregated format, 2) they do not conform to the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles, and 3) they do not facilitate interactive analysis and integration with other data sources. In this paper, we present a thorough solution, delineating a systematic procedure for developing BDAKG: a knowledge graph that semantically and analytically integrates agriculture data in Bangladesh. BDAKG incorporates multidimensional semantics, is linked with external knowledge graphs, is compatible with OLAP, and adheres to the FAIR principles. Our experimental evaluation centers on evaluating the integration process and assessing the quality of the resultant knowledge graph in terms of completeness, timeliness, FAIRness, OLAP compatibility and data-driven analysis. Our federated data analysis recommends a strategic approach focused on decreasing CO$_2$ emissions, fostering economic growth, and promoting sustainable forestry.
Submitted 19 March, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
ADL4D: Towards A Contextually Rich Dataset for 4D Activities of Daily Living
Authors:
Marsil Zakour,
Partha Pratim Nath,
Ludwig Lohmer,
Emre Faik Gökçe,
Martin Piccolrovazzi,
Constantin Patsch,
Yuankai Wu,
Rahul Chaudhari,
Eckehard Steinbach
Abstract:
Hand-Object Interactions (HOIs) are conditioned on spatial and temporal contexts like surrounding objects, previous actions, and future intents (for example, grasping and handover actions vary greatly based on object proximity and trajectory obstruction). However, existing datasets for 4D HOI (3D HOI over time) are limited to one subject interacting with one object only. This restricts the generalization of learning-based HOI methods trained on those datasets. We introduce ADL4D, a dataset of up to two subjects interacting with different sets of objects while performing Activities of Daily Living (ADL) like breakfast or lunch preparation activities. The transition between multiple objects to complete a certain task over time introduces a unique context lacking in existing datasets. Our dataset consists of 75 sequences with a total of 1.1M RGB-D frames, hand and object poses, and per-hand fine-grained action annotations. We develop an automatic system for multi-view multi-hand 3D pose annotation capable of tracking hand poses over time. We integrate and test it against publicly available datasets. Finally, we evaluate our dataset on the tasks of Hand Mesh Recovery (HMR) and Hand Action Segmentation (HAS).
Submitted 27 February, 2024;
originally announced February 2024.
-
Forecasting Tropical Cyclones with Cascaded Diffusion Models
Authors:
Pritthijit Nath,
Pancham Shukla,
Shuai Wang,
César Quilodrán-Casas
Abstract:
As tropical cyclones become more intense due to climate change, the rise of AI-based modelling provides a more affordable and accessible approach compared to traditional methods based on mathematical models. This work leverages generative diffusion models to forecast cyclone trajectories and precipitation patterns by integrating satellite imaging, remote sensing, and atmospheric data. It employs a cascaded approach that incorporates three main tasks: forecasting, super-resolution, and precipitation modelling. The training dataset includes 51 cyclones from six major tropical cyclone basins from January 2019 to March 2023. Experiments demonstrate that the final forecasts from the cascaded models show accurate predictions up to a 36-hour rollout, with excellent Structural Similarity (SSIM) and Peak Signal-to-Noise Ratio (PSNR) values exceeding 0.5 and 20 dB, respectively, for all three tasks. The 36-hour forecasts can be produced in as little as 30 minutes on a single Nvidia A30/RTX 2080 Ti. This work also highlights the promising efficiency of AI methods such as diffusion models for high-performance needs in weather forecasting, such as tropical cyclone forecasting, while remaining computationally affordable, making them ideal for highly vulnerable regions with critical forecasting needs and financial limitations. Code accessible at https://github.com/nathzi1505/forecast-diffmodels.
Submitted 30 July, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
Sampling - Variational Auto Encoder - Ensemble: In the Quest of Explainable Artificial Intelligence
Authors:
Sarit Maitra,
Vivek Mishra,
Pratima Verma,
Manav Chopra,
Priyanka Nath
Abstract:
Explainable Artificial Intelligence (XAI) models have recently attracted a great deal of interest from a variety of application sectors. Despite significant developments in this area, there are still no standardized methods or approaches for understanding AI model outputs. A systematic and cohesive framework is also increasingly necessary to incorporate new techniques like discriminative and generative models to close the gap. This paper contributes to the discourse on XAI by presenting an empirical evaluation based on a novel framework: Sampling - Variational Auto Encoder (VAE) - Ensemble Anomaly Detection (SVEAD). It is a hybrid architecture in which a VAE combined with ensemble stacking and SHapley Additive exPlanations (SHAP) is used for imbalanced classification. The findings reveal that combining ensemble stacking, VAE, and SHAP can not only lead to better model performance but also provide an easily explainable framework. This work uses SHAP combined with Permutation Importance and Individual Conditional Expectations to provide strong interpretability of the model. The findings have an important implication in the real world, where the need for XAI is paramount to boost confidence in AI applications.
Submitted 24 September, 2023;
originally announced September 2023.
-
Multiplierless In-filter Computing for tinyML Platforms
Authors:
Abhishek Ramdas Nair,
Pallab Kumar Nath,
Shantanu Chakrabartty,
Chetan Singh Thakur
Abstract:
Wildlife conservation using continuous monitoring of environmental factors and biomedical classification, which generate a vast amount of sensor data, is a challenge due to limited bandwidth in the case of remote monitoring. It becomes critical to perform classification where the data is generated, so that only classified data is used for monitoring. We present a novel multiplierless framework for in-filter acoustic classification using Margin Propagation (MP) approximation used in low-power edge devices deployable in remote areas with limited connectivity. The entire design of this classification framework is based on a template-based kernel machine, which includes feature extraction and inference, and uses basic primitives like addition/subtraction, shift, and comparator operations for hardware implementation. Unlike full-precision training methods for traditional classification, we use MP-based approximation for training, including backpropagation, mitigating approximation errors. The proposed framework is general enough for acoustic classification. However, we demonstrate the hardware friendliness of this framework by implementing a parallel Finite Impulse Response (FIR) filter bank in a kernel machine classifier optimized for a Field Programmable Gate Array (FPGA). The FIR filter acts as the feature extractor and non-linear kernel for the kernel machine implemented using MP approximation and a downsampling method to reduce the order of the filters. The FPGA implementation on Spartan 7 shows that the MP-approximated in-filter kernel machine is more efficient than traditional classification frameworks, using just under 1K slices.
Submitted 24 April, 2023;
originally announced April 2023.
-
ActKnow: Active External Knowledge Infusion Learning for Question Answering in Low Data Regime
Authors:
K. M. Annervaz,
Pritam Kumar Nath,
Ambedkar Dukkipati
Abstract:
Deep learning models have set benchmark results in various Natural Language Processing tasks. However, these models require an enormous amount of training data, which is infeasible in many practical problems. While techniques such as domain adaptation and few-shot learning address this problem, we introduce a new technique of actively infusing external knowledge into learning to solve low data regime problems. We propose a technique called ActKnow that actively infuses knowledge from Knowledge Graphs (KG) "on-demand" into learning for Question Answering (QA). By infusing world knowledge from Concept-Net, we show significant improvements on the ARC Challenge-set benchmark over purely text-based transformer models like RoBERTa in the low data regime. For example, by using only 20% of the training examples, we demonstrate a 4% improvement in accuracy for both ARC-challenge and OpenBookQA.
Submitted 17 December, 2021;
originally announced December 2021.
-
Multiplierless MP-Kernel Machine For Energy-efficient Edge Devices
Authors:
Abhishek Ramdas Nair,
Pallab Kumar Nath,
Shantanu Chakrabartty,
Chetan Singh Thakur
Abstract:
We present a novel framework for designing multiplierless kernel machines that can be used on resource-constrained platforms like intelligent edge devices. The framework uses a piecewise linear (PWL) approximation based on a margin propagation (MP) technique and uses only addition/subtraction, shift, comparison, and register underflow/overflow operations. We propose a hardware-friendly MP-based inference and online training algorithm that has been optimized for a Field Programmable Gate Array (FPGA) platform. Our FPGA implementation eliminates the need for DSP units and reduces the number of LUTs. By reusing the same hardware for inference and training, we show that the platform can overcome classification errors and local minima artifacts that result from the MP approximation. The implementation of this proposed multiplierless MP-kernel machine on FPGA results in an estimated energy consumption of 13.4 pJ and power consumption of 107 mW with ~9k LUTs and FFs each for a 256 x 32 sized kernel making it superior in terms of power, performance, and area compared to other comparable implementations.
Submitted 9 September, 2022; v1 submitted 3 June, 2021;
originally announced June 2021.
-
High-Level ETL for Semantic Data Warehouses -- Full Version
Authors:
Rudra Pratap Deb Nath,
Oscar Romero,
Torben Bach Pedersen,
Katja Hose
Abstract:
The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by the traditional Extract-Transform-Load (ETL) tools because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Unlike other ETL tools, we automate the ETL data flows by creating metadata at the schema level. This relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we use it to create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset, and compare it with the previous programmable framework SETLPROG in terms of productivity, development time and performance. The evaluation shows that 1) SETLCONSTRUCT reduces the Number of Typed Characters (NOTC) by 92% compared to SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT for generating ETL execution flow automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% using SETLAUTO; 3) SETLCONSTRUCT is scalable and has similar performance compared to SETLPROG.
Submitted 12 June, 2020;
originally announced June 2020.
-
A visual search engine for Bangladeshi laws
Authors:
Manash Kumar Mandal,
Pinku Deb Nath,
Arpeeta Shams Mizan,
Nazmus Saquib
Abstract:
Browsing and finding relevant information for Bangladeshi laws is a challenge faced by all law students and researchers in Bangladesh, and by citizens who want to learn about any legal procedure. Some law archives in Bangladesh are digitized, but lack proper tools to organize the data meaningfully. We present a text visualization tool that utilizes machine learning techniques to make the searching of laws quicker and easier. Using Doc2Vec to lay out law article nodes, link mining techniques to visualize relevant citation networks, and named entity recognition to quickly find relevant sections in long law articles, our tool provides a faster and better search experience to the users. Qualitative feedback from law researchers, students, and government officials shows promise for visually intuitive search tools in the context of governmental, legal, and constitutional data in developing countries, where digitized data does not necessarily pave the way towards easy access to information.
Submitted 14 November, 2017;
originally announced November 2017.
-
A sum form functional equation on a closed domain and its role in information theory
Authors:
P. Nath,
D. K. Singh
Abstract:
This paper is devoted to finding the general solutions of the functional equation
$\sum_{i=1}^{n} \sum_{j=1}^{m} h(p_i q_j)=\sum_{i=1}^{n} h(p_i)+\sum_{j=1}^{m} k_j(q_j)+\lambda\sum_{i=1}^{n} h(p_i)\sum_{j=1}^{m} k_j(q_j)$
valid for all complete probability distributions $(p_1,\ldots,p_n)$, $(q_1,\ldots,q_m)$, $0\le p_i\le 1$, $0\le q_j\le 1$, $i=1,\ldots,n$; $j=1,\ldots,m$, $\sum_{i=1}^{n} p_i=1$, $\sum_{j=1}^{m} q_j=1$; $n\ge 3$, $m\ge 3$ fixed integers; $\lambda\in\mathbb{R}$, $\lambda\neq 0$ and the mappings $h:I\to\mathbb{R}$, $k_j:I\to\mathbb{R}$, $j=1,\ldots,m$; $I=[0,1]$, $\mathbb{R}$ denoting the set of all real numbers. A special case of the above functional equation was treated earlier by L. Losonczi and Gy. Maksa.
Submitted 24 August, 2015;
originally announced August 2015.
-
An Efficient Metric of Automatic Weight Generation for Properties in Instance Matching Technique
Authors:
Md. Hanif Seddiqui,
Rudra Pratap Deb Nath,
Masaki Aono
Abstract:
The proliferation of heterogeneous data sources in semantic knowledge bases intensifies the need for an automatic instance matching technique. However, the efficiency of instance matching is often influenced by the weight of a property associated with instances. Automatic weight generation is a non-trivial but important task in instance matching. Identifying an appropriate metric for generating the weight of a property automatically is therefore a formidable task. In this paper, we investigate an approach to generating weights automatically by considering two hypotheses: (1) the weight of a property is directly proportional to the ratio of the number of its distinct values to the number of instances that contain the property, and (2) the weight is also proportional to the ratio of the number of distinct values of a property to the number of instances in a training dataset. The basic intuition behind our approach is the classical theory of information content, which holds that infrequent words are more informative than frequent ones. Our mathematical model derives a metric for generating property weights automatically, which is applied in an instance matching system to produce reconciled instances efficiently. Our experiments and evaluations show the effectiveness of our proposed metric of automatic weight generation for properties in an instance matching technique.
Submitted 12 February, 2015;
originally announced February 2015.