-
The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications
Authors:
Mirac Suzgun,
Luke Melas-Kyriazi,
Suproteem K. Sarkar,
Scott Duke Kominers,
Stuart M. Shieber
Abstract:
Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.
Submitted 8 July, 2022;
originally announced July 2022.
-
ML4H Abstract Track 2020
Authors:
Emily Alsentzer,
Matthew B. A. McDermott,
Fabian Falck,
Suproteem K. Sarkar,
Subhrajit Roy,
Stephanie L. Hyland
Abstract:
A collection of the accepted abstracts for the Machine Learning for Health (ML4H) workshop at NeurIPS 2020. This index is not complete, as some authors of accepted abstracts chose to opt out of inclusion.
Submitted 19 November, 2020;
originally announced November 2020.
-
Robust Classification of Financial Risk
Authors:
Suproteem K. Sarkar,
Kojin Oshiba,
Daniel Giebisch,
Yaron Singer
Abstract:
Algorithms are increasingly common components of high-impact decision-making, and a growing body of literature on adversarial examples in laboratory settings indicates that standard machine learning models are not robust. This suggests that real-world systems are also susceptible to manipulation or misclassification, which especially poses a challenge to machine learning models used in financial services. We use the loan grade classification problem to explore how machine learning models are sensitive to small changes in user-reported data, using adversarial attacks documented in the literature and an original, domain-specific attack. Our work shows that a robust optimization algorithm can build models for financial services that are resistant to misclassification on perturbations. To the best of our knowledge, this is the first study of adversarial attacks and defenses for deep learning in financial services.
Submitted 27 November, 2018;
originally announced November 2018.
-
Spread Spectrum based Robust Image Watermark Authentication
Authors:
T. S. Das,
V. H. Mankar,
S. K. Sarkar
Abstract:
In this paper, a new approach to the Spread Spectrum (SS) watermarking technique is introduced. This problem is particularly interesting in modern multimedia applications such as the internet, where copyright protection of digital images is required. The approach exploits the suitability of two-predecessor single attractor (TPSA) cellular automata (CA) to work as an efficient authentication function in the wavelet-based SS watermarking domain. The scheme is designed from an analytical study of the state transition behaviour of non-group CA, and the basic cryptography/encryption scheme is significantly different from conventional SS data hiding approaches. Experimental studies confirm that the scheme is robust in terms of confidentiality, authentication, non-repudiation and integrity. The transform-domain blind watermarking technique offers better visual and statistical imperceptibility and resiliency against different types of intentional and unintentional image degradations. Interleaving and interference cancellation methods are employed to improve the robustness performance significantly compared to conventional matched filter detection.
Submitted 11 July, 2012;
originally announced July 2012.
-
Robust Image Watermarking Under Pixel Wise Masking Framework
Authors:
V. H. Mankar,
T. S. Das,
S. Saha,
S. K. Sarkar
Abstract:
The current paper presents a robust watermarking method for still images, which exploits the similarity between the discrete wavelet transform and the human visual system (HVS). The proposed scheme makes use of pixel-wise masking in order to make the binary watermark imperceptible to the HVS. The watermark is embedded in the perceptually significant, spatially selected detail coefficients using a sub-band adaptive threshold scheme. The threshold is computed based on the statistical analysis of the wavelet coefficients. The watermark is embedded several times to achieve better robustness. Here, a new type of non-oblivious detection method is proposed. The improvement in robustness performance against different types of deliberate and non-intentional image impairments (lossy compression, scaling, cropping, filtering, etc.) is supported through experimental results. The reported results also show improvement in visual and statistical invisibility of the hidden data. The proposed method is compared with a state-of-the-art frequency-based watermarking technique, highlighting its performance. This algorithmic architecture utilizes the existing allocated bandwidth in the data transmission channel in a more efficient manner.
Submitted 11 July, 2012;
originally announced July 2012.
-
Discrete Chaotic Sequence based on Logistic Map in Digital Communications
Authors:
V. H. Mankar,
T. S. Das,
S. K. Sarkar
Abstract:
Chaotic systems have found applications in diverse fields such as pseudo-random number generation, coding, cryptography, and spread spectrum (SS) communications. The inherent capability of generating a large space of PN sequences, owing to sensitive dependence on initial conditions, has been the main reason for exploiting chaos in spread spectrum communication systems. This behaviour suggests that it is straightforward to generate a variety of initial-condition-induced PN sequences with good statistical properties by quantising the output of an iterated chaotic map. The present paper studies the feasibility and usefulness of chaotic sequences in SS-based applications such as communication and watermarking.
Submitted 11 July, 2012;
originally announced July 2012.
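The quantisation idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the map parameter `r = 4.0`, the 0.5 threshold, and the burn-in length are all assumptions chosen for illustration.

```python
def logistic_map_bits(x0, n, r=4.0, threshold=0.5, burn_in=100):
    """Return n bits obtained by thresholding the logistic map x <- r*x*(1-x).

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    x = x0
    # Discard the first iterates so the output does not trivially
    # reflect the seed value.
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    bits = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        # One-bit quantisation of the chaotic state.
        bits.append(1 if x >= threshold else 0)
    return bits


def to_bipolar(bits):
    """Map {0, 1} bits to {-1, +1} chips, the usual form for SS spreading."""
    return [2 * b - 1 for b in bits]


# Two seeds that differ only in the sixth decimal place.
seq_a = to_bipolar(logistic_map_bits(0.123456, 64))
seq_b = to_bipolar(logistic_map_bits(0.123457, 64))
```

Because the map is sensitive to its initial condition, each seed acts as a key selecting a different sequence, which is the property the abstract highlights for spread spectrum use.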
-
Performance Evaluation of Spread Spectrum Watermarking using Error Control Coding
Authors:
T. S. Das,
V. H. Mankar,
S. K. Sarkar
Abstract:
This paper proposes an oblivious watermarking algorithm with a blind detection approach for high-volume data hiding in image signals. We present a detection-reliable, signal-adaptive embedding scheme for multiple messages in selective sub-bands of wavelet (DWT) coefficients using the direct sequence spread spectrum (DS-SS) modulation technique. Here the impact of volumetric distortion sources is analyzed on the ability of analytical bounds in order to recover the watermark messages. In this context, a joint source-channel coding scheme has been employed to obtain better control of the system robustness. This structure prevents desynchronisation between encoder and decoder due to selective embedding. The experimental results obtained for Spread Spectrum (SS) transform-domain watermarking demonstrate the efficiency of the proposed system. This algorithmic architecture utilizes the existing allocated bandwidth in the data transmission channel in a more efficient manner.
Submitted 11 July, 2012;
originally announced July 2012.
-
Multimedia Steganographic Scheme using Multiresolution Analysis
Authors:
Tirtha Sankar Das,
Ayan K. Sau,
V. H. Mankar,
Subir K. Sarkar
Abstract:
Digital steganography, or data hiding, has emerged as a new area of research in connection with communication over secured channels as well as intellectual property protection for multimedia signals. The redundancy in image representation can be exploited successfully to embed specified characteristic information with good imperceptibility. The hidden multimedia information is communicated to the authentic user through a secured channel as part of the data. This article deals with a transform-domain, block-based and signal non-adaptive/adaptive technique for inserting multimedia signals into an RGB image. The robustness of the proposed method has been tested against other transform-domain techniques. The proposed algorithm also shows improvement in visual and statistical invisibility of the hidden information.
Submitted 11 July, 2012;
originally announced July 2012.
-
WiMAX Based 60 GHz Millimeter-Wave Communication for Intelligent Transport System Applications
Authors:
Rabindranath Bera,
Subir Kumar Sarkar,
Bikash Sharma,
Samarendra Nath Sur,
Debasish Bhaskar,
Soumyasree Bera
Abstract:
With the successful worldwide deployment of 3rd generation mobile communication, security aspects are only partly ensured. Researchers are now looking to 4G mobile communication for its high data rate, enhanced security and reliability, so the world should look to CALM, the Continuous Air interface for Long and Medium range communication. CALM will be a reliable, secure, high-data-rate mobile communication system to be deployed for car-to-car (C2C) communication in safety applications. This paper reviews WiMAX and a 60 GHz RF carrier for C2C. The system is tested at the SMIT laboratory with multimedia transmission and reception. With proper deployment of this 60 GHz system on vehicles, the existing commercial products for 802.11p will soon need to be replaced or updated.
Submitted 2 May, 2011;
originally announced May 2011.
-
Hybrid Scenario Based Performance Analysis of DSDV and DSR
Authors:
Koushik Majumder,
Subir Kumar Sarkar
Abstract:
The area of mobile ad hoc networking has received considerable attention of the research community in recent years. These networks have gained immense popularity primarily due to their infrastructure-less mode of operation which makes them a suitable candidate for deployment in emergency scenarios like relief operation, battlefield etc., where either the pre-existing infrastructure is totally damaged or it is not possible to establish a new infrastructure quickly. However, MANETs are constrained due to the limited transmission range of the mobile nodes which reduces the total coverage area. Sometimes the infrastructure-less ad hoc network may be combined with a fixed network to form a hybrid network which can cover a wider area with the advantage of having less fixed infrastructure. In such a combined network, for transferring data, we need base stations which act as gateways between the wired and wireless domains. Due to the hybrid nature of these networks, routing is considered a challenging task. Several routing protocols have been proposed and tested under various traffic conditions. However, the simulations of such routing protocols usually do not consider the hybrid network scenario. In this work we have carried out a systematic performance study of the two prominent routing protocols: Destination Sequenced Distance Vector Routing (DSDV) and Dynamic Source Routing (DSR) protocols in the hybrid networking environment. We have analyzed the performance differentials on the basis of three metrics - packet delivery fraction, average end-to-end delay and normalized routing load under varying pause time with different number of sources using NS2 based simulation.
Submitted 11 June, 2010; v1 submitted 7 June, 2010;
originally announced June 2010.
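The three metrics named in the abstract above are not defined there. The sketch below uses the definitions commonly adopted in MANET simulation studies, which is an assumption, and the `TraceSummary` record and its field names are hypothetical rather than an NS2 trace format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceSummary:
    """Hypothetical per-run counters extracted from a simulation trace."""
    data_sent: int          # data packets originated by the sources
    data_received: int      # data packets delivered to the destinations
    routing_packets: int    # routing control packets transmitted
    delays: List[float]     # per-packet end-to-end delays in seconds


def packet_delivery_fraction(t: TraceSummary) -> float:
    """Delivered data packets divided by originated data packets."""
    return t.data_received / t.data_sent if t.data_sent else 0.0


def average_end_to_end_delay(t: TraceSummary) -> float:
    """Mean delay over the packets that were actually delivered."""
    return sum(t.delays) / len(t.delays) if t.delays else 0.0


def normalized_routing_load(t: TraceSummary) -> float:
    """Routing packets transmitted per data packet delivered."""
    if t.data_received == 0:
        return float("inf")
    return t.routing_packets / t.data_received


# Made-up numbers, purely to show how the functions are called.
run = TraceSummary(data_sent=200, data_received=180,
                   routing_packets=90, delays=[0.1, 0.2, 0.3])
pdf = packet_delivery_fraction(run)
delay = average_end_to_end_delay(run)
nrl = normalized_routing_load(run)
```

Computing one such summary per pause time and per number of sources would give the kind of comparison between DSDV and DSR that the abstract describes.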
-
A Cost Effective RFID Based Customized DVD-ROM to Thwart Software Piracy
Authors:
Sudip Dogra,
Ritwik Ray,
Saustav Ghosh,
Debharshi Bhattacharya,
Subir Kr. Sarkar
Abstract:
Software piracy has been a perilous adversary of the software industry from the very beginning of that industry's development into a significant business. No foolproof system has yet been developed to tackle this issue appropriately. In our scheme, we try to address this problem using the recently developed technology of RFID.
Submitted 2 November, 2009;
originally announced November 2009.
-
Wi-Fi, WiMax and WCDMA A comparative study based on Channel Impairments and Equalization method used
Authors:
Rabindranath Bera,
Sanjib Sil,
Sourav Dhar,
Subir K. Sarkar
Abstract:
In this paper we describe the channel impairments and equalization methods currently used in WiFi, WiMax and WCDMA. After reviewing the channel model for Intelligent Transportation Systems (ITS), we propose an equalization method that will be useful for estimating strong multipath channels at high velocity.
Submitted 9 March, 2009;
originally announced March 2009.