
The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

Authors: Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, Stuart M. Shieber

Abstract: Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.

Submitted 8 July, 2022; originally announced July 2022.

Comments: Website: https://patentdataset.org/, GitHub Repository: https://github.com/suzgunmirac/hupd, Hugging Face Datasets: https://huggingface.co/datasets/HUPD/hupd

arXiv:2011.11554

ML4H Abstract Track 2020

Authors: Emily Alsentzer, Matthew B. A. McDermott, Fabian Falck, Suproteem K. Sarkar, Subhrajit Roy, Stephanie L. Hyland

Abstract: A collection of the accepted abstracts for the Machine Learning for Health (ML4H) workshop at NeurIPS 2020. This index is not complete, as some accepted abstracts chose to opt-out of inclusion.

Submitted 19 November, 2020; originally announced November 2020.

arXiv:1811.11079 [pdf, other]

Robust Classification of Financial Risk

Authors: Suproteem K. Sarkar, Kojin Oshiba, Daniel Giebisch, Yaron Singer

Abstract: Algorithms are increasingly common components of high-impact decision-making, and a growing body of literature on adversarial examples in laboratory settings indicates that standard machine learning models are not robust. This suggests that real-world systems are also susceptible to manipulation or misclassification, which especially poses a challenge to machine learning models used in financial services. We use the loan grade classification problem to explore how machine learning models are sensitive to small changes in user-reported data, using adversarial attacks documented in the literature and an original, domain-specific attack. Our work shows that a robust optimization algorithm can build models for financial services that are resistant to misclassification on perturbations. To the best of our knowledge, this is the first study of adversarial attacks and defenses for deep learning in financial services.

Submitted 27 November, 2018; originally announced November 2018.

Comments: NIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services

arXiv:1207.2701 [pdf]

Spread Spectrum based Robust Image Watermark Authentication

Authors: T. S. Das, V. H. Mankar, S. K. Sarkar

Abstract: In this paper, a new approach to Spread Spectrum (SS) watermarking technique is introduced. This problem is particularly interesting in the field of modern multimedia applications like internet when copyright protection of digital image is required. The approach exploits two-predecessor single attractor (TPSA) cellular automata (CA) suitability to work as efficient authentication function in wavelet based SS watermarking domain. The scheme is designed from the analytical study of state transition behaviour of non-group CA and the basic cryptography/encryption scheme is significantly different from the conventional SS data hiding approaches. Experimental studies confirm that the scheme is robust in terms of confidentiality, authentication, non-repudiation and integrity. The transform domain blind watermarking technique offers better visual & statistical imperceptibility and resiliency against different types of intentional & unintentional image degradations. Interleaving and interference cancellation methods are employed to improve the robustness performance significantly compared to conventional matched filter detection.

Submitted 11 July, 2012; originally announced July 2012.

Comments: ICACC 2007 International Conference, Madurai, India, 9-10 Feb, 2007

arXiv:1207.2699 [pdf]

doi 10.1109/ICETET.2008.51

Robust Image Watermarking Under Pixel Wise Masking Framework

Authors: V. H. Mankar, T. S. Das, S. Saha, S. K. Sarkar

Abstract: The current paper presents a robust watermarking method for still images, which uses the similarity of discrete wavelet transform and human visual system (HVS). The proposed scheme makes the use of pixel wise masking in order to make binary watermark imperceptible to the HVS. The watermark is embedded in the perceptually significant, spatially selected detail coefficients using sub band adaptive threshold scheme. The threshold is computed based on the statistical analysis of the wavelet coefficients. The watermark is embedded several times to achieve better robustness. Here, a new type of non-oblivious detection method is proposed. The improvement in robustness performance against different types of deliberate and non-intentional image impairments (lossy compression, scaling, cropping, filtering etc) is supported through experimental results. The reported result also shows improvement in visual and statistical invisibility of the hidden data. The proposed method is compared with a state of the art frequency based watermarking technique, highlighting its performance. This algorithmic architecture utilizes the existing allocated bandwidth in the data transmission channel in a more efficient manner.

Submitted 11 July, 2012; originally announced July 2012.

Comments: First International Conference on Emerging Trends in Engineering and Technology ICETET 2008

arXiv:1207.2694 [pdf]

Discrete Chaotic Sequence based on Logistic Map in Digital Communications

Authors: V. H. Mankar, T. S. Das, S. K. Sarkar

Abstract: The chaotic systems have been found applications in diverse fields such as pseudo random number generator, coding, cryptography, spread spectrum (SS) communications etc. The inherent capability of generating a large space of PN sequences due to sensitive dependence on initial conditions has been the main reason for exploiting chaos in spread spectrum communication systems. This behaviour suggests that it is straightforward to generate a variety of initial condition induced PN sequences with nice statistical properties by quantising the output of an iterated chaotic map. In the present paper the study has been carried out for the feasibility and usefulness of chaotic sequence in SS based applications like communication and watermarking.

Submitted 11 July, 2012; originally announced July 2012.

Comments: National Conference on "Emerging Trends in Electronics Engineering & Computing" (E3C 2010)

Journal ref: National Conference on "Emerging Trends in Electronics Engineering & Computing" (E3C 2010)

arXiv:1207.2687 [pdf]

Performance Evaluation of Spread Spectrum Watermarking using Error Control Coding

Authors: T. S. Das, V. H. Mankar, S. K. Sarkar

Abstract: This paper proposes an oblivious watermarking algorithm with blind detection approach for high volume data hiding in image signals. We present a detection reliable signal adaptive embedding scheme for multiple messages in selective sub-bands of wavelet (DWT) coefficients using direct sequence spread spectrum (DS-SS) modulation technique. Here the impact of volumetric distortion sources is analyzed on the ability of analytical bounds in order to recover the watermark messages. In this context, the joint source-channel coding scheme has been employed to obtain the better control of the system robustness. This structure prevents the desynchronisation between encoder and decoder due to selective embedding. The experimental results obtained for Spread Spectrum (SS) transformed domain watermarking demonstrate the efficiency of the proposed system. This algorithmic architecture utilizes the existing allocated bandwidth in the data transmission channel in a more efficient manner.

Submitted 11 July, 2012; originally announced July 2012.

Comments: IET-UK International Conference on Information and Communication Technology in Electrical Sciences (ICTES 2007), Dr. M.G.R. University, Chennai, Tamil Nadu, India. Dec. 20-22, 2007. pp. 708-711

Journal ref: IET-UK International Conference on Information and Communication Technology in Electrical Sciences (ICTES 2007), Dr. M.G.R. University, Chennai, Tamil Nadu, India. Dec. 20-22, 2007. pp. 708-711

arXiv:1207.2675 [pdf]

Multimedia Steganographic Scheme using Multiresolution Analysis

Authors: Tirtha sankar Das, Ayan K. Sau, V. H. Mankar, Subir K. Sarkar

Abstract: Digital steganography or data hiding has emerged as a new area of research in connection to the communication in secured channel as well as intellectual property protection for multimedia signals. The redundancy in image representation can be exploited successfully to embed specified characteristic information with a good quality of imperceptibility. The hidden multimedia information will be communicated to the authentic user through secured channel as a part of the data. This article deals with a transform domain, block-based and signal non-adaptive/adaptive technique for inserting multimedia signals into an RGB image. The robustness of the proposed method has been tested compared to the other transform domain techniques. Proposed algorithm also shows improvement in visual and statistical invisibility of the hidden information.

Submitted 11 July, 2012; originally announced July 2012.

Comments: 3rd International Conference on Computers and Devices for Communication (CODEC-06) Institute of Radio Physics and Electronics, University of Calcutta, December 18-20, 2006

Journal ref: 3rd International Conference on Computers and Devices for Communication (CODEC-06), Institute of Radio Physics and Electronics, University of Calcutta, December 18-20, 2006

arXiv:1105.0377 [pdf]

WiMAX Based 60 GHz Millimeter-Wave Communication for Intelligent Transport System Applications

Authors: Rabindranath Bera, Subir Kumar Sarkar, Bikash Sharma, Samarendra Nath Sur, Debasish Bhaskar, Soumyasree Bera

Abstract: With the successful worldwide deployment of 3rd generation mobile communication, security aspects are ensured partly. Researchers are now looking for 4G mobile for its deployment with high data rate, enhanced security and reliability so that world should look for CALM, Continuous Air interface for Long and Medium range communication. This CALM will be a reliable high data rate secured mobile communication to be deployed for car to car communication (C2C) for safety application. This paper reviewed the WiMAX & 60 GHz RF carrier for C2C. The system is tested at SMIT laboratory with multimedia transmission and reception. With proper deployment of this 60 GHz system on vehicles, the existing commercial products for 802.11P will be required to be replaced or updated soon.

Submitted 2 May, 2011; originally announced May 2011.

Journal ref: International Journal of Wireless & Mobile Networks (IJWMN) Vol. 3, No. 2, April 2011, 214-223

arXiv:1006.1183 [pdf]

doi 10.5121/ijcsit.2010.2305

Hybrid Scenario Based Performance Analysis of DSDV and DSR

Authors: Koushik Majumder, Subir Kumar Sarkar

Abstract: The area of mobile ad hoc networking has received considerable attention of the research community in recent years. These networks have gained immense popularity primarily due to their infrastructure-less mode of operation which makes them a suitable candidate for deployment in emergency scenarios like relief operation, battlefield etc., where either the pre-existing infrastructure is totally damaged or it is not possible to establish a new infrastructure quickly. However, MANETs are constrained due to the limited transmission range of the mobile nodes which reduces the total coverage area. Sometimes the infrastructure-less ad hoc network may be combined with a fixed network to form a hybrid network which can cover a wider area with the advantage of having less fixed infrastructure. In such a combined network, for transferring data, we need base stations which act as gateways between the wired and wireless domains. Due to the hybrid nature of these networks, routing is considered a challenging task. Several routing protocols have been proposed and tested under various traffic conditions. However, the simulations of such routing protocols usually do not consider the hybrid network scenario. In this work we have carried out a systematic performance study of the two prominent routing protocols: Destination Sequenced Distance Vector Routing (DSDV) and Dynamic Source Routing (DSR) protocols in the hybrid networking environment. We have analyzed the performance differentials on the basis of three metrics - packet delivery fraction, average end-to-end delay and normalized routing load under varying pause time with different number of sources using NS2 based simulation.

Submitted 11 June, 2010; v1 submitted 7 June, 2010; originally announced June 2010.

Comments: 15 Pages

Journal ref: International Journal of Computer Science and Information Technology 2.3 (2010) 56-70

arXiv:0911.0402 [pdf]

A Cost Effective RFID Based Customized DVD-ROM to Thwart Software Piracy

Authors: Sudip Dogra, Ritwik Ray, Saustav Ghosh, Debharshi Bhattacharya, Subir Kr. Sarkar

Abstract: Software piracy has been a very perilous adversary of the software based industry, from the very beginning of the development of the latter into a significant business. There has been no developed foolproof system that has been developed to appropriately tackle this vile issue. We have in our scheme tried to develop a way to embark upon this problem using a very recently developed technology of RFID.

Submitted 2 November, 2009; originally announced November 2009.

Comments: 5 pages IEEE format, International Journal of Computer Science and Information Security, IJCSIS 2009, ISSN 1947 5500, Impact Factor 0.423, http://sites.google.com/site/ijcsis/

Report number: ISSN 1947 5500

Journal ref: International Journal of Computer Science and Information Security, IJCSIS, Vol. 6, No. 1, pp. 034-039, October 2009, USA

arXiv:0903.1506 [pdf]

Wi-Fi, WiMax and WCDMA A comparative study based on Channel Impairments and Equalization method used

Authors: Rabindranath Bera, Sanjib Sil, Sourav Dhar, Subir K. Sarkar

Abstract: In this paper we describe the channel impairments and equalization methods currently used in WiFi, WiMax and WCDMA. After a review of channel model for Intelligent Transportation System (ITS), we proposed an equalization method which will be useful for the estimation of strong multipath channel at a high velocity.

Submitted 9 March, 2009; originally announced March 2009.

Comments: 5 pages, 15 fig.,published in ISM-08, Bangalore, India

Showing 1–12 of 12 results for author: Sarkar, S K