
doi 10.1088/0004-637X/797/2/102

The Next Generation Virgo Cluster Survey. XV. The photometric redshift estimation for background sources

Authors: A. Raichoor, S. Mei, T. Erben, H. Hildebrandt, M. Huertas-Company, O. Ilbert, R. Licitra, N. M. Ball, S. Boissier, A. Boselli, Y.-T. Chen, P. Côté, J.-C. Cuillandre, P. A. Duc, P. R. Durrell, L. Ferrarese, P. Guhathakurta, S. D. J. Gwyn, J. J. Kavelaars, A. Lançon, C. Liu, L. A. MacArthur, M. Muller, R. P. Muñoz, E. W. Peng, et al. (6 additional authors not shown)

Abstract: The Next Generation Virgo Cluster Survey is an optical imaging survey covering 104 deg^2 centered on the Virgo cluster. Currently, the complete survey area has been observed in the u*giz-bands and one third in the r-band. We present the photometric redshift estimation for the NGVS background sources. After a dedicated data reduction, we perform accurate photometry, with special attention to precise color measurements through point spread function-homogenization. We then estimate the photometric redshifts with the Le Phare and BPZ codes. We add a new prior which extends to iAB = 12.5 mag. When using the u*griz-bands, our photometric redshifts for 15.5 ≤ i ≲ 23 mag or zphot ≲ 1 galaxies have a bias |Δz| < 0.02, less than 5% outliers, and a scatter σ_{outl.rej.} and an individual error on zphot that increase with magnitude (from 0.02 to 0.05 and from 0.03 to 0.10, respectively). When using the u*giz-bands over the same magnitude and redshift range, the lack of the r-band increases the uncertainties in the 0.3 ≲ zphot ≲ 0.8 range (-0.05 < Δz < -0.02, σ_{outl.rej.} ~ 0.06, 10-15% outliers, and zphot.err. ~ 0.15). We also present a joint analysis of the photometric redshift accuracy as a function of redshift and magnitude. We assess the quality of our photometric redshifts by comparison to spectroscopic samples and by verifying that the angular auto- and cross-correlation function w(θ) of the entire NGVS photometric redshift sample across redshift bins is in agreement with the expectations.

Submitted 8 October, 2014; originally announced October 2014.

Comments: Accepted for publication in ApJS. 24 pages, 21 Figures (some with degraded quality to fit the arxiv size limit), 6 Tables
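
The bias, outlier fraction, and outlier-rejected scatter quoted in the abstract above can be illustrated with a minimal sketch. The abstract does not define these statistics, so the definitions here are assumptions: Δz is taken as (zphot - zspec) / (1 + zspec), an outlier is any object with |Δz| above a hypothetical cut of 0.15, and the bias and scatter are the mean and standard deviation of Δz over the remaining objects. The function name and parameters are illustrative only.

```python
import statistics


def photoz_metrics(z_phot, z_spec, outlier_cut=0.15):
    """Return (bias, outlier_fraction, scatter) for matched redshift lists.

    Assumed conventions (not taken from the abstract):
      dz = (z_phot - z_spec) / (1 + z_spec)
      outlier: |dz| > outlier_cut
      bias, scatter: mean and population std. dev. of dz after outlier rejection
    Raises statistics.StatisticsError if every object is an outlier.
    """
    dz = [(zp - zs) / (1.0 + zs) for zp, zs in zip(z_phot, z_spec)]
    kept = [d for d in dz if abs(d) <= outlier_cut]
    outlier_fraction = 1.0 - len(kept) / len(dz)
    bias = statistics.fmean(kept)
    scatter = statistics.pstdev(kept)
    return bias, outlier_fraction, scatter


if __name__ == "__main__":
    z_spec = [0.1, 0.2, 0.3, 0.4]
    z_phot = [0.1, 0.2, 0.3, 1.0]  # the last object is a catastrophic failure
    print(photoz_metrics(z_phot, z_spec))
```

Other conventions (for example a median bias, or a scatter based on the median absolute deviation) are equally plausible; the paper itself should be consulted for the definitions actually used.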

arXiv:1312.3997 [pdf, ps, other]

Focus Demo: CANFAR+Skytree: A Cloud Computing and Data Mining System for Astronomy

Authors: Nicholas M. Ball

Abstract: This is a companion Focus Demonstration article to the CANFAR+Skytree poster (Ball 2012), demonstrating the usage of the Skytree machine learning software on the Canadian Advanced Network for Astronomical Research (CANFAR) cloud computing system. CANFAR+Skytree is the world's first cloud computing system for data mining in astronomy.

Submitted 13 December, 2013; originally announced December 2013.

Comments: 4 pages, 2 figures, uses asp2010.sty. Written when at National Research Council Canada. Now at Skytree, Inc., San Jose, CA, USA. There is a companion paper: "CANFAR+Skytree: A Cloud Computing and Data Mining System for Astronomy". Astronomical Data Analysis Software & Systems (ADASS) XXII, 2012, ASP Conference Proceedings, eds. Friedel D., Freemon M., Plante R. (San Francisco: ASP)

arXiv:1312.3996 [pdf, ps, other]

CANFAR+Skytree: A Cloud Computing and Data Mining System for Astronomy

Authors: Nicholas M. Ball

Abstract: At the Canadian Astronomy Data Centre, we have combined our cloud computing system, CANFAR, with the world's most advanced machine learning software, Skytree, to create the world's first cloud computing system for data mining in astronomy. CANFAR provides a generic environment for the storage and processing of large datasets, removing the requirement to set up and maintain a computing system when implementing an extensive undertaking such as a survey pipeline. 500 processor cores and several hundred terabytes of persistent storage are currently available to users. The storage is implemented via the International Virtual Observatory Alliance's VOSpace protocol, and is accessible both interactively, and to all processing jobs. The user interacts with CANFAR by utilizing virtual machines, which appear to them as equivalent to a desktop. Each machine is replicated as desired to perform large-scale parallel processing. Such an arrangement enables the user to immediately install and run the same astronomy code that they already utilize, in the same way as on a desktop. In addition, unlike many cloud systems, batch job scheduling is handled for the user on multiple virtual machines by the Condor job queueing system. Skytree is installed and run just as any other software on the system, and thus acts as a library of command line data mining functions that can be integrated into one's wider analysis. Thus we have created a generic environment for large-scale analysis by data mining, in the same way that CANFAR itself has done for storage and processing. Because Skytree scales to large data in linear runtime, this allows the full sophistication of the huge fields of data mining and machine learning to be applied to the hundreds of millions of objects that make up current large datasets. We demonstrate the utility of the CANFAR+Skytree system by showing science results obtained. [Abridged]

Submitted 13 December, 2013; originally announced December 2013.

Comments: 4 pages, 2 figures, uses asp2010.sty. Written when at National Research Council Canada. Now at Skytree, Inc., San Jose, CA, USA. There is a companion paper: "Focus Demo: CANFAR+Skytree: A Cloud Computing and Data Mining System for Astronomy". Astronomical Data Analysis Software & Systems (ADASS) XXII, 2012, ASP Conference Proceedings, eds. Friedel D., Freemon M., Plante R. (San Francisco: ASP)

arXiv:1110.5688 [pdf, ps, other]

Discussion on "Techniques for Massive-Data Machine Learning in Astronomy" by A. Gray

Authors: Nicholas M. Ball

Abstract: Astronomy is increasingly encountering two fundamental truths: (1) The field is faced with the task of extracting useful information from extremely large, complex, and high dimensional datasets; (2) The techniques of astroinformatics and astrostatistics are the only way to make this tractable, and bring the required level of sophistication to the analysis. Thus, an approach which provides these tools in a way that scales to these datasets is not just desirable, it is vital. The expertise required spans not just astronomy, but also computer science, statistics, and informatics. As a computer scientist and expert in machine learning, Alex's contribution of expertise and a large number of fast algorithms designed to scale to large datasets is extremely welcome. We focus in this discussion on the questions raised by the practical application of these algorithms to real astronomical datasets. That is, what is needed to maximally leverage their potential to improve the science return? This is not a trivial task. While computing and statistical expertise are required, so is astronomical expertise. Precedent has shown that, to date, the collaborations most productive in producing astronomical science results (e.g., the Sloan Digital Sky Survey) have either involved astronomers expert in computer science and/or statistics, or astronomers involved in close, long-term collaborations with experts in those fields. This does not mean that the astronomers are giving the most important input, but simply that their input is crucial in guiding the effort in the most fruitful directions, and coping with the issues raised by real data. Thus, the tools must be useable and understandable by those whose primary expertise is not computing or statistics, even though they may have quite extensive knowledge of those fields.

Submitted 25 October, 2011; originally announced October 2011.

Comments: 6 pages, 1 figure. Invited commentary, Statistical Challenges in Modern Astronomy V, Penn State, Jun 2011

arXiv:1110.5685 [pdf, ps, other]

doi 10.1007/978-1-4614-3323-1_6

Utilizing Astroinformatics to Maximize the Science Return of the Next Generation Virgo Cluster Survey

Authors: Nicholas M. Ball

Abstract: The Next Generation Virgo Cluster Survey is a 104 square degree survey of the Virgo Cluster, carried out using the MegaPrime camera of the Canada-France-Hawaii telescope, from semesters 2009A-2012A. The survey will provide coverage of this nearby dense environment in the universe to unprecedented depth, providing profound insights into galaxy formation and evolution, including definitive measurements of the properties of galaxies in a dense environment in the local universe, such as the luminosity function. The limiting magnitude of the survey is g_AB = 25.7 (10 sigma point source), and the 2 sigma surface brightness limit is g_AB ~ 29 mag arcsec^-2. The data volume of the survey (approximately 50 terabytes of images), while large by contemporary astronomical standards, is not intractable. This renders the survey amenable to the methods of astroinformatics. The enormous dynamic range of objects, from the giant elliptical galaxy M87 at M(B) = -21.6, to the faintest dwarf ellipticals at M(B) ~ -6, combined with photometry in 5 broad bands (u* g' r' i' z'), and unprecedented depth revealing many previously unseen structures, creates new challenges in object detection and classification. We present results from ongoing work on the survey, including photometric redshifts, Virgo cluster membership, and the implementation of fast data mining algorithms on the infrastructure of the Canadian Astronomy Data Centre, as part of the Canadian Advanced Network for Astronomical Research (CANFAR).

Submitted 25 October, 2011; originally announced October 2011.

Comments: 8 pages, 2 figures. Accepted for the Joint Workshop and Summer School: Astrostatistics and Data Mining in Large Astronomical Databases, La Palma, May 30th - June 3rd 2011. A higher resolution version is available at http://sites.google.com/site/nickballastronomer/publications

arXiv:0906.2173 [pdf, ps, other]

doi 10.1142/S0218271810017160

Data Mining and Machine Learning in Astronomy

Authors: Nicholas M. Ball, Robert J. Brunner

Abstract: We review the current state of data mining and machine learning in astronomy. 'Data Mining' can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black-box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science, and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm, and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

Submitted 10 August, 2010; v1 submitted 11 June, 2009; originally announced June 2009.

Comments: Published in IJMPD. 61 pages, uses ws-ijmpd.cls. Several extra figures, some minor additions to the text

Journal ref: Int.J.Mod.Phys.D19:1049-1106,2010

arXiv:0903.3121 [pdf, ps, other]

doi 10.1111/j.1365-2966.2009.15432.x

Incorporating Photometric Redshift Probability Density Information into Real-Space Clustering Measurements

Authors: Adam D Myers, Martin White, Nicholas M. Ball

Abstract: The use of photometric redshifts in cosmology is increasing. Often, however, these photo-zs are treated like spectroscopic observations, in that the peak of the photometric redshift, rather than the full probability density function (PDF), is used. This overlooks useful information inherent in the full PDF. We introduce a new real-space estimator for one of the most used cosmological statistics, the 2-point correlation function, that weights by the PDF of individual photometric objects in a manner that is optimal when Poisson statistics dominate. As our estimator does not bin based on the PDF peak, it substantially enhances the clustering signal by usefully incorporating information from all photometric objects that overlap the redshift bin of interest. As a real-world application, we measure QSO clustering in the Sloan Digital Sky Survey (SDSS). We find that our simplest binned estimator improves the clustering signal by a factor equivalent to increasing the survey size by a factor of 2-3. We also introduce a new implementation that fully weights between pairs of objects in constructing the cross-correlation and find that this pair-weighted estimator improves clustering signal in a manner equivalent to increasing the survey size by a factor of 4-5. Our technique uses spectroscopic data to anchor the distance scale and it will be particularly useful where spectroscopic data (e.g., from BOSS) overlaps deeper photometry (e.g., from Pan-STARRS, DES or the LSST). We additionally provide simple, informative expressions to determine when our estimator will be competitive with the autocorrelation of spectroscopic objects. Although we use QSOs as an example population, our estimator can and should be applied to any clustering estimate that uses photometric objects.

Submitted 14 September, 2009; v1 submitted 18 March, 2009; originally announced March 2009.

Comments: Replaced with accepted version. Major changes have been made, in that we now directly demonstrate the full, PDF pair-weighted clustering estimator and show that it increases the clustering signal even more substantially
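
To make the contrast between binning on the PDF peak and weighting by the PDF concrete, here is a minimal sketch. It is not the estimator from the paper above: it only shows the underlying idea that an object can contribute to a redshift bin in proportion to the fraction of its photometric redshift PDF falling inside that bin. The function names, the uniform redshift grid, and the half-open bin convention are all assumptions made for illustration.

```python
def pdf_weight_in_bin(z_grid, pdf, z_lo, z_hi):
    """Fraction of a gridded PDF lying in [z_lo, z_hi).

    Assumes z_grid is uniformly spaced, so plain sums stand in for integrals.
    Returns 0.0 for an all-zero PDF.
    """
    total = sum(pdf)
    if total <= 0.0:
        return 0.0
    inside = sum(p for z, p in zip(z_grid, pdf) if z_lo <= z < z_hi)
    return inside / total


def peak_in_bin(z_grid, pdf, z_lo, z_hi):
    """True if the PDF peak (first maximum on the grid) falls in [z_lo, z_hi)."""
    i_peak = max(range(len(pdf)), key=pdf.__getitem__)
    return z_lo <= z_grid[i_peak] < z_hi


if __name__ == "__main__":
    z_grid = [0.0, 0.5, 1.0, 1.5]
    pdf = [0.0, 1.0, 2.0, 1.0]
    # Peak binning ignores this object for the bin [0.25, 0.75) ...
    print(peak_in_bin(z_grid, pdf, 0.25, 0.75))
    # ... while PDF weighting still counts the part of the PDF that overlaps it.
    print(pdf_weight_in_bin(z_grid, pdf, 0.25, 0.75))
```

In a clustering measurement these weights would multiply the pair counts; how that is done optimally is the subject of the paper and is not reproduced here.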

arXiv:0804.3417 [pdf, other]

Robust Machine Learning Applied to Terascale Astronomical Datasets

Authors: Nicholas M. Ball, Robert J. Brunner, Adam D. Myers

Abstract: We present recent results from the LCDM (Laboratory for Cosmological Data Mining; http://lcdm.astro.uiuc.edu) collaboration between UIUC Astronomy and NCSA to deploy supercomputing cluster resources and machine learning algorithms for the mining of terascale astronomical datasets. This is a novel application in the field of astronomy, because we are using such resources for data mining, and not just performing simulations. Via a modified implementation of the NCSA cyberenvironment Data-to-Knowledge, we are able to provide improved classifications for over 100 million stars and galaxies in the Sloan Digital Sky Survey, improved distance measures, and a full exploitation of the simple but powerful k-nearest neighbor algorithm. A driving principle of this work is that our methods should be extensible from current terascale datasets to upcoming petascale datasets and beyond. We discuss issues encountered to date, and further issues for the transition to petascale. In particular, disk I/O will become a major limiting factor unless the necessary infrastructure is implemented.

Submitted 21 April, 2008; originally announced April 2008.

Comments: 11 pages, 2 figures, uses llncs.cls. To appear in the 9th LCI International Conference on High-Performance Clustered Computing

Report number: Not arXiv:0710.4482

arXiv:0804.3413 [pdf, ps, other]

doi 10.1086/589646

Robust Machine Learning Applied to Astronomical Datasets III: Probabilistic Photometric Redshifts for Galaxies and Quasars in the SDSS and GALEX

Authors: Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, Natalie E. Strand, Stacey L. Alberts, David Tcheng

Abstract: We apply machine learning in the form of a nearest neighbor instance-based algorithm (NN) to generate full photometric redshift probability density functions (PDFs) for objects in the Fifth Data Release of the Sloan Digital Sky Survey (SDSS DR5). We use a conceptually simple but novel application of NN to generate the PDFs - perturbing the object colors by their measurement error - and using the resulting instances of nearest neighbor distributions to generate numerous individual redshifts. When the redshifts are compared to existing SDSS spectroscopic data, we find that the mean value of each PDF has a dispersion between the photometric and spectroscopic redshift consistent with other machine learning techniques, being sigma = 0.0207 +/- 0.0001 for main sample galaxies to r < 17.77 mag, sigma = 0.0243 +/- 0.0002 for luminous red galaxies to r < ~19.2 mag, and sigma = 0.343 +/- 0.005 for quasars to i < 20.3 mag. The PDFs allow the selection of subsets with improved statistics. For quasars, the improvement is dramatic: for those with a single peak in their probability distribution, the dispersion is reduced from 0.343 to sigma = 0.117 +/- 0.010, and the photometric redshift is within 0.3 of the spectroscopic redshift for 99.3 +/- 0.1% of the objects. Thus, for this optical quasar sample, we can virtually eliminate 'catastrophic' photometric redshift estimates. In addition to the SDSS sample, we incorporate ultraviolet photometry from the Third Data Release of the Galaxy Evolution Explorer All-Sky Imaging Survey (GALEX AIS GR3) to create PDFs for objects seen in both surveys. For quasars, the increased coverage of the observed frame UV of the SED results in significant improvement over the full SDSS sample, with sigma = 0.234 +/- 0.010. We demonstrate that this improvement is genuine. [Abridged]

Submitted 21 April, 2008; originally announced April 2008.

Comments: Accepted to ApJ, 10 pages, 12 figures, uses emulateapj.cls
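
The perturbation idea described in the abstract above (perturb an object's colors by their measurement errors, find the nearest neighbor each time, and collect the resulting redshifts) can be sketched as follows. This is a toy illustration, not the authors' pipeline: Gaussian perturbations, a single nearest neighbor with Euclidean distance, a brute-force search, and all function and parameter names are assumptions.

```python
import random


def nearest_redshift(colors, training):
    """Redshift of the training object closest to `colors` (1-NN, Euclidean).

    `training` is a list of (colors, redshift) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(training, key=lambda item: sq_dist(colors, item[0]))[1]


def redshift_samples(colors, errors, training, n_draws=100, seed=0):
    """Draw perturbed copies of `colors` and return one 1-NN redshift per draw.

    Each color is perturbed by a Gaussian of width equal to its quoted error
    (an assumed noise model). A histogram of the returned list approximates
    a photometric redshift PDF for the object.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_draws):
        perturbed = [rng.gauss(c, e) for c, e in zip(colors, errors)]
        samples.append(nearest_redshift(perturbed, training))
    return samples


if __name__ == "__main__":
    # Two hypothetical training objects in a two-color space.
    training = [((0.0, 0.0), 0.1), ((1.0, 1.0), 0.5)]
    samples = redshift_samples((0.5, 0.5), (0.3, 0.3), training)
    for z in sorted(set(samples)):
        print(z, samples.count(z) / len(samples))
```

An object sitting between two training populations, as in the usage example, yields a two-peaked set of samples, which is the kind of multi-peaked PDF the abstract says can be used to select cleaner subsets.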

arXiv:0710.4482 [pdf, ps, other]

Robust Machine Learning Applied to Terascale Astronomical Datasets

Authors: Nicholas M. Ball, Robert J. Brunner, Adam D. Myers

Abstract: We present recent results from the Laboratory for Cosmological Data Mining (http://lcdm.astro.uiuc.edu) at the National Center for Supercomputing Applications (NCSA) to provide robust classifications and photometric redshifts for objects in the terascale-class Sloan Digital Sky Survey (SDSS). Through a combination of machine learning in the form of decision trees, k-nearest neighbor, and genetic algorithms, the use of supercomputing resources at NCSA, and the cyberenvironment Data-to-Knowledge, we are able to provide improved classifications for over 100 million objects in the SDSS, improved photometric redshifts, and a full exploitation of the powerful k-nearest neighbor algorithm. This work is the first to apply the full power of these algorithms to contemporary terascale astronomical datasets, and the improvement over existing results is demonstrable. We discuss issues that we have encountered in dealing with data on the terascale, and possible solutions that can be implemented to deal with upcoming petascale datasets.

Submitted 24 October, 2007; originally announced October 2007.

Comments: 4 pages, 1 figure, uses adassconf.sty, asp2006.sty. To appear in the proceedings of ADASS XVII, London, UK, Sep 2007

arXiv:astro-ph/0612471 [pdf, ps, other]

doi 10.1086/518362

Robust Machine Learning Applied to Astronomical Datasets II: Quantifying Photometric Redshifts for Quasars Using Instance-Based Learning

Authors: Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, Natalie E. Strand, Stacey L. Alberts, David Tcheng, Xavier Llorà

Abstract: We apply instance-based machine learning in the form of a k-nearest neighbor algorithm to the task of estimating photometric redshifts for 55,746 objects spectroscopically classified as quasars in the Fifth Data Release of the Sloan Digital Sky Survey. We compare the results obtained to those from an empirical color-redshift relation (CZR). In contrast to previously published results using CZRs, we find that the instance-based photometric redshifts are assigned with no regions of catastrophic failure. Remaining outliers are simply scattered about the ideal relation, in a similar manner to the pattern seen in the optical for normal galaxies at redshifts z < ~1. The instance-based algorithm is trained on a representative sample of the data and pseudo-blind-tested on the remaining unseen data. The variance between the photometric and spectroscopic redshifts is sigma^2 = 0.123 +/- 0.002 (compared to sigma^2 = 0.265 +/- 0.006 for the CZR), and 54.9 +/- 0.7%, 73.3 +/- 0.6%, and 80.7 +/- 0.3% of the objects are within delta z < 0.1, 0.2, and 0.3 respectively. We also match our sample to the Second Data Release of the Galaxy Evolution Explorer legacy data and the resulting 7,642 objects show a further improvement, giving a variance of sigma^2 = 0.054 +/- 0.005, and 70.8 +/- 1.2%, 85.8 +/- 1.0%, and 90.8 +/- 0.7% of objects within delta z < 0.1, 0.2, and 0.3. We show that the improvement is indeed due to the extra information provided by GALEX, by training on the same dataset using purely SDSS photometry, which has a variance of sigma^2 = 0.090 +/- 0.007. Each set of results represents a realistic standard for application to further datasets for which the spectra are representative.

Submitted 22 March, 2007; v1 submitted 17 December, 2006; originally announced December 2006.

Comments: 8 pages, 5 figures, textual changes to match ApJ accepted version, uses emulateapj.cls

Journal ref: Astrophys.J.663:774-780,2007

arXiv:astro-ph/0610171 [pdf, ps, other]

doi 10.1111/j.1365-2966.2007.12627.x

Galaxy Colour, Morphology, and Environment in the Sloan Digital Sky Survey

Authors: Nicholas M. Ball, Jon Loveday, Robert J. Brunner

Abstract: We use the Fourth Data Release of the Sloan Digital Sky Survey to investigate the relation between galaxy rest frame u-r colour, morphology, as described by the concentration and Sersic indices, and environmental density, for a sample of 79,553 galaxies at z < ~0.1. We split the samples according to density and luminosity and recover the expected bimodal distribution in the colour-morphology plane, shown especially clearly by this subsampling. We quantify the bimodality by a sum of two Gaussians on the colour and morphology axes and show that, for the red/early-type population both colour and morphology do not change significantly as a function of density. For the blue/late-type population, with increasing density the colour becomes redder but the morphology again does not change significantly. Both populations become monotonically redder and of earlier type with increasing luminosity. There is no significant qualitative difference between the behaviour of the two morphological measures. We supplement the morphological sample with 13,655 galaxies assigned Hubble types by an artificial neural network. We find, however, that the resulting distribution is less well described by two Gaussians. Therefore, there are either more than two significant morphological populations, physical processes not seen in colour space, or the Hubble type, particularly the different subtypes of spirals Sa-Sd, has an irreducible fuzziness when related to environmental density. For each of the three measures of morphology, on removing the density relation due to it, we recover a strong residual relation in colour. However, on similarly removing the colour-density relation there is no evidence for a residual relation due to morphology. [Abridged]

Submitted 24 October, 2007; v1 submitted 5 October, 2006; originally announced October 2006.

Comments: Substantial revision to match MNRAS accepted version. Overall conclusions unchanged. 16 pages, 13 figures

Journal ref: Mon.Not.Roy.Astron.Soc.383:907-922,2008

arXiv:astro-ph/0606541 [pdf, ps, other]

doi 10.1086/507440

Robust Machine Learning Applied to Astronomical Datasets I: Star-Galaxy Classification of the SDSS DR3 Using Decision Trees

Authors: Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, David Tcheng

Abstract: We provide classifications for all 143 million non-repeat photometric objects in the Third Data Release of the Sloan Digital Sky Survey (SDSS) using decision trees trained on 477,068 objects with SDSS spectroscopic data. We demonstrate that these star/galaxy classifications are expected to be reliable for approximately 22 million objects with r < ~20. The general machine learning environment Data-to-Knowledge and supercomputing resources enabled extensive investigation of the decision tree parameter space. This work presents the first public release of objects classified in this way for an entire SDSS data release. The objects are classified as either galaxy, star or nsng (neither star nor galaxy), with an associated probability for each class. To demonstrate how to effectively make use of these classifications, we perform several important tests. First, we detail selection criteria within the probability space defined by the three classes to extract samples of stars and galaxies to a given completeness and efficiency. Second, we investigate the efficacy of the classifications and the effect of extrapolating from the spectroscopic regime by performing blind tests on objects in the SDSS, 2dF Galaxy Redshift and 2dF QSO Redshift (2QZ) surveys. Given the photometric limits of our spectroscopic training data, we effectively begin to extrapolate past our star-galaxy training set at r ~ 18. By comparing the number counts of our training sample with the classified sources, however, we find that our efficiencies appear to remain robust to r ~ 20. As a result, we expect our classifications to be accurate for 900,000 galaxies and 6.7 million stars, and remain robust via extrapolation for a total of 8.0 million galaxies and 13.9 million stars. [Abridged]

Submitted 21 June, 2006; originally announced June 2006.

Comments: 27 pages, 12 figures, to be published in ApJ, uses emulateapj.cls

Journal ref: Astrophys.J.650:497-509,2006

arXiv:astro-ph/0507547 [pdf, ps, other]

doi 10.1111/j.1365-2966.2006.11082.x

Bivariate Galaxy Luminosity Functions in the Sloan Digital Sky Survey

Authors: Nicholas M Ball, Jon Loveday, Robert J Brunner, Ivan K Baldry, Jon Brinkmann

Abstract: Bivariate luminosity functions (LFs) are computed for galaxies in the New York Value-Added Galaxy Catalogue, based on the Sloan Digital Sky Survey Data Release 4. The galaxy properties investigated are the morphological type, inverse concentration index, Sersic index, absolute effective surface brightness, reference frame colours, absolute radius, eClass spectral type, stellar mass and galaxy environment. The morphological sample is flux-limited to galaxies with r < 15.9 and consists of 37,047 classifications to an RMS accuracy of +/- half a class in the sequence E, S0, Sa, Sb, Sc, Sd, Im. These were assigned by an artificial neural network, based on a training set of 645 eyeball classifications. The other samples use r < 17.77 with a median redshift of z ~ 0.08, and a limiting redshift of z < 0.15 to minimize the effects of evolution. Other cuts, for example in axis ratio, are made to minimize biases. A wealth of detail is seen, with clear variations between the LFs according to absolute magnitude and the second parameter. They are consistent with an early type, bright, concentrated, red population and a late type, faint, less concentrated, blue, star forming population. This bimodality suggests two major underlying physical processes, which in agreement with previous authors we hypothesize to be merger and accretion, associated with the properties of bulges and discs respectively. The bivariate luminosity-surface brightness distribution is fit with the Choloniewski function (a Schechter function in absolute magnitude and Gaussian in surface brightness). The fit is found to be poor, as might be expected if there are two underlying processes.

Submitted 18 September, 2006; v1 submitted 22 July, 2005; originally announced July 2005.

Comments: Major changes to match MNRAS accepted version: updated to SDSS Data Release 4, added completeness maps, and lengthened text. 26 pages, 20 figures

Journal ref: Mon.Not.Roy.Astron.Soc.373:845-868,2006

arXiv:astro-ph/0306390

doi 10.1111/j.1365-2966.2004.07429.x

Galaxy Types in the Sloan Digital Sky Survey Using Supervised Artificial Neural Networks

Authors: Nicholas M Ball, Jon Loveday, Masataka Fukugita, Osamu Nakamura, Sadanori Okamura, Jon Brinkmann, Robert J Brunner

Abstract: Supervised artificial neural networks are used to predict useful properties of galaxies in the Sloan Digital Sky Survey, in this instance morphological classifications, spectral types and redshifts. By giving the trained networks unseen data, it is found that correlations between predicted and actual properties are around 0.9 with rms errors of order ten per cent. Thus, given a representative training set, these properties may be reliably estimated for galaxies in the survey for which there are no spectra and without human intervention.

Submitted 19 June, 2003; originally announced June 2003.

Comments: Submitted to MNRAS; 9 pages; University of Sussex, UK. Postscript containing higher resolution versions of figures 2 and 3 is available at http://www.astronomy.sussex.ac.uk/~kape7/ball_030618_mnras.ps.gz . The figures are also available separately at http://www.astronomy.sussex.ac.uk/~kape7/ball_030618_figure2_mnras.eps.gz and http://www.astronomy.sussex.ac.uk/~kape7/ball_030618_figure3_mnras.eps.gz

Journal ref: Mon.Not.Roy.Astron.Soc.348:1038,2004

arXiv:astro-ph/0110492

Morphological Classification of Galaxies Using Artificial Neural Networks

Authors: Nicholas M. Ball

Abstract: The results of morphological galaxy classifications performed by humans and by automated methods are compared. In particular, a comparison is made between the eyeball classifications of 454 galaxies in the Sloan Digital Sky Survey (SDSS) commissioning data (Shimasaku et al. 2001) with those of supervised artificial neural network programs constructed using the MATLAB Neural Network Toolbox package. Networks in this package have not previously been used for galaxy classification. It is found that simple neural networks are able to improve on the results of linear classifiers, giving correlation coefficients of the order of 0.8 +/- 0.1, compared with those of around 0.7 +/- 0.1 for linear classifiers. The networks are trained using the resilient backpropagation algorithm, which, to the author's knowledge, has not been specifically used in the galaxy classification literature. The galaxy parameters used and the network architecture are both important, and in particular the galaxy concentration index, a measure of the concentration of light towards the centre of the galaxy, is the most significant parameter. Simple networks are briefly applied to 29,429 galaxies with redshifts from the SDSS Early Data Release. They give an approximate ratio of types E/S0:Sp:Irr of 14 +/- 5 : 86 +/- 12 : 0 +/- 0.1, which broadly agrees with the well known approximate ratios of 20:80:1 observed at low redshift.

Submitted 22 October, 2001; originally announced October 2001.

Comments: MSc thesis (1 year postgraduate course), University of Sussex, UK; 80 pages, submitted August 30th 2001

Showing 1–16 of 16 results for author: Ball, N M