-
Pivoting the paradigm: the role of spreadsheets in K-12 data science
Authors:
Oren Tirschwell,
Nicholas Jon Horton
Abstract:
Spreadsheet tools are widely accessible to and commonly used by K-12 students and teachers. They have an important role in data collection and organization. Beyond data organization, spreadsheets also make data visible and easy to interact with, facilitating student engagement in data exploration and analysis. Though not suitable for all circumstances, spreadsheets can and do help foster data and computing skills for K-12 students. This paper 1) reviews prior frameworks on K-12 data tools; 2) proposes data-driven learning outcomes that can be accomplished by incorporating spreadsheets into the curriculum; and 3) discusses how spreadsheets can help develop data acumen and computational fluency. We provide example class activities, identify challenges and barriers to adoption, suggest pedagogical approaches to ease the learning curve for instructors and students, and discuss the need for professional development to facilitate deeper use of spreadsheets for data science and STEM disciplines.
Submitted 3 June, 2025;
originally announced June 2025.
-
The Exchangeability Assumption for Permutation Tests of Multiple Regression Models: Implications for Statistics and Data Science Educators
Authors:
Johanna Hardin,
Lauren Quesada,
Julie Ye,
Nicholas J. Horton
Abstract:
Permutation tests are a powerful and flexible approach to inference via resampling. As computational methods become more ubiquitous in the statistics curriculum, use of permutation tests has become more tractable. At the heart of the permutation approach is the exchangeability assumption, which determines the appropriate null sampling distribution. We explore the exchangeability assumption in the context of permutation tests for multiple linear regression models, including settings where the assumption is not tenable. Various permutation schemes for the multiple linear regression setting have been proposed and assessed in the literature. As has been demonstrated previously, in most settings, the choice of how to permute a multiple linear regression model does not materially change inferential conclusions with respect to Type I errors. However, some violations (e.g., when clustering is not appropriately accounted for) lead to issues with Type I error rates. Regardless, we believe that understanding (1) exchangeability in the multiple linear regression setting and also (2) how it relates to the null hypothesis of interest is valuable. We close with pedagogical recommendations for instructors who want to bring multiple linear regression permutation inference into their classroom as a way to deepen student understanding of resampling-based inference.
Submitted 5 June, 2025; v1 submitted 11 June, 2024;
originally announced June 2024.
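To make the idea of a permutation scheme for multiple regression concrete, here is a minimal sketch in Python. It is not the authors' code. It assumes two predictors, tests the coefficient on `x1`, and permutes the residuals of the reduced model `y ~ x2` (a scheme commonly attributed to Freedman and Lane, one of several in the literature). The function names, the synthetic data, and the choice of 999 permutations are all illustrative assumptions.

```python
import random


def residuals(y, x):
    """Residuals from a least-squares fit of y on x (with intercept)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return [yi - mean_y - slope * (xi - mean_x) for xi, yi in zip(x, y)]


def partial_slope(y, x1, x2):
    """Coefficient on x1 in y ~ x1 + x2, via the Frisch-Waugh-Lovell identity."""
    r1 = residuals(x1, x2)  # the part of x1 not explained by x2
    return sum(a * b for a, b in zip(r1, y)) / sum(a * a for a in r1)


def permutation_p_value(y, x1, x2, n_perm=999, seed=0):
    """Two-sided permutation p-value for H0: the coefficient on x1 is zero."""
    rng = random.Random(seed)
    observed = partial_slope(y, x1, x2)
    # Under H0 the residuals of the reduced model y ~ x2 are treated as
    # exchangeable, so they are shuffled and added back to the fitted values.
    reduced_resid = residuals(y, x2)
    fitted = [yi - ri for yi, ri in zip(y, reduced_resid)]
    extreme = 0
    for _ in range(n_perm):
        shuffled = reduced_resid[:]
        rng.shuffle(shuffled)
        y_star = [f + r for f, r in zip(fitted, shuffled)]
        if abs(partial_slope(y_star, x1, x2)) >= abs(observed):
            extreme += 1
    # Count the observed statistic as one of the permutations.
    return (extreme + 1) / (n_perm + 1)


# Synthetic data for illustration only: x1 has a real effect on y.
rng = random.Random(1)
n = 40
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

p_value = permutation_p_value(y, x1, x2)
print(f"slope on x1: {partial_slope(y, x1, x2):.3f}, permutation p-value: {p_value:.3f}")
```

The shuffle is only justified if the reduced-model residuals are exchangeable under the null. If observations are clustered, for example, shuffling freely across clusters breaks that assumption, which is the kind of violation the abstract flags as affecting Type I error rates.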
-
Guidelines and Best Practices to Share Deidentified Data and Code
Authors:
Nicholas J. Horton,
Sara Stoudt
Abstract:
In 2022, the Journal of Statistics and Data Science Education (JSDSE) instituted augmented requirements for authors to post deidentified data and code underlying their papers. These changes were prompted by an increased focus on reproducibility and open science (NASEM 2019). A recent review of data availability practices noted that "such policies help increase the reproducibility of the published literature, as well as make a larger body of data available for reuse and re-analysis" (PLOS ONE, 2024). JSDSE values accessibility as it endeavors to share knowledge that can improve educational approaches to teaching statistics and data science. Because institution, environment, and students differ across readers of the journal, it is especially important to facilitate the transfer of a journal article's findings to new contexts. This process may require digging into more of the details, including the deidentified data and code. Our goal is to provide our readers and authors with a review of why the requirements for code and data sharing were instituted, summarize ongoing trends and developments in open science, discuss options for data and code sharing, and share advice for authors.
Submitted 8 July, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
Data science transfer pathways from associate's to bachelor's programs
Authors:
Benjamin S. Baumer,
Nicholas J. Horton
Abstract:
A substantial fraction of students who complete their college education at a public university in the United States begin their journey at one of the 935 public two-year colleges. While the number of four-year colleges offering bachelor's degrees in data science continues to increase, data science instruction at many two-year colleges lags behind. A major impediment is the relative paucity of introductory data science courses that serve multiple student audiences and can easily transfer. In addition, the lack of pre-defined transfer pathways (or articulation agreements) for data science creates a growing disconnect that leaves students who want to study data science at a disadvantage. We describe opportunities and barriers to data science transfer pathways. Five points of curricular friction merit attention: 1) a first course in data science, 2) a second course in data science, 3) a course in scientific computing, data science workflow, and/or reproducible computing, 4) lab sciences, and 5) navigating communication, ethics, and application domain requirements in the context of general education and liberal arts course mappings. We catalog existing transfer pathways, efforts to align curricula across institutions, obstacles to overcome with minimally-disruptive solutions, and approaches to foster these pathways. Improvements in these areas are critically important to ensure that a broad and diverse set of students are able to engage and succeed in undergraduate data science programs.
Submitted 6 January, 2023; v1 submitted 22 October, 2022;
originally announced October 2022.
-
Fostering better coding practices for data scientists
Authors:
Randall Pruim,
Maria-Cristiana Gîrjău,
Nicholas J. Horton
Abstract:
Many data science students and practitioners don't see the value in making time to learn and adopt good coding practices as long as the code "works". However, code standards are an important part of modern data science practice, and they play an essential role in the development of data acumen. Good coding practices lead to more reliable code and save more time than they cost, making them important even for beginners. We believe that principled coding is vital for quality data science practice. To effectively instill these practices within academic programs, instructors and programs need to begin establishing these practices early, to reinforce them often, and to hold themselves to a higher standard while guiding students. We describe key aspects of good coding practices for data science, illustrating with examples in R and in Python, though similar standards are applicable to other software environments. Practical coding guidelines are organized into a top ten list.
Submitted 25 August, 2023; v1 submitted 8 October, 2022;
originally announced October 2022.
-
Spam four ways: Making sense of text data
Authors:
Nicholas J. Horton,
Jie Chao,
William Finzer,
Phebe Palmer
Abstract:
The world is full of text data, yet text analytics has not traditionally played a large part in statistics education. We consider four different ways to provide students with opportunities to explore whether email messages are unwanted correspondence (spam). Text from subject lines is used to identify features that can be used in classification. The approaches include use of a Model Eliciting Activity, exploration with CODAP, modeling with a specially designed Shiny app, and coding more sophisticated analyses using R. The approaches vary in their use of technology and code, but all share the common goal of using data to make better decisions and assessing the accuracy of those decisions.
Submitted 11 February, 2022;
originally announced February 2022.
-
An educator's perspective of the tidyverse
Authors:
Mine Çetinkaya-Rundel,
Johanna Hardin,
Benjamin S. Baumer,
Amelia McNamara,
Nicholas J. Horton,
Colin Rundel
Abstract:
Computing makes up a large and growing component of data science and statistics courses. Many of those courses, especially when taught by faculty who are statisticians by training, teach R as the programming language. A number of instructors have opted to build much of their teaching around use of the tidyverse. The tidyverse, in the words of its developers, "is a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next". These shared principles have led to the widespread adoption of the tidyverse ecosystem. A large part of this usage is because the tidyverse tools have been intentionally designed to ease the learning process and make it easier for users to learn new functions as they engage with additional pieces of the larger ecosystem. Moreover, the functionality offered by the packages within the tidyverse spans the entire data science cycle, which includes data import, visualisation, wrangling, modeling, and communication. We believe the tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle. In this paper, we introduce the tidyverse from an educator's perspective. We provide a brief introduction to the tidyverse, demonstrate how foundational statistics and data science tasks are accomplished with the tidyverse, and discuss the strengths of the tidyverse, particularly in the context of teaching and learning.
Submitted 22 April, 2022; v1 submitted 7 August, 2021;
originally announced August 2021.
-
Facilitating team-based data science: lessons learned from the DSC-WAV project
Authors:
Chelsey Legacy,
Andrew Zieffler,
Benjamin S. Baumer,
Valerie Barr,
Nicholas J. Horton
Abstract:
While coursework provides undergraduate data science students with some relevant analytic skills, many are not given the rich experiences with data and computing they need to be successful in the workplace. Additionally, students often have limited exposure to team-based data science and the principles and tools of collaboration that are encountered outside of school. In this paper, we describe the DSC-WAV program, an NSF-funded data science workforce development project in which teams of undergraduate sophomores and juniors work with a local non-profit organization on a data-focused problem. To help students develop a sense of agency and improve confidence in their technical and non-technical data science skills, the project promoted a team-based approach to data science, adopting several processes and tools intended to facilitate this collaboration. Evidence from the project evaluation, including participant survey and interview data, is presented to document the degree to which the project was successful in engaging students in team-based data science, and how the project changed the students' perceptions of their technical and non-technical skills. We also examine opportunities for improvement and offer insight to other data science educators who may want to implement a similar team-based approach to data science projects at their own institutions.
Submitted 21 October, 2021; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Integrating computing in the statistics and data science curriculum: Creative structures, novel skills and habits, and ways to teach computational thinking
Authors:
Nicholas J. Horton,
Johanna S. Hardin
Abstract:
Nolan and Temple Lang (2010) argued for the fundamental role of computing in the statistics curriculum. In the intervening decade the statistics education community has acknowledged that computational skills are as important to statistics and data science practice as mathematics. There remains a notable gap, however, between our intentions and our actions. In this special issue of the *Journal of Statistics and Data Science Education* we have assembled a collection of papers that (1) suggest creative structures to integrate computing, (2) describe novel data science skills and habits, and (3) propose ways to teach computational thinking. We believe that it is critical for the community to redouble our efforts to embrace sophisticated computing in the statistics and data science curriculum. We hope that these papers provide useful guidance for the community to move these efforts forward.
Submitted 22 December, 2020;
originally announced December 2020.
-
Implementing version control with Git and GitHub as a learning objective in statistics and data science courses
Authors:
Matthew D. Beckman,
Mine Çetinkaya-Rundel,
Nicholas J. Horton,
Colin W. Rundel,
Adam J. Sullivan,
Maria Tackett
Abstract:
A version control system records changes to a file or set of files over time so that changes can be tracked and specific versions of a file can be recalled later. As such, it is an essential element of a reproducible workflow that deserves due consideration among the learning objectives of statistics and data science courses. This paper describes experiences and implementation decisions of four contributing faculty who are teaching different courses at a variety of institutions. Each of these faculty have set version control as a learning objective and successfully integrated one such system (Git) into one or more statistics courses. The various approaches described in the paper span different implementation strategies to suit student background, course type, software choices, and assessment practices. By presenting a wide range of approaches to teaching Git, the paper aims to serve as a resource for statistics and data science instructors teaching courses at any level within an undergraduate or graduate curriculum.
Submitted 4 November, 2020; v1 submitted 7 January, 2020;
originally announced January 2020.
-
Data scraping, ingestation, and modeling: bringing data from cars.com into the intro stats class
Authors:
Sarah McDonald,
Nicholas Jon Horton
Abstract:
New tools have made it much easier for students to develop skills to work with interesting data sets as they begin to extract meaning from data. To fully appreciate the statistical analysis cycle, students benefit from repeated experiences collecting, ingesting, wrangling, analyzing data and communicating results. How can we bring such opportunities into the classroom? We describe a classroom activity, originally developed by Danny Kaplan (Macalester College), in which students can expand upon statistical problem solving by hand-scraping data from cars.com, ingesting these data into R, then carrying out analyses of the relationships between price, mileage, and model year for a selected type of car.
Submitted 9 September, 2018;
originally announced September 2018.
-
Greater data science at baccalaureate institutions
Authors:
Amelia McNamara,
Nicholas J. Horton,
Benjamin S. Baumer
Abstract:
Donoho's JCGS (in press) paper is a spirited call to action for statisticians, who he points out are losing ground in the field of data science by refusing to accept that data science is its own domain. (Or, at least, a domain that is becoming distinctly defined.) He calls on writings by John Tukey, Bill Cleveland, and Leo Breiman, among others, to remind us that statisticians have been dealing with data science for years, and encourages acceptance of the direction of the field while also ensuring that statistics is tightly integrated.
As faculty at baccalaureate institutions (where the growth of undergraduate statistics programs has been dramatic), we are keen to ensure statistics has a place in data science and data science education. In his paper, Donoho is primarily focused on graduate education. At our undergraduate institutions, we are considering many of the same questions.
Submitted 24 October, 2017;
originally announced October 2017.
-
Updated guidelines, updated curriculum: The GAISE College Report and introductory statistics for the modern student
Authors:
Beverly L. Wood,
Megan Mocko,
Michelle Everson,
Nicholas J. Horton,
Paul Velleman
Abstract:
Since the 2005 American Statistical Association's (ASA) endorsement of the Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report, changes in the statistics field and statistics education have had a major impact on the teaching and learning of statistics. We now live in a world where "Statistics - the science of learning from data - is the fastest-growing science, technology, engineering, and math (STEM) undergraduate degree in the United States," according to the ASA, and where many jobs demand an understanding of how to explore and make sense of data. In light of these new reports and other changes and demands on the discipline, a group of volunteers revised the 2005 GAISE College Report. The updated report was endorsed by the Board of Directors of the American Statistical Association in July 2016. To help shed additional light on the revision process and subsequent changes in the report, we review the report and share insights into the committee's thoughts and assumptions.
Submitted 26 May, 2017;
originally announced May 2017.
-
Data Visualization on Day One: Bringing Big Ideas into Intro Stats Early and Often
Authors:
Xiaofei Wang,
Cynthia Rush,
Nicholas Jon Horton
Abstract:
In a world awash with data, the ability to think and compute with data has become an important skill for students in many fields. For that reason, inclusion of some level of statistical computing in many introductory-level courses has grown more common in recent years. Existing literature has documented multiple success stories of teaching statistics with R, bolstered by the capabilities of R Markdown. In this article, we present an in-class data visualization activity intended to expose students to R and R Markdown during the first week of an introductory statistics class. The activity begins with a brief lecture on exploratory data analysis in R. Students are then placed in small groups tasked with exploring a new dataset to produce three visualizations that describe particular insights that are not immediately obvious from the data. Upon completion, students will have produced a series of univariate and multivariate visualizations on a real dataset and practiced describing them.
Submitted 23 May, 2017;
originally announced May 2017.
-
A mean score method for sensitivity analysis to departures from the missing at random assumption in randomised trials
Authors:
Ian R. White,
James Carpenter,
Nicholas J. Horton
Abstract:
Most analyses of randomised trials with incomplete outcomes make untestable assumptions and should therefore be subjected to sensitivity analyses. However, methods for sensitivity analyses are not widely used. We propose a mean score approach for exploring global sensitivity to departures from missing at random or other assumptions about incomplete outcome data in a randomised trial. We assume a single outcome analysed under a generalised linear model. One or more sensitivity parameters, specified by the user, measure the degree of departure from missing at random in a pattern mixture model. Advantages of our method are that its sensitivity parameters are relatively easy to interpret and so can be elicited from subject matter experts; it is fast and non-stochastic; and its point estimate, standard error and confidence interval agree perfectly with standard methods when particular values of the sensitivity parameters make those standard methods appropriate. We illustrate the method using data from a mental health trial.
Submitted 2 May, 2017;
originally announced May 2017.
-
Enriching students' conceptual understanding of confidence intervals: An interactive trivia-based classroom activity
Authors:
Xiaofei Wang,
Nicholas G. Reich,
Nicholas J. Horton
Abstract:
Confidence intervals provide a way to determine plausible values for a population parameter. They are omnipresent in research articles involving statistical analyses. Appropriately, a key statistical literacy learning objective is the ability to interpret and understand confidence intervals in a wide range of settings. As instructors, we devote a considerable amount of time and effort to ensure that students master this topic in introductory courses and beyond. Yet, studies continue to find that confidence intervals are commonly misinterpreted and that even experts have trouble calibrating their individual confidence levels. In this article, we present a ten-minute trivia game-based activity that addresses these misconceptions by exposing students to confidence intervals from a personal perspective. We describe how the activity can be integrated into a statistics course as a one-time activity or with repetition at intervals throughout a course, discuss results of using the activity in class, and present possible extensions.
Submitted 29 January, 2017;
originally announced January 2017.
-
Using a "Study of Studies" to help statistics students assess research findings
Authors:
Azka Javaid,
Xiaofei Wang,
Nicholas J Horton
Abstract:
One learning goal of the introductory statistics course is to develop the ability to make sense of research findings in published papers. The Atlantic magazine regularly publishes a feature called "Study of Studies" that summarizes multiple articles published in a particular domain. We describe a classroom activity to develop this capacity using the "Study of Studies." In this activity, students read capsule summaries, published in April 2015, of twelve research papers related to restaurants and dining. The selected papers report on topics such as how seating arrangement, server posture, plate color and size, and the use of background music relate to revenue, ambiance, and perceived food quality. The students are assigned one of the twelve papers to read and critique as part of a small group. Their group critiques are shared with the class and the instructor.
A pilot study was conducted during the 2015-2016 academic year at Amherst College. Students noted that key details were not included in the published summary. They were generally skeptical of the published conclusions. The students often provided additional summarization of information from the journal articles that better describe the results. By independently assessing and comparing the original study conclusions with the capsule summary in the "Study of Studies," students can practice developing judgment and assessing the validity of statistical results.
Submitted 29 January, 2017;
originally announced January 2017.
-
Challenges and opportunities for statistics and statistical education: looking back, looking forward
Authors:
Nicholas Jon Horton
Abstract:
The 175th anniversary of the ASA provides an opportunity to look back into the past and peer into the future. What led our forebears to found the association? What commonalities do we still see? What insights might we glean from their experiences and observations? I will use the anniversary as a chance to reflect on where we are now and where we are headed in terms of statistical education amidst the growth of data science. Statistics is the science of learning from data. By fostering more multivariable thinking, building data-related skills, and developing simulation-based problem solving, we can help to ensure that statisticians are fully engaged in data science and the analysis of the abundance of data now available to us.
Submitted 28 April, 2015; v1 submitted 7 March, 2015;
originally announced March 2015.
-
Setting the stage for data science: integration of data management skills in introductory and second courses in statistics
Authors:
Nicholas J. Horton,
Benjamin S. Baumer,
Hadley Wickham
Abstract:
Many have argued that statistics students need additional facility to express statistical computations. By introducing students to commonplace tools for data management, visualization, and reproducible analysis in data science and applying these to real-world scenarios, we prepare them to think statistically. In an era of increasingly big data, it is imperative that students develop data-related capacities, beginning with the introductory course. We believe that the integration of these precursors to data science into our curricula, early and often, will help statisticians be part of the dialogue regarding "Big Data" and "Big Questions".
Submitted 1 February, 2015;
originally announced February 2015.
-
Data Science in Statistics Curricula: Preparing Students to "Think with Data"
Authors:
Johanna Hardin,
Roger Hoerl,
Nicholas J. Horton,
Deborah Nolan
Abstract:
A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to utilize databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes. The goal of this paper is to motivate the importance of data science proficiency and to provide examples and resources for instructors to implement data science in their own statistics curricula. We provide case studies from seven institutions. These varied approaches to teaching data science demonstrate curricular innovations to address new needs. Also included here are examples of assignments designed for courses that foster engagement of undergraduates with data and data science.
Submitted 4 August, 2015; v1 submitted 12 October, 2014;
originally announced October 2014.
-
R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics
Authors:
Ben Baumer,
Mine Cetinkaya-Rundel,
Andrew Bray,
Linda Loi,
Nicholas J. Horton
Abstract:
Nolan and Temple Lang argue that "the ability to express statistical computations is an essential skill." A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation.
Submitted 8 February, 2014;
originally announced February 2014.
-
Teaching precursors to data science in introductory and second courses in statistics
Authors:
Nicholas J Horton,
Benjamin S Baumer,
Hadley Wickham
Abstract:
Statistics students need to develop the capacity to make sense of the staggering amount of information collected in our increasingly data-centered world. Data science is an important part of modern statistics, but our introductory and second statistics courses often neglect this fact. This paper discusses ways to provide a practical foundation for students to learn to "compute with data" as defined by Nolan and Temple Lang (2010), as well as develop "data habits of mind" (Finzer, 2013). We describe how introductory and second courses can integrate two key precursors to data science: the use of reproducible analysis tools and access to large databases. By introducing students to commonplace tools for data management, visualization, and reproducible analysis in data science and applying these to real-world scenarios, we prepare them to think statistically in the era of big data.
Submitted 14 January, 2014;
originally announced January 2014.
-
Adjusting models of ordered multinomial outcomes for nonignorable nonresponse in the occupational employment statistics survey
Authors:
Nicholas J. Horton,
Daniell Toth,
Polly Phipps
Abstract:
An establishment's average wage, computed from administrative wage data, has been found to be related to occupational wages. These occupational wages are a primary outcome variable for the Bureau of Labor Statistics Occupational Employment Statistics survey. Motivated by the fact that nonresponse in this survey is associated with average wage even after accounting for other establishment characteristics, we propose a method that uses the administrative data for imputing missing occupational wage values due to nonresponse. This imputation is complicated by the structure of the data. Since occupational wage data is collected in the form of counts of employees in predefined wage ranges for each occupation, weighting approaches to deal with nonresponse do not adequately adjust the estimates for certain domains of estimation. To preserve the current data structure, we propose a method to impute each missing establishment's wage interval count data as an ordered multinomial random variable using a separate survival model for each occupation. Each model incorporates known auxiliary information for each establishment associated with the distribution of the occupational wage data, including geographic and industry characteristics. This flexible model allows the baseline hazard to vary by occupation while allowing predictors to adjust the probabilities of an employee's salary falling within the specified ranges. An empirical study and simulation results suggest that the method imputes missing OES wages that are associated with the average wage of the establishment in a way that more closely resembles the observed association.
Submitted 31 July, 2014; v1 submitted 3 January, 2014;
originally announced January 2014.
-
I hear, I forget. I do, I understand: a modified Moore-method mathematical statistics course
Authors:
Nicholas Jon Horton
Abstract:
Moore introduced a method for graduate mathematics instruction that consisted primarily of individual student work on challenging proofs (Jones, 1977). Cohen (1982) described an adaptation with less explicit competition suitable for undergraduate students at a liberal arts college. This paper details an adaptation of this modified Moore-method to teach mathematical statistics, and describes ways that such an approach helps engage students and foster the teaching of statistics.
Groups of students worked a set of 3 difficult problems (some theoretical, some applied) every two weeks. Class time was devoted to coaching sessions with the instructor, group meeting time, and class presentations. R was used to estimate solutions empirically where analytic results were intractable, as well as to provide an environment to undertake simulation studies with the aim of deepening understanding and complementing analytic solutions. Each group presented comprehensive solutions to complement oral presentations. Development of parallel techniques for empirical and analytic problem solving was an explicit goal of the course, which also attempted to communicate ways that statistics can be used to tackle interesting problems. The group problem solving component and use of technology allowed students to attempt much more challenging questions than they could otherwise solve.
Submitted 28 September, 2013;
originally announced September 2013.