Computational Probability and Statistics

Authors

Bradley Warner

Brianna Hitt

Ken Horton

Published

October 29, 2024

Preface

This book is based on the notes we created for our students as part of a one-semester course on probability and statistics. We developed these notes from three primary resources. The most important is the OpenIntro Introductory Statistics with Randomization and Simulation (ISRS) (Diez, Barr, and Çetinkaya-Rundel 2014) book. In parts, we have used their notes and homework problems. However, in most cases we have altered their work to fit our needs. The second most important book for our work is Introduction to Probability and Statistics Using R (Kerns 2010). Finally, we have used some examples, code, and ideas from the first edition of Pruim’s book, Foundations and Applications of Statistics: An Introduction Using R (R. J. Pruim 2011).

In a 2024 reorganization of our inference block, we revised our inference case study and added a chapter on sampling distributions. The materials for the case study draw on the OpenIntro ISRS (Diez, Barr, and Çetinkaya-Rundel 2014) and Introduction to Modern Statistics (2e) (Çetinkaya-Rundel and Hardin 2024). We have altered their work to fit our needs, primarily by piecing together information from their inference blocks. Additionally, the new sampling distributions chapter borrows heavily from OpenIntro Statistics (Diez, Çetinkaya-Rundel, and Barr 2019) and from the sampling distributions lessons by Skew the Script (Skew the Script 2024). We have used their materials with minor modifications to transform the lesson activities into a book chapter.

Who is this book for?

We designed this book for the study of statistics that maximizes computational ideas while minimizing algebraic symbol manipulation. Although we do discuss traditional small-sample, normal-based inference and some of the classical probability distributions, we rely heavily on ideas such as simulation, permutations, and the bootstrap. This means that students with a background in differential and integral calculus will be successful with this book.

This book makes extensive use of the R programming language. In particular, we focus on both the tidyverse and mosaic packages. We include a significant amount of code in our notes and frequently demonstrate multiple ways of completing a task. We have used this book for junior and sophomore college students.

Book structure and how to use it

This book is divided into four parts. Each part begins with a case study that introduces many of the main ideas of that part. Each chapter is designed to be a standalone 50-minute lesson. Within each chapter, we give exercises that can be worked in class and we provide learning objectives.

This book assumes students have access to R. We keep the number of homework problems to a reasonable level and assign all of them.

The four parts of the book are:

  1. Descriptive Statistical Modeling: This part introduces the student to data collection methods, summary statistics, visual summaries, and exploratory data analysis.

  2. Probability Modeling: We discuss the foundational ideas of probability, counting methods, and common distributions. We use both calculus and simulation to find moments and probabilities. We introduce basic ideas of multivariate probability. We include method of moments and maximum likelihood estimators.

  3. Inferential Statistical Modeling: We discuss many of the basic inference ideas found in a traditional introductory statistics class but we add ideas of bootstrap and permutation methods.

  4. Predictive Statistical Modeling: The final part introduces prediction methods, mainly in the form of linear regression. This part also includes inference for regression.

The learning outcomes for this course are to use computational and mathematical statistical/probabilistic concepts for:

  1. Developing probabilistic models.
  2. Developing statistical models for description, inference, and prediction.
  3. Advancing practical and theoretical analytic experience and skills.

Prerequisites

To take this course, students are expected to have completed calculus up through and including integral calculus. We do have multivariate ideas in the course, but they are easily taught and don’t require previous exposure to Calculus III (multivariable calculus). We don’t assume the students have any programming experience and, thus, we include a great deal of code. We have historically supplemented the course with DataCamp courses. We have also used Posit Cloud to help students get started in R without the burden of installing and maintaining software.

Packages

These notes make use of the following packages in R: knitr (Xie 2024), rmarkdown (Allaire et al. 2024), mosaic (R. Pruim, Kaplan, and Horton 2024), tidyverse (Wickham 2023), ISLR (James et al. 2021), vcd (Meyer et al. 2023), ggplot2 (Wickham et al. 2024), MASS (Ripley 2024), openintro (Çetinkaya-Rundel et al. 2024), broom (Robinson, Hayes, and Couch 2024), infer (Bray et al. 2024), kableExtra (Zhu 2024), and DT (Xie, Cheng, and Tan 2024).

Acknowledgements

We have been lucky to have numerous open-source resources to help facilitate this work. Thank you to those who provided edits, including Jessica Hauschild, Justin Graham, Kris Pruitt, Matt Davis, and Skyler Royse.

This book is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

File Creation Information

  • File creation date: 2024-10-29
  • R version 4.4.1 (2024-06-14)

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. Rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Bray, Andrew, Chester Ismay, Evgeni Chasnovski, Simon Couch, Ben Baumer, and Mine Çetinkaya-Rundel. 2024. Infer: Tidy Statistical Inference. https://github.com/tidymodels/infer.
Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr. 2024. Openintro: Datasets and Supplemental Functions from OpenIntro Textbooks and Labs. http://openintrostat.github.io/openintro/.
Çetinkaya-Rundel, Mine, and Johanna Hardin. 2024. Introduction to Modern Statistics. 2nd ed. OpenIntro. https://openintro-ims.netlify.app/.
Diez, David, Christopher Barr, and Mine Çetinkaya-Rundel. 2014. Introductory Statistics with Randomization and Simulation. 1st ed. OpenIntro. https://www.openintro.org/book/isrs/.
Diez, David, Mine Çetinkaya-Rundel, and Christopher D Barr. 2019. OpenIntro Statistics. 4th ed. OpenIntro. https://www.openintro.org/book/os/.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in R. https://www.statlearning.com.
Kerns, Jay. 2010. Introduction to Probability and Statistics Using R. 1st ed. http://ipsur.r-forge.r-project.org/book/download/IPSUR.pdf.
Meyer, David, Achim Zeileis, Kurt Hornik, and Michael Friendly. 2023. Vcd: Visualizing Categorical Data.
Pruim, Randall J. 2011. Foundations and Applications of Statistics: An Introduction Using R. Vol. 13. American Mathematical Soc.
Pruim, Randall, Daniel T. Kaplan, and Nicholas J. Horton. 2024. Mosaic: Project MOSAIC Statistics and Mathematics Teaching Utilities. https://github.com/ProjectMOSAIC/mosaic.
Ripley, Brian. 2024. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. http://www.stats.ox.ac.uk/pub/MASS4/.
Robinson, David, Alex Hayes, and Simon Couch. 2024. Broom: Convert Statistical Objects into Tidy Tibbles. https://broom.tidymodels.org/.
Skew the Script. 2024. “Sampling Distributions.” https://skewthescript.org/ap-stats-curriculum/part-6.
Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2024. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.
Xie, Yihui. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Xie, Yihui, Joe Cheng, and Xianying Tan. 2024. DT: A Wrapper of the JavaScript Library DataTables. https://github.com/rstudio/DT.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with Kable and Pipe Syntax. http://haozhu233.github.io/kableExtra/.