Objectives
Descriptive Statistical Modeling
1 - Data Case Study
Use R for basic analysis and visualization.
Compile a PDF report using knitr.
2 - Data Basics
Differentiate between various statistical terminologies such as case, observational unit, variables, data frame, tidy data, numerical variable, discrete numeric, continuous numeric, categorical variable, levels, scatterplot, associated variables, and independent variables, and construct examples to demonstrate their proper use in context.
Within a given dataset, evaluate different types of variables and justify their classifications (e.g. categorical, discrete numerical, continuous numerical).
Given a study description, develop an appropriate research question and justify the organization of data as tidy.
Create and interpret scatterplots using R to analyze the relationship between two numerical variables by evaluating the strength and direction of the association.
3 - Overview of Data Collection Principles
Differentiate between various statistical terminologies such as population, sample, anecdotal evidence, bias, simple random sample, systematic sample, representative sample, non-response bias, convenience sample, explanatory variable, response variable, observational study, cohort, experiment, randomized experiment, and placebo, and construct examples to demonstrate their proper use in context.
Evaluate descriptions of research projects to identify the population of interest, assess the generalizability of the study, determine the explanatory and response variables, classify the study as observational or experimental, and determine the type of sample used.
Design and justify sampling procedures for various research contexts, comparing the strengths and weaknesses of different sampling methods (simple random, systematic, convenience, etc.) and proposing improvements to minimize bias and enhance representativeness.
4 - Studies
Differentiate between various statistical terminologies such as observational study, confounding variable, prospective study, retrospective study, simple random sampling, stratified sampling, strata, cluster sampling, multistage sampling, experiment, randomized experiment, control, replicate, blocking, treatment group, control group, blinded study, placebo, placebo effect, and double-blind, and construct examples to demonstrate their proper use in context.
Evaluate study descriptions using appropriate terminology, and analyze the study design for potential biases or confounding variables.
Given a scenario, identify and assess flaws in reasoning, and propose robust study and sampling methodologies to address these flaws.
5 - Numerical Data
Differentiate between various statistical terminologies such as scatterplot, mean, distribution, point estimate, weighted mean, histogram, data density, right skewed, left skewed, symmetric, mode, unimodal, bimodal, multimodal, variance, standard deviation, density plot, box plot, median, interquartile range, first quartile, third quartile, whiskers, outlier, robust estimate, and transformation, and construct examples to demonstrate their proper use in context.
Using R, generate and interpret summary statistics for numerical variables.
Create and evaluate graphical summaries of numerical variables using R, choosing the most appropriate types of plots for different data characteristics and research questions.
Synthesize numerical and graphical summaries to provide interpretations and explanations of a data set.
6 - Categorical Data
Differentiate between various statistical terminologies such as factor, contingency table, marginal counts, joint counts, frequency table, relative frequency table, bar plot, conditioning, segmented bar plot, mosaic plot, pie chart, side-by-side box plot, and density plot, and construct examples to demonstrate their proper use in context.
Using R, generate and interpret tables for categorical variables.
Using R, generate and interpret summary statistics for numerical variables by groups.
Create and evaluate graphical summaries of both categorical and numerical variables using R, selecting the most appropriate visualization techniques for different types of data and research questions.
Synthesize numerical and graphical summaries to provide interpretations and explanations of a data set.
Probability Modeling
7 - Probability Case Study
Use R to simulate a probabilistic model.
Gain an introduction to probabilistic thinking through computational, mathematical, and data science approaches.
8 - Probability Rules
Differentiate between various statistical terminologies such as sample space, outcome, event, subset, intersection, union, complement, probability, mutually exclusive, exhaustive, independent, multiplication rule, permutation, and combination, and construct examples to demonstrate their proper use in context.
Apply basic probability properties and counting rules to calculate the probabilities of events in different scenarios. Interpret the calculated probabilities in context.
Explain and illustrate the basic axioms of probability.
Use R to perform calculations and simulations for determining the probabilities of events.
9 - Conditional Probability
Define and differentiate between conditional probability and joint probability, and provide real-world examples to illustrate these concepts and their differences.
Calculate conditional probabilities from given data or scenarios using their formal definition, and interpret these probabilities in the context of practical examples.
Using conditional probability, determine whether two events are independent and justify your conclusion with appropriate calculations and reasoning.
Apply Bayes’ Rule to solve problems both mathematically and through simulation using R.
10 - Random Variables
Differentiate between various statistical terminologies such as random variable, discrete random variable, continuous random variable, sample space/support, probability mass function, cumulative distribution function, moment, expectation, mean, and variance, and construct examples to demonstrate their proper use in context.
For a given discrete random variable, derive and interpret the probability mass function (pmf) and apply this function to calculate the probabilities of various events.
Simulate random variables for a discrete distribution using R.
Calculate and interpret the moments, such as expected value/mean and variance, of a discrete random variable.
Calculate and interpret the expected value/mean and variance of a linear transformation of a random variable.
11 - Continuous Random Variables
Differentiate between various statistical terminologies such as probability density function (pdf) and cumulative distribution function (cdf) for continuous random variables, and construct examples to demonstrate their proper use in context.
For a given continuous random variable, derive and interpret the probability density function (pdf) and apply this function to calculate the probabilities of various events.
Calculate and interpret the moments, such as the expected value/mean and variance, of a continuous random variable.
12 - Named Discrete Distributions
Differentiate between common discrete distributions (Uniform, Binomial, Poisson) by identifying their parameters, assumptions, and moments. Evaluate scenarios to determine the most appropriate distribution to model various types of data.
Apply R to calculate probabilities and quantiles, and simulate random variables for common discrete distributions.
13 - Named Continuous Distributions
Differentiate between common continuous distributions (Uniform, Exponential, Normal) by identifying their parameters, assumptions, and moments. Evaluate scenarios to determine the most appropriate distribution to model various types of data.
Apply R to calculate probabilities and quantiles, and simulate random variables for common continuous distributions.
State and apply the empirical rule (68-95-99.7 rule).
Explain the relationship between the Poisson process and the Poisson and Exponential distributions, and describe how these distributions model different aspects of the same process.
Apply the memoryless property in the context of the Exponential distribution and use it to simplify probability calculations.
14 - Multivariate Distributions
Differentiate between joint probability mass/density functions (pmf/pdf), marginal pmfs/pdfs, and conditional pmfs/pdfs, and provide real-world examples to illustrate these concepts and their differences.
For a given joint pmf/pdf, derive the marginal and conditional pmfs/pdfs through summation or integration or using R.
Apply joint, marginal, and conditional pmfs/pdfs to calculate probabilities of events involving multiple random variables.
15 - Multivariate Expectation
Given a joint pmf/pdf, calculate and interpret the expected values/means and variances of random variables and functions of random variables.
Differentiate between covariance and correlation, and given a joint pmf/pdf, calculate and interpret the covariance and correlation between two random variables.
Given a joint pmf/pdf, determine whether random variables are independent of one another and justify your conclusion with appropriate calculations and reasoning.
Calculate and interpret conditional expectations for given joint pmfs/pdfs.
16 - Transformations
Determine the distribution of a transformed discrete random variable using appropriate methods, and use it to calculate probabilities.
Determine the distribution of a transformed continuous random variable using appropriate methods, and use it to calculate probabilities.
Determine the distribution of a transformation of multivariate random variables using simulation, and use it to calculate probabilities.
17 - Estimation Methods
Apply the method of moments to estimate parameters or sets of parameters from given data.
Derive the likelihood function for a given random sample from a distribution.
Derive a maximum likelihood estimate of a parameter or set of parameters.
Calculate and interpret the bias of an estimator by analyzing its expected value relative to the true parameter.
Inferential Statistical Modeling
18 - Inferential Thinking Case Study
Using bootstrap methods, obtain and interpret a confidence interval for an unknown parameter, based on a random sample.
Conduct a hypothesis test using a randomization test, including all four steps.
19 - Sampling Distributions
Differentiate between various statistical terminologies such as point estimate, parameter, sampling error, bias, sampling distribution, and standard error, and construct examples to demonstrate their proper use in context.
Construct a sampling distribution for various statistics, including the sample mean, using R.
Using a sampling distribution, make decisions about the population. In other words, understand the effect of sampling variation on our estimates.
20 - Bootstrap
Differentiate between various statistical terminologies such as sampling distribution, bootstrapping, bootstrap distribution, resample, sampling with replacement, and standard error, and construct examples to demonstrate their proper use in context.
Apply the bootstrap technique to estimate the standard error of a sample statistic.
Utilize bootstrap methods to construct and interpret confidence intervals for unknown parameters from random samples.
Analyze and discuss the advantages, disadvantages, and underlying assumptions of bootstrapping for constructing confidence intervals.
21 - Hypothesis Testing with Simulation
Differentiate between various statistical terminologies such as null hypothesis, alternative hypothesis, test statistic, p-value, randomization test, one-sided test, two-sided test, statistically significant, significance level, type I error, type II error, false positive, false negative, null distribution, and sampling distribution, and construct examples to demonstrate their proper use in context.
Apply and evaluate all four steps of a hypothesis test using randomization methods: formulating hypotheses, calculating a test statistic, determining the p-value through randomization, and making a decision based on the test outcome.
Analyze and discuss the concepts of decision errors (type I and type II errors), the differences between one-sided and two-sided tests, and the impact of choosing a significance level. Evaluate how these factors influence the conclusions and reliability of hypothesis tests and their practical implications in statistical decision-making.
Analyze how confidence intervals and hypothesis testing complement each other in making statistical inferences.
22 - Hypothesis Testing with Known Distributions
Differentiate between various statistical terminologies such as permutation test, exact test, null hypothesis, alternative hypothesis, test statistic, p-value, and power, and construct examples to demonstrate their proper use in context.
Apply and evaluate all four steps of a hypothesis test using probability models: formulating hypotheses, calculating a test statistic, determining the p-value from the probability model, and making a decision based on the test outcome.
23 - Hypothesis Testing with the Central Limit Theorem
Explain the central limit theorem and when it can be used for inference.
Apply the CLT to conduct hypothesis tests using R and interpret the results with an understanding of the CLT’s role in justifying normal approximations.
Analyze the relationship between the \(t\)-distribution and normal distribution, explain usage contexts, and evaluate how changes in parameters impact their shape and location using visualizations.
24 - Confidence Intervals
Apply asymptotic methods based on the normal distribution to construct and interpret confidence intervals for unknown parameters.
Analyze the relationships between confidence intervals, confidence level, and sample size.
Analyze how confidence intervals and hypothesis testing complement each other in making statistical inferences.
25 - Additional Hypothesis Tests
Conduct and interpret a goodness of fit test using both Pearson’s chi-squared and randomization to evaluate the independence between two categorical variables. Evaluate the assumptions for Pearson’s chi-squared test.
Analyze the relationship between the chi-squared and normal distributions, explain usage contexts, and evaluate the effects of changing degrees of freedom on the chi-squared distribution using visualizations.
Conduct and interpret a hypothesis test for equality of two means and equality of two variances using both permutation and the CLT. Evaluate the assumptions for two-sample t-tests.
26 - Analysis of Variance
Conduct and interpret a hypothesis test for equality of two or more means using both permutation and the \(F\) distribution. Evaluate the assumptions of ANOVA.
Predictive Statistical Modeling
27 - Linear Regression Case Study
Using R, generate a linear regression model and use it to produce a prediction model.
Using plots, check the assumptions of a linear regression model.
28 - Linear Regression Basics
Differentiate between various statistical terminologies such as response, predictor, linear regression, simple linear regression, coefficients, residual, and extrapolation, and construct examples to demonstrate their proper use in context.
Estimate the parameters of a simple linear regression model using a given sample of data.
Interpret the coefficients of a simple linear regression model.
Create and evaluate scatterplots with regression lines.
Identify and assess the assumptions underlying linear regression models.
29 - Linear Regression Inference
Apply statistical inference methods for \(\beta_0\) and \(\beta_1\), and evaluate the implications for the predictor-response relationship.
Write the estimated simple linear regression model and calculate and interpret the predicted response for a given value of the predictor.
Construct and interpret confidence and prediction intervals for the response variable.
30 - Linear Regression Diagnostics
Calculate and interpret the R-squared and F-statistic for a linear regression model. Evaluate these metrics to assess the model’s goodness-of-fit and overall significance.
Use R to evaluate the assumptions underlying a linear regression model.
Identify, analyze, and explain the impact of outliers and leverage points in a linear regression model.
31 - Simulation-Based Linear Regression
Apply the bootstrap to generate and interpret confidence intervals and estimates of standard error for parameter estimates in a linear regression model.
Apply the bootstrap to generate and interpret confidence intervals for predicted values from a linear regression model.
Generate bootstrap samples using two methods: sampling rows of the data and sampling residuals. Justify why you might prefer one method over the other.
Generate and interpret regression coefficients in a linear regression model with categorical explanatory variables.
32 - Multiple Linear Regression
Generate and interpret the coefficients of a multiple linear regression model. Assess the assumptions underlying multiple linear regression models.
Write the estimated multiple linear regression model and calculate and interpret the predicted response for given values of the predictors.
Generate and interpret confidence intervals for parameter estimates in a multiple linear regression model.
Generate and interpret confidence and prediction intervals for predicted values in a multiple linear regression model.
Explain the concepts of adjusted R-squared and multicollinearity in the context of multiple linear regression.
Develop and interpret multiple linear regression models that include higher-order terms, such as polynomial terms or interaction effects.
33 - Logistic Regression
Apply logistic regression using R to analyze binary outcome data. Interpret the regression output, and perform model selection.
Write the estimated logistic regression model, and calculate and interpret the predicted outputs for given values of the predictors.
Calculate and interpret confidence intervals for parameter estimates and predicted probabilities in a logistic regression model.
Generate and analyze a confusion matrix to evaluate the performance of a logistic regression model.