Machine Learning for Biostatistics (MLB)

Recent years have brought rapid growth in the amount and complexity of health data captured, requiring new statistical techniques for both predictive and descriptive learning. Machine learning algorithms for classification and prediction complement classical statistical tools in the analysis of these data. This unit covers modern machine learning methods that are particularly useful for large and complex health data.

Dr Andrew Grant, University of Sydney, Sydney School of Public Health (Semester 2)
General outline


Prerequisites

Epidemiology, Mathematical Foundations for Biostatistics, Principles of Statistical Inference, Regression Modelling for Biostatistics 1

Time commitment

8-12 hours total study time per week

Semester availability

Semester 2


Assessment

Two major assignments worth 40% each (equivalent to 2 x 2000 words) and two short assignments worth 10% each.

Prescribed Texts

James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning with Applications in R. Springer, 2013. (freely available online)

Special Computer Requirements

R and RStudio


The topics covered include: Linear Regression and K-Nearest Neighbors; Classification (logistic regression, linear discriminant analysis); Resampling Methods (cross-validation, bootstrap); Model Selection and Regularization (subset selection, shrinkage methods, dimension reduction methods); Beyond Linearity (fractional polynomials, basis functions, splines, generalized additive models); Tree-Based Methods (decision trees, bagging, random forests, boosting).


Course notes, online mini-lecture videos, online tutorials, discussion board
