Predictive Analytics for Data Science: Linear and Non-Linear Modelling: Online - (2 days)

This masterclass is an introduction to linear and non-linear predictive models. It will provide an interactive step-by-step guide to running these models and key diagnostics using the R software platform.


This masterclass is part of the ACSPRI suite of courses in social data science and is specially designed for those who want a gentle introduction to linear and non-linear predictive models in data science.








Master Class - runs over 2 days

Dr Joanna Dipnall is an applied statistician with interests in advanced statistical methods, including machine learning and deep learning techniques. She completed her Honours in Econometrics with Monash University and her PhD with IMPACT SRC, School of Medicine, Deakin University. Joanna works extensively with registry and linked medical data and collaborates closely with the Faculty of IT at Monash, supervising Masters and PhD students to integrate artificial intelligence within health research. Joanna teaches within the Monash Biostatistics Unit and is the Unit Co-coordinator for the Monash Masters of Health Data Analytics course. Joanna has taught advanced statistical methods for many years at universities and for ACSPRI.

About this course: 

Regression modelling is a foundation of data science and a must for anyone wanting to venture into this space. Understanding when and how to use linear and non-linear regression models in everyday research is an essential skill for any analyst. Linear and non-linear regression models are commonly used to quantify the relationship between two or more variables by predicting a key outcome of interest. These models are effective and powerful tools for controlling for the potential confounding effect of extraneous variables and/or for developing highly predictive models.


Linear regression relates to continuous outcomes and is a fundamental regression technique in data science. Logistic regression is used when the outcome of interest is categorical and is a fundamental classification technique in data science. When there is no theoretical or mechanistic model to suggest a particular functional form for the relationship between two or more variables of interest, Generalized Additive Models (GAMs) can be used, as they fit a nonparametric curve to the data without requiring any particular mathematical model of the nonlinearity to be defined in advance. Gaining a sound understanding of all these models is essential to understand when it is appropriate to use these techniques.
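The distinction between a continuous and a categorical outcome can be made concrete with a small sketch. The example below is illustrative only and is not course material: the masterclass itself uses R, where the corresponding functions are `lm()` and `glm(..., family = binomial)`. The data values and the function names `fit_linear` and `fit_logistic` are made up for this illustration, and only a single predictor is handled. GAMs are not covered by this sketch.

```python
import math

def fit_linear(x, y):
    """Ordinary least squares for y = b0 + b1*x (continuous outcome)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sigmoid(z):
    """Map a linear predictor to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson for P(y = 1) = sigmoid(b0 + b1*x) (binary outcome)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), weights p * (1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: one predictor, one continuous and one binary outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y_continuous = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.1]
y_binary = [0, 0, 1, 0, 1, 0, 1, 1]

lin_b0, lin_b1 = fit_linear(x, y_continuous)
log_b0, log_b1 = fit_logistic(x, y_binary)

print("linear:   intercept=%.3f slope=%.3f" % (lin_b0, lin_b1))
print("logistic: intercept=%.3f slope=%.3f" % (log_b0, log_b1))
print("P(y=1 | x=8) = %.3f" % sigmoid(log_b0 + log_b1 * 8))
```

The linear fit predicts the outcome value itself, whereas the logistic fit predicts the probability of the outcome being 1, which is why the two models need different diagnostics.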


Upon completion of this masterclass, you will have the skills required to confidently run standard linear and non-linear models using the R statistical software platform. You will have gained an understanding of when each type of model is appropriate and be able to justify the use of your model using key diagnostics. The workshop is relevant to researchers and data analysts in any area of research who want to use linear and non-linear predictive models in their research work. This workshop aims to introduce these models and their key diagnostics, and to build confidence in their use.

Course syllabus: 


Day 1: Linear Models

  • Introduction to linear regression
  • Regression diagnostics
  • Introduction to interactions
  • Use and reporting of linear models in publications
  • Exercises


Day 2: Non-Linear Models

  • Introduction to logistic regression
  • Regression diagnostics and checking the accuracy of predictions
  • Introduction to Generalized Additive Models (GAMs)
  • Use and reporting of non-linear models in publications
  • Exercises


Course format: 


This course will be run online over 2 days.


Participants will require their own computers with R and RStudio already installed. They will also need internet access to download R libraries. This course will be taught in the PC environment, but Mac users are welcome.


Please note that due to the short 2-day structure, there will not be any time set aside for analysing participants’ own data.



Recommended Background: 


This course assumes that participants have:


  1. A basic understanding of statistical concepts, including descriptive statistics (mean, median and interquartile range),
  2. A reasonable knowledge of using the R and RStudio software,
  3. Some familiarity with a PC/Mac environment including keyboard skills,
  4. An understanding of folder and file structures in the PC/Mac environment, and
  5. Some experience in using Microsoft Word and Excel or their equivalent.


Recommended Texts: 

Data Analysis and Graphics Using R by John Maindonald and W. John Braun.


Regression Analysis with R: Design and develop statistical nodes to identify unique relationships within data at scale by Giuseppe Ciaburro.