Assumptions of Linear Regression

Pallavi Satsangi
3 min read · Jun 28, 2021
Photo by Black Jiracheep on Unsplash

This writeup is a follow-up to “Basics of Linear Regression”, which you can read here. We concluded that article with a note on what kind of data qualifies for linear regression (LR). One needs to analyze the dataset before running LR on it. If one or more of these assumptions is breached, the results of our linear regression may be unreliable or even misleading.

Before listing them out, let's put a few definitions in place.

Residual: The vertical distance between each data point and the regression line. It can also be defined as the difference between the actual y value and the predicted y value (called yhat).

Variance: Quantifies how spread out a data set is. Mathematically, it is defined as the average of the squared differences from the mean.

Homoscedasticity: Describes a situation in which the error term is the same across all values of the independent variable. Informally, we can say the data has the same scatter all the way along; in other words, it is a condition in which the residuals have a constant spread/variance for all values of X (the independent variable).
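To make the first two definitions concrete, here is a minimal sketch (assuming NumPy is installed) using toy data and a hypothetical fitted line yhat = 2x + 1:

```python
import numpy as np

# toy data (hypothetical) and a hypothetical fitted line yhat = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.5, 4.5, 7.5, 8.5])
y_hat = 2.0 * x + 1.0

# residual: actual y minus predicted y (yhat)
residuals = y - y_hat

# variance: average of the squared differences from the mean
variance = np.mean((y - y.mean()) ** 2)

print(residuals)  # [ 0.5 -0.5  0.5 -0.5]
print(variance)   # 4.25
```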

With this, let's get into the list of assumptions.

1. Linearity: There needs to be a linear relationship between the independent variable, X, and the dependent variable, Y.

How is it determined if this condition is met?

One simple technique is to plot a scatter plot of X vs. Y. This allows us to see graphically whether there is a linear relationship between the two variables. If the points in the plot could fall along a straight line, then we can say that there exists some type of linear relationship between the two variables, and we can conclude that this assumption is met.
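The scatter-plot check can be sketched as follows (assuming NumPy and Matplotlib are installed; the data here is made up for illustration). The Pearson correlation is added as a numeric companion to the visual check:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# hypothetical data: y is roughly linear in x, plus some noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# the scatter plot: look for a roughly straight-line pattern
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("X vs. Y")
plt.savefig("linearity_check.png")

# numeric companion: Pearson r near +/-1 is consistent with a linear trend
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```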

2. Homoscedasticity: As we saw in the definition above, the residuals should have a constant variance at every level of X.

How is it determined if this condition is met?

Once we fit our regression line to a set of data points, we can create a scatter plot of the model's fitted values vs. the residuals of those fitted values. The plot should show that the residuals do not increase or spread out as the value of the fitted variable increases. If that is the case, we can say that this assumption is met.
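A minimal sketch of this fitted-values-vs-residuals plot, assuming NumPy and Matplotlib are installed and using made-up data with constant-variance noise:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# hypothetical data with constant-variance noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.5, size=x.size)

# fit a simple regression line, then compute fitted values and residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

# look for an even band around zero, not a funnel that widens
plt.scatter(fitted, residuals)
plt.axhline(0.0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.savefig("residuals_vs_fitted.png")
```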

3. Normality: The residuals of the model need to be normally distributed.

How is it determined if this condition is met?

There are a few formal tests for this assumption, such as Shapiro-Wilk and D'Agostino-Pearson. However, these tests are very sensitive on large datasets and may conclude that the residuals are not normally distributed when in reality they are. Therefore, it is often better to use a graphical method like a Q-Q plot to check this particular assumption. To briefly explain, a Q-Q plot (short for quantile-quantile plot) is a type of plot used to determine whether or not the residuals of a model are normally distributed. If the points roughly form a straight diagonal line, we can say that this assumption is met.
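Both checks can be sketched with SciPy and Matplotlib (assumed installed); here a normal sample simply stands in for a model's residuals:

```python
import numpy as np
from scipy import stats
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# stand-in for a model's residuals (hypothetical)
rng = np.random.default_rng(2)
residuals = rng.normal(size=200)

# formal test: a small p-value suggests non-normality
# (keep in mind it is sensitive on large samples)
w_stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")

# graphical check: points hugging the diagonal suggest normal residuals
fig, ax = plt.subplots()
stats.probplot(residuals, dist="norm", plot=ax)
fig.savefig("qq_plot.png")
```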

4. Independence: The residuals should be independent, which means there is no correlation between consecutive residuals in the dataset. This is usually relevant when using time series data.

How is it determined if this condition is met?

There are a few formal tests that can be performed, such as the Durbin-Watson test. We can also plot the residuals vs. time and see where the residual correlations fall in terms of a confidence interval. If they fall within the 95% confidence interval around zero, we can say that this assumption is met.
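The Durbin-Watson statistic can be computed directly from its formula with NumPy (libraries such as statsmodels also provide it); again the residuals here are a made-up independent sample:

```python
import numpy as np

# hypothetical independent residuals
rng = np.random.default_rng(3)
residuals = rng.normal(size=200)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals; values near 2 indicate
# no first-order autocorrelation (near 0: positive, near 4: negative)
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson = {dw:.2f}")
```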



Project Manager | Machine Learning | Data Science | Natural Language Processing | Neural Networks | MSc. Business Analytics