Basics of Linear Regression

Jun 17, 2021

What is Linear Regression

Linear regression is a supervised learning technique that finds the best-fitting trend line (the regression line) through a set of data points.

Introduction

Linear regression models the relationship between two variables (dependent and independent) by fitting a linear equation to the observed data.

Independent variable: a variable whose value does not depend on the other variables in the data set. It is the input.
Dependent variable: a variable whose value depends on other variables in the data set. It is the output we want to predict.

E.g., let's assume that we have a data set with two columns:

Column 1: No. of Bedrooms

Column 2: Price

Here “number of bedrooms” is the independent variable and “Price” is the dependent variable, because a change in the number of bedrooms affects the overall price of the home. In this example, a house with more bedrooms is priced higher and a house with fewer bedrooms is priced lower.

Let's take the columns mentioned above as our data set.

Creating a scatter plot with Number of Bedrooms (X) on the x-axis and Price (Y) on the y-axis, as shown below:

We see linearity in the above graph, meaning the points in the plot fall roughly along a straight line.

How does the algorithm work:

The goal of the algorithm is to find the regression line for which the vertical distances from the line to the data points are, taken together, as small as possible. Each of these vertical distances is called a residual. The usual measure is the sum of the squared residuals: the smaller it is, the better the line fits, and the line with the smallest sum is the regression line we use to predict future prices.
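One common way to make "smallest residuals" precise is to add up the squared residuals of all the points and prefer the line with the smaller total (the ordinary least squares criterion). Below is a minimal sketch of that comparison. The three points and the two candidate lines are made-up illustration values, and the function name is hypothetical.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Made-up points, for illustration only.
xs = [1, 2, 3]
ys = [3, 6, 7]

# Two candidate lines: y = 2x + 1 and y = x + 1.
sse_line_1 = sum_squared_residuals(2, 1, xs, ys)
sse_line_2 = sum_squared_residuals(1, 1, xs, ys)

# The line with the smaller total is the better fit for these points.
print(sse_line_1, sse_line_2)  # 1 19
```

For these points the first line is the better candidate because its total squared residual is smaller.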

How does the algorithm do this?

Linear regression deals with a straight line, and we know that the equation of a line is y = ax + b. Here a is the slope (the direction and steepness of the line) and b is the y-intercept (the value of y where the line crosses the y-axis).

If we can find the values of both a and b, then by plugging a future “number of bedrooms” into the equation as the input x, we get the predicted “Price” as the output y.

How does the algorithm calculate the values of a & b?

By using the least-squares formula, in which:

x = independent variable

y = dependent variable

The formula uses the sums of x, y, xy and x² over all rows.
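For reference, the textbook closed-form least-squares expressions for the slope a and the intercept b, written with the same sums, are usually given as follows. This is the standard form and is offered here as an assumption about the formula intended; n is the number of rows.

```latex
a = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
b = \frac{\sum y - a\sum x}{n}
```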

Here “n” is the total number of rows; in our example it is 5. Making the calculations as per the formula:

Solving for a and b above, we get:

a = 2411

b = 2176

Substituting these values into the line equation, we get

y = 2411x + 2176
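The same kind of calculation can be sketched in a few lines of Python. This is a minimal sketch using the standard closed-form least-squares sums, not necessarily the exact steps used for the numbers above. The five rows below are made-up values chosen to lie exactly on a line, so they do not reproduce a = 2411 and b = 2176, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y = a*x + b from the least-squares sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical, so the slope is undefined")

    a = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    b = (sum_y - a * sum_x) / n                     # y-intercept
    return a, b


# Hypothetical data set (not the article's table): 5 rows, exactly on y = 2000x + 1000.
bedrooms = [1, 2, 3, 4, 5]
prices = [3000, 5000, 7000, 9000, 11000]

a, b = fit_line(bedrooms, prices)
print(a, b)  # 2000.0 1000.0
```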

Now, to test how good the line is, let's substitute the current data points from Table 1. We have

x = 1, y = 5000 (values taken from row 1 of Table 1)

Substituting:

2411(1) + 2176 = 4587
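The substitution can be checked in Python using only the fitted values and the row 1 figures quoted above. The helper name `predict` is hypothetical.

```python
# Fitted slope and intercept from the article.
a, b = 2411, 2176


def predict(x):
    """Predicted price for x bedrooms using y = a*x + b."""
    return a * x + b


# Row 1 of Table 1: 1 bedroom, actual price 5000.
x, actual = 1, 5000
predicted = predict(x)
residual = actual - predicted  # vertical distance between the point and the line

print(predicted, residual)  # 4587 413
```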

So we got y = 4587 as the predicted price, while the original price is 5000, a difference (residual) of 413. Let's get the y values for the other data points of Table 1 in the same way:

Since the distance from each data point to the regression line is small, we can use this equation to predict the price of a house when the number of bedrooms is given as the input x.

Summary:

When we use linear regression (LR) in our programming language, the algorithm works out the best line equation in the background. But which data sets qualify for linear regression? Can we take any data and use an LR algorithm on it to make predictions? Are we supposed to look for certain properties of the data before applying LR?

For a data set to qualify for an LR algorithm, certain assumptions need to be met first. I shall cover them in a separate write-up.