Linear Regression:

This article covers the basics of linear regression and its implementation in Python for data science.

Linear regression is a statistical approach for modeling the relationship between a dependent variable and a given set of independent variables.

Here we will refer to the dependent variable as the response and to the independent variables as features.

Simple linear regression:

It is an approach for predicting a response using a single feature.

It is assumed that the two variables are linearly related. We will try to find a linear function that predicts the response value as accurately as possible from the feature.

In general we will define:
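A common notation, assumed here, for a dataset of n observations writes the feature values and the responses as two vectors of equal length:

```latex
x = (x_1, x_2, \ldots, x_n), \qquad y = (y_1, y_2, \ldots, y_n)
```

Here x_i is the feature value and y_i the response of the i-th observation.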

The scatter plot of the above dataset will look like:

[Figure: scatter plot of the dataset]

Now we have to find a line which will best fit the above scatter plot so that we are able to predict the response for new feature values.

This line is called the regression line.

The equation of the above regression line will be:

y = b_0 + b_1 * x, where y is the predicted response, x is the feature, b_0 is the intercept, and b_1 is the slope.

To build the model, we must estimate the values of the regression coefficients b_0 and b_1. Once these coefficients are estimated, the model can be used to predict responses.

Here we will use the least squares technique.
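The least squares technique chooses b_0 and b_1 so that the sum of squared residuals is as small as possible. A standard statement of that objective and its closed-form solution is sketched below. The symbols e_i (residual), J (cost), and the sample means \bar{x} and \bar{y} are notation assumed for this sketch.

```latex
\begin{aligned}
e_i &= y_i - (b_0 + b_1 x_i), \qquad
J(b_0, b_1) = \sum_{i=1}^{n} e_i^2 \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
b_0 = \bar{y} - b_1 \bar{x}
\end{aligned}
```

Setting the partial derivatives of J with respect to b_0 and b_1 to zero yields the two expressions in the second line.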

Now we will implement it in Python on our small dataset.
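A minimal sketch of such an implementation, assuming NumPy is available, is given here. The function names `estimate_coefficients` and `predict` and the ten data points are illustrative assumptions, not the article's own listing or dataset.

```python
import numpy as np


def estimate_coefficients(x, y):
    """Return (b_0, b_1) of the least squares line y = b_0 + b_1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Sum of cross-deviations of x and y, and sum of squared deviations of x.
    ss_xy = np.sum((x - x_mean) * (y - y_mean))
    ss_xx = np.sum((x - x_mean) ** 2)
    b_1 = ss_xy / ss_xx
    b_0 = y_mean - b_1 * x_mean
    return b_0, b_1


def predict(x, b_0, b_1):
    """Predict responses for new feature values with the fitted line."""
    return b_0 + b_1 * np.asarray(x, dtype=float)


# Illustrative data only (hypothetical values, one feature and one response).
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

b_0, b_1 = estimate_coefficients(x, y)
print("intercept b_0:", b_0)
print("slope b_1:", b_1)
print("prediction for x = 10:", predict(10, b_0, b_1))
```

To draw the regression line over the scatter plot, the predicted values `predict(x, b_0, b_1)` can be passed to any plotting library alongside the original points.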

The resulting graph shows the data points together with the fitted regression line.

Multiple linear regression:

It attempts to model the relationship between two or more features and a response by fitting a linear equation to the observed data.

It is a direct extension of simple linear regression.

We will consider a dataset with p features and one response. The dataset also contains n rows or observations.

Now we will define:
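A standard matrix formulation, assumed here, collects the n observations of p features into a matrix X with a leading column of ones for the intercept, the responses into a vector y, and the coefficients into a vector b:

```latex
X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{np}
\end{pmatrix}, \qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{pmatrix}

y = X b + \varepsilon, \qquad
\hat{b} = (X^{\top} X)^{-1} X^{\top} y
```

Here \varepsilon is the vector of residual errors, and \hat{b} is the least squares estimate, which exists in this form when X^{\top} X is invertible.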

Below is an implementation of multiple linear regression on the Boston house pricing dataset.
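Recent scikit-learn releases no longer ship the Boston dataset, so the sketch here illustrates the same technique on synthetic data using only NumPy. The coefficients in `true_b`, the noise level, the train/test split, and the helper `add_intercept` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a real dataset: n observations, p features.
n, p = 200, 3
X = rng.normal(size=(n, p))
true_b = np.array([4.0, 1.5, -2.0, 0.5])  # made-up intercept and p slopes
y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=n)

# Hold out the last quarter of the rows as a test set.
split = int(0.75 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


def add_intercept(features):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Least squares fit; lstsq avoids forming an explicit matrix inverse.
b_hat, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Residual errors on the held-out rows.
y_pred = add_intercept(X_test) @ b_hat
residuals = y_test - y_pred

print("estimated coefficients:", b_hat)
print("standard deviation of test residuals:", residuals.std())
```

Plotting `residuals` against `y_pred` gives a residual error plot; for a well-specified model the points scatter around zero without a visible pattern.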

The residual error plot will look like:

In the above example, we measure accuracy using the explained variance score.

We will define:
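The explained variance score is commonly defined as 1 - Var(y - y') / Var(y), where y is the observed response and y' the predicted response; the best possible score is 1.0. A minimal sketch of that definition follows, with the function name `explained_variance` and the sample values chosen purely for illustration.

```python
import numpy as np


def explained_variance(y_true, y_pred):
    """Compute 1 - Var(y_true - y_pred) / Var(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)


# Hypothetical observed responses.
y_true = np.array([3.0, -0.5, 2.0, 7.0])

# Perfect predictions leave no residual variance.
score_perfect = explained_variance(y_true, y_true)

# Always predicting the mean explains none of the variance.
score_mean = explained_variance(y_true, np.full(y_true.shape, y_true.mean()))

# A constant offset has zero variance, so this score does not penalize it.
score_shifted = explained_variance(y_true, y_true + 5.0)

print(score_perfect, score_mean, score_shifted)
```

Because a constant offset in the predictions does not change the score, it is worth checking the mean of the residuals as well when judging a fitted model.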


Now we will look at some of the assumptions that the linear regression model makes about the dataset to which it is applied.

  • Linear relationship: The relationship between the response and the features should be linear. A scatter plot can be used to test the linearity assumption. In the figures, the first shows linearly related variables, while the second and third show non-linear relationships, so the first will give better predictions.
  • Little or no multicollinearity: We assume that there is little or no multicollinearity in the data. Multicollinearity occurs when the features are not independent of each other.
  • Little or no autocorrelation: We also assume that there is little or no autocorrelation in the data. Autocorrelation occurs when the residual errors are not independent of each other.
  • Homoscedasticity: This describes a situation in which the variance of the error term is the same across all values of the independent variables. In the figures, the first shows homoscedasticity and the second shows heteroscedasticity.

Applications:

  • Trend lines: A trend line represents the variation in some quantitative data with the passage of time. When the trend follows a linear relationship, linear regression can be used to predict future values. This approach lacks scientific validity when other potential changes can affect the data.
  • Economics: Linear regression is the predominant empirical tool in economics. For example, it is used for predicting consumption spending, inventory investment, spending on imports, labor demand, and labor supply.
  • Finance: The capital asset pricing model (CAPM) uses linear regression to analyze and quantify the systematic risk of an investment.
  • Biology: Linear regression is used for modeling causal relationships between parameters in biological systems.
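The multicollinearity and autocorrelation assumptions listed earlier can be checked numerically. The sketch below, assuming NumPy and synthetic data, uses the largest pairwise feature correlation as a rough multicollinearity indicator and the Durbin-Watson statistic for first-order autocorrelation of residuals (values near 2 suggest little autocorrelation, values near 0 suggest strong positive autocorrelation). The function names and all data are illustrative assumptions.

```python
import numpy as np


def max_feature_correlation(X):
    """Largest absolute pairwise correlation between feature columns."""
    corr = np.corrcoef(X, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.max(np.abs(off_diagonal))


def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)


rng = np.random.default_rng(0)

# Independent features versus a matrix whose third column nearly copies the first.
X_independent = rng.normal(size=(500, 3))
X_collinear = X_independent.copy()
X_collinear[:, 2] = 2.0 * X_collinear[:, 0] + rng.normal(scale=0.01, size=500)

# Uncorrelated residuals versus a random walk, which is strongly autocorrelated.
white_noise = rng.normal(size=500)
random_walk = np.cumsum(rng.normal(size=500))

print("max correlation, independent:", max_feature_correlation(X_independent))
print("max correlation, collinear:", max_feature_correlation(X_collinear))
print("Durbin-Watson, white noise:", durbin_watson(white_noise))
print("Durbin-Watson, random walk:", durbin_watson(random_walk))
```

Pairwise correlation only detects collinearity between two features at a time; dependence among three or more features needs a different diagnostic.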

To learn more about linear regression in Python for data science, see the related posts on this blog, which give more insight into how the data industry works. If you are an aspiring data scientist looking for a proper course, PST Analytics is an institute that offers one.
