Five Obstacles Faced in Linear Regression

These five obstacles may occur when you train a linear regression model on your data set.

Aayush Ostwal
Towards Data Science


Let's go from yellow, the color of danger, to yellow, the color of sunshine and happiness. (Photo by Casey Thiebeau on Unsplash)

Linear regression is one of the simplest machine learning algorithms. Its interpretability and ease of training make it a natural first step in machine learning. Because it is relatively uncomplicated, linear regression also serves as a foundation for understanding more complex algorithms.

To learn what linear regression is, how we train it, how we obtain the best-fit line, how we interpret it, and how we assess the accuracy of the fit, you may visit the following article.

After understanding the basic intuition of linear regression, certain concepts make it more fascinating and more fun. They also provide a deeper understanding of the flaws in the algorithm, their impact, and their remedies. We will explore these concepts in this article.

We all know that linear regression involves a few assumptions. These assumptions keep the structure of the algorithm straightforward, but they are also the reason it has several flaws, which is why we need to study and understand them.

This article discusses the problems that may occur while training a linear model, and some methods to deal with them.

Five problems that lie in the scope of this article are:

  1. Non-Linearity of the response-predictor relationships
  2. Correlation of error terms
  3. A non-constant variance of the error term [Heteroscedasticity]
  4. Collinearity
  5. Outliers and High Leverage Points

Non-Linearity of the response-predictor relationships


The cause of this problem is one of the assumptions behind linear regression: the assumption of linearity, which states that the relationship between the predictor and the response is linear.

If the actual relationship between the response and the predictor is not linear, then all the conclusions we draw become invalid, and the accuracy of the model may drop significantly.

So, how can we deal with this problem?

Remedy:

A straightforward way to detect this problem is to examine residual plots.

A residual plot shows the residuals, i.e., the differences between the actual and predicted values, against the predictor.

Once we have the residual plot, we search for a pattern. If a pattern is visible, there is a non-linear relationship between the response and the predictor; if the plot looks random, we are on the right path!

After analyzing the type of pattern, we can apply a non-linear transformation such as a square root, cube root, or log function, which removes the non-linearity to some extent so that our linear model performs well.

Example:

Let us try to fit a straight line to a quadratic function. We will generate some random points using NumPy and take their squares as the response.

import numpy as np
import seaborn as sns

# Generate 100 random predictor values and a purely quadratic response
x = np.random.rand(100)
y = x * x
sns.scatterplot(x=x, y=y)

Let us see the scatter plot between x and y (Fig.1).

Fig.1

Now, let us try to fit a linear model to this data and see the plot between residual and predictor.

from sklearn.linear_model import LinearRegression

# Fit a straight line to the quadratic data
model = LinearRegression()
model.fit(x.reshape(-1, 1), y.reshape(-1, 1))
predictions = model.predict(x.reshape(-1, 1))

# Residual = actual value - predicted value; plot it against the predictor
residual = y.reshape(-1, 1) - predictions
sns.scatterplot(x=x, y=residual.ravel())

Fig.2

We can see a quadratic trend in the residual plot. This trend helps us identify the non-linearity in the data. Further, we can apply a square-root transformation to the response to make the data more suitable for a linear model.

If the data were linear, the residual plot would show random points with no systematic pattern. In that case, we could move forward with the model.

Correlation of error terms


A principal assumption of the linear model is that the error terms are uncorrelated. "Uncorrelated" means that the error for one observation is independent of the errors for the others.

Correlation among error terms may arise for several reasons. For instance, if we are observing people's weights and heights, the errors may be correlated because the people share a diet, an exercise routine, environmental factors, or membership in the same family.

What happens to the model when the errors are correlated? The standard errors of the model coefficients get underestimated, so confidence and prediction intervals will be narrower than they should be.

For more insights, please refer to the example below.

Remedies:

The remedy is the same as for the previous problem: residual plots. If trends are visible in the residual plot, i.e., if the residuals can be described by some function of the observation order or the predictor, then the error terms are correlated.
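As a rough sketch of this check, the snippet below simulates correlated errors as a running sum of noise (an assumption made purely for illustration), fits a linear model, and plots the residuals in observation order; long runs of residuals with the same sign suggest correlated error terms.

import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
correlated_noise = np.cumsum(rng.normal(scale=0.05, size=200))  # adjacent errors track each other
y = 2 * x + correlated_noise

model = LinearRegression().fit(x.reshape(-1, 1), y)
residual = y - model.predict(x.reshape(-1, 1))

# Plot residuals in observation order: long runs above or below zero
# indicate correlated error terms
sns.scatterplot(x=np.arange(len(residual)), y=residual)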

Example:

To understand the impact of correlation on the confidence interval, we should note two basic points.

  1. When we estimate model parameters, there is some error involved (the standard error, SE). This error arises because we estimate population characteristics from a sample, and it is inversely proportional to the square root of the number of observations.
  2. A 95% confidence interval for a model parameter extends roughly two standard errors on either side of the estimate. (Please refer to Fig.3)

Fig.3 Confidence Interval for model parameters.

Now, suppose we have n data points and we calculate the standard error (SE) and a confidence interval. Next, suppose we simply duplicate our data, so that every observation and its error term appears twice.

If we now recalculate the SE, we compute it for 2n observations. As a result, the standard error will be smaller by a factor of √2 (SE is inversely proportional to the square root of the number of observations), and we will obtain a narrower confidence interval, even though we have added no new information.
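A minimal sketch of this effect, using simulated data and the textbook formula for the standard error of the OLS slope (the data and noise level are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.random(n)
y = 2 * x + rng.normal(scale=0.5, size=n)   # true slope = 2, i.i.d. noise

def slope_se(x, y):
    """Standard error of the OLS slope estimate."""
    beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    beta0 = y.mean() - beta1 * x.mean()
    resid = y - (beta0 + beta1 * x)
    sigma2 = (resid ** 2).sum() / (len(x) - 2)            # residual variance estimate
    return np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())

se_original = slope_se(x, y)

# Duplicating every observation adds no new information, but the formula
# now "sees" 2n points, so the reported SE shrinks by roughly sqrt(2)
se_duplicated = slope_se(np.tile(x, 2), np.tile(y, 2))
print(se_original / se_duplicated)   # roughly sqrt(2) ≈ 1.41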

A non-constant variance of the error term [Heteroscedasticity]


The source of this problem is also an assumption: that the error terms have constant variance, which is referred to as homoscedasticity.

Generally, that is not the case. We can often identify a non-constant variance in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot. In Fig.2, the funnel shape indicates that the error terms have non-constant variance.
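For intuition, here is a minimal sketch of what such a funnel looks like, using simulated data whose noise grows with the predictor (the setup is invented purely for illustration):

import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0.1, 1, 200)
y = 2 * x + rng.normal(scale=0.3 * x)        # noise grows with x (heteroscedastic)

model = LinearRegression().fit(x.reshape(-1, 1), y)
residual = y - model.predict(x.reshape(-1, 1))

# The residuals fan out as x increases: the tell-tale funnel shape
sns.scatterplot(x=x, y=residual)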

Remedies:

One possible solution is to transform the response using a concave function such as log or square root. Such a transformation shrinks the larger responses more, consequently reducing heteroscedasticity.

Example:

Let us try to apply a concave (log or square-root) transformation to the points generated in problem 1.
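Here is a small sketch of the idea, using the square-root variant of the concave transformations mentioned above (since the response here is exactly y = x², the square root makes the relationship perfectly linear):

import numpy as np
import seaborn as sns

# Same setup as in problem 1: a purely quadratic response
x = np.random.rand(100)
y = x * x

# Concave transformation of the response: the square root turns
# y = x**2 into a straight-line relationship with x
y_transformed = np.sqrt(y)
sns.scatterplot(x=x, y=y_transformed)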

Fig.4 Concave Transformation

We can observe a linear trend after transformation. Hence we may remove non-linearity by applying concave functions.

Collinearity

Collinearity refers to a situation in which two or more predictor variables are correlated with one another. For example, height and weight, the area of a house and its number of rooms, or experience and income are often related.


In linear regression, we assume that all the predictors are independent, but this is often not the case: the predictors are correlated with each other. Hence, it is essential to look at this problem and find a feasible solution.

When the assumption of independence is neglected, the following concerns arise:

  1. We cannot infer the individual effect of each predictor on the response. Because the predictors are correlated, a change in one variable tends to produce a change in another, so the accuracy of the model parameters drops significantly (see the sketch after this list).
  2. When the accuracy of the model parameters drops, our conclusions become unreliable. We cannot tell the true relationship between the response and the predictors, and the model's accuracy also decreases.
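Here is a minimal sketch of this instability, using simulated data in which the two predictors are nearly identical copies of each other (the data are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.random(100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # almost perfectly collinear with x1
y = 3 * x1 + rng.normal(scale=0.1, size=100)

# Refit on two different subsamples: the individual coefficients can differ
# noticeably between fits even though their sum stays near the true value of 3
for seed in (1, 2):
    idx = np.random.default_rng(seed).choice(100, size=80, replace=False)
    X = np.column_stack([x1[idx], x2[idx]])
    coefs = LinearRegression().fit(X, y[idx]).coef_
    print(coefs, coefs.sum())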

Remedies:

There are two possible solutions to the problem.

  1. Drop the variable: We can drop the problematic variable from the regression. The intuition is that collinearity implies the information the variable provides, in the presence of the other variables, is redundant. Hence, we can drop it without much compromise.
  2. Combine the variables: We can combine the correlated variables into a single new variable; this is a form of feature engineering. For example, merge weight and height to get BMI (body mass index), as in the sketch after Fig.5.
Fig.5 Combining Correlated Variables
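A small sketch of the second remedy, assuming a pandas DataFrame with hypothetical weight_kg and height_m columns (the values are made up):

import pandas as pd

# Hypothetical data with two correlated predictors: weight and height
df = pd.DataFrame({
    "weight_kg": [60, 72, 85, 54, 90],
    "height_m": [1.65, 1.78, 1.82, 1.60, 1.88],
})

# Combine them into a single engineered feature (BMI = weight / height**2)
# and drop the original, collinear columns before fitting the regression
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df = df.drop(columns=["weight_kg", "height_m"])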

Outliers and High Leverage Points

Linear regression is greatly affected by the presence of outliers and high leverage points. They may occur for a variety of reasons, and their presence hugely affects model performance. This is also one of the limitations of linear regression.

Outlier: An outlier is an unusual observation of the response y for a given predictor x.

High leverage point: In contrast to an outlier, a high leverage point is an unusual observation of the predictor x.

Fig.6 Outliers and High Leverage points in a scatter plot

There are several techniques available for identifying outliers, including the interquartile range, scatter plots, residual plots, quantile-quantile plots, and box plots.

As this is a limitation of linear regression, it is vital to take the necessary steps. One option is to drop the outliers; however, this may lead to some loss of information. We can also use feature engineering to deal with outliers.
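Here is a minimal sketch of the interquartile-range rule on simulated data (the 1.5 × IQR cut-off is the usual convention, not something prescribed here):

import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 100), [8.0]])   # 100 typical points plus one outlier

# Interquartile-range rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
is_typical = (y >= q1 - 1.5 * iqr) & (y <= q3 + 1.5 * iqr)

# Dropping the flagged points removes the outlier, at the cost of losing that information
y_clean = y[is_typical]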

Summary

In this article, we have seen five problems that may arise while working with linear regression, along with the sources, impacts, and solutions for each.

Though linear regression is the most basic machine learning algorithm, it offers vast scope for learning new things. For me, these problems provide a different point of view on linear regression.

I hope understanding these problems will provide you with novel insights when you solve any problem.

You may also check the complete playlist for Linear regression.

Happy Learning!
