Machine Learning 101: Evaluating regression models, MAE, MSE, RMSE, R-squared explained

You’ve probably already started your Data Science journey by now, and have implemented your first Linear Regression model. Great, but how do you know your model is good? How good is it? How good is it compared to other models (spoiler: there’s no right answer)? Today you’ll learn about some error metrics and how to use them to evaluate regression models.

Classification vs Regression

Machine Learning essentially deals with two kinds of problems:

  • Classification: predicting a class, for example whether a user is male or female (the two classes) given their history of purchased items.
  • Regression: predicting a value, for example the price (the value) of a used car given the model, the age, the kilometers on the odometer.

It is important to remember that Machine Learning is not magic. ML algorithms are still algorithms: multiple inputs, one output. The most important difference between a traditional algorithm and an ML one is the “experience” the ML algorithm gains during the training phase.

In Classification problems the algorithm tries to predict the class the entry will fall into. There may be two classes (as in the example above, male versus female) or more than two. The former is often called Binary Classification, the latter Multiclass Classification.

In Regression there is no class to predict, instead there is a scale and the algorithm tries to predict the value on that scale. In the example above the price is the sought value.

Distances and error metrics

Most tutorials and bootcamps out there will throw metrics at you without explaining (they imply you know) what they really represent, what intuition is behind those cryptic acronyms such as MAE or RMSE. I do believe it is essential for a data scientist or anyone working with predictive models to understand what these metrics are, what they represent and how to use them.

Back to the first question: “how do you know your model is good?”. The answer to this one is simple: “let’s measure it somehow!”. That’s when you know you need an error measurement, or more commonly a metric, to describe “how big the error is”. But how do you know how big it is? That’s when you start thinking about errors in real life.

You may remember when you learned how to use a ruler, and you mistakenly drew a line that was just 1cm shorter or longer than it ought to have been (we’ve all been there). So suppose your perfect line ought to have been 10cm, you drew it 9cm long. You made a mistake, your error was 1cm.

That 1cm is really just a distance. A distance between what it is and what it should have been. That’s it! For regression problems we’ll use distances, to quantify the error and know whether the model is good or not. But why is that?

Because at the end of the day you’re still just drawing lines, maybe in two dimensions, or maybe in n dimensions. Think of Linear Regression: isn’t it a line that your model is drawing between points? Now that we’ve cleared up the intuition behind errors and distances, let’s take a look at some metrics (that you have probably already seen).

MAE (Mean Absolute Error)

Before we take a look at the mean, let’s take a look at the Absolute Error:

$$AE = \left |x-\hat{x} \right |$$

Since the size of an error can’t be negative, the absolute value is used. Here $\hat{x}$ is what you measured and $x$ is what you expected it to be.

Back to the perfect line you never drew in school: whether you drew it 11cm or 9cm, the error is still 1cm. Now imagine you want to know how precise you are at drawing lines. You decide to draw many lines, measure each one and work out its (absolute) error. Then you take the average of those errors.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left |x_i-\hat{x}_i \right |$$

And just like that, you understand what Mean Absolute Error is, kudos! It wasn’t that difficult, right? Observe that MAE uses the same unit of measurement as the original measurements.
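As a quick check of the formula, here is a minimal Python sketch. The function name `mean_absolute_error` and the sample lengths are assumptions made up for illustration; they stand for four attempts at drawing a 10cm line.

```python
def mean_absolute_error(expected, measured):
    """Average of the absolute differences between paired values."""
    if len(expected) != len(measured):
        raise ValueError("expected and measured must have the same length")
    if not expected:
        raise ValueError("need at least one pair of values")
    total = sum(abs(x - x_hat) for x, x_hat in zip(expected, measured))
    return total / len(expected)


# Hypothetical attempts at drawing a 10 cm line (values in centimeters).
expected = [10, 10, 10, 10]
measured = [9, 11, 10, 12]

# Absolute errors are 1, 1, 0 and 2, so the mean is 4 / 4.
mae = mean_absolute_error(expected, measured)
print(mae)  # 1.0
```

The result is in centimeters, the same unit as the lines themselves.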

MSE (Mean Squared Error)

Let’s start with an example. Imagine you have to draw four lines: let x=[2, 4, 8, 12] be the real values you expect (in centimeters), and let y=[2, 3, 9, 11] and z=[2, 4, 5, 12] be two sets of measurements. You can think of them as two people drawing the lines. Who’s better at drawing lines, y or z?

Next you calculate the MAE of y and z. Surprise: 0.75(cm) each! You could very well say it’s a tie. However notice that y made three errors of 1cm and z made one error of 3cm.

Y is very precise across the whole four lines, while Z sometimes makes very big mistakes. Using MAE they look the same. You might think that you want to punish Z because he made such a big error. So how do you do this?
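The tie can be reproduced in a few lines of Python. This is only a sketch of the example above; the helper name `absolute_errors` is an assumption, not part of any library.

```python
x = [2, 4, 8, 12]  # expected lengths in cm
y = [2, 3, 9, 11]  # first person's lines
z = [2, 4, 5, 12]  # second person's lines


def absolute_errors(expected, measured):
    """Per-line absolute error, in the same unit as the inputs."""
    return [abs(e - m) for e, m in zip(expected, measured)]


errors_y = absolute_errors(x, y)
errors_z = absolute_errors(x, z)
print(errors_y)  # [0, 1, 1, 1]
print(errors_z)  # [0, 0, 3, 0]

# Both error lists add up to 3, so the averages are identical.
mae_y = sum(errors_y) / len(errors_y)
mae_z = sum(errors_z) / len(errors_z)
print(mae_y, mae_z)  # 0.75 0.75
```

The per-line errors show what the average hides: the largest single error is 1cm for y and 3cm for z.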

Remember that we measure errors as distances, and a distance can be thought of as a line (a segment). Now if you square that segment you will get an area (of the square with that segment as the side). The area of the square with side 1 is still 1, while 2 yields 4 and 3 yields 9. We’re now ready to define the Mean Squared Error:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left (x_i-\hat{x}_i \right )^2$$

Essentially the MSE is the MAE with each error squared instead of taken in absolute value. MSE punishes values that are distant from the expected value: the greater the distance, the greater the punishment. You don’t need the absolute value because squaring the difference already makes it non-negative.

Let’s now calculate the MSE for y and z from the previous example: y is still 0.75 while z is 2.25. Recall that both y and z had the same MAE: 0.75cm. So in this case the MAE of y is equal to the MSE of y, right? Wrong!

Notice that I didn’t state the unit of measurement for MSE. Can you guess what it is? The answer is $cm^2$. That is simply because you squared each error, and each unit has been squared as well. So MSE is $0.75cm^2$ for y and $2.25cm^2$ for z. Since you were drawing lines, you can’t really tell whether that is good or bad (the MSE values in this case are areas). Of course z is worse because its MSE is higher, but you still can’t quantify how big the error is compared to the lines. Takeaway: MSE doesn’t use the same unit of measurement as your data!
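The same numbers in Python, as a minimal sketch. The function name `mean_squared_error` is chosen here for illustration and is written from scratch rather than taken from a library.

```python
def mean_squared_error(expected, measured):
    """Average of the squared differences between paired values."""
    squared = [(e - m) ** 2 for e, m in zip(expected, measured)]
    return sum(squared) / len(squared)


x = [2, 4, 8, 12]  # expected lengths in cm
y = [2, 3, 9, 11]
z = [2, 4, 5, 12]

# y: squared errors 0, 1, 1, 1 -> 3 / 4
# z: squared errors 0, 0, 9, 0 -> 9 / 4
mse_y = mean_squared_error(x, y)
mse_z = mean_squared_error(x, z)
print(mse_y, mse_z)  # 0.75 2.25 (both in cm^2)
```

The single 3cm error contributes 9 to the sum, which is why z now scores three times worse than y.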

RMSE (Root Mean Squared Error)

RMSE is essentially MSE under a square root. What does it accomplish? The same as MSE but you get an error with the same unit of measurement.

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left (x_i-\hat{x}_i \right )^2}$$

Let’s now calculate RMSE for y and z from the previous example: RMSE for y is ~0.8660cm while RMSE for z is 1.5cm. Since you took the square root of the MSE, the result is back in the same unit of measurement as your data. You can now say for sure: y is the winner.
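A short Python sketch of the same calculation, using only the standard library. The function name `root_mean_squared_error` is an assumption for illustration.

```python
import math


def root_mean_squared_error(expected, measured):
    """Square root of the mean of the squared differences."""
    squared = [(e - m) ** 2 for e, m in zip(expected, measured)]
    return math.sqrt(sum(squared) / len(squared))


x = [2, 4, 8, 12]  # expected lengths in cm
y = [2, 3, 9, 11]
z = [2, 4, 5, 12]

rmse_y = root_mean_squared_error(x, y)  # sqrt(0.75)
rmse_z = root_mean_squared_error(x, z)  # sqrt(2.25)
print(round(rmse_y, 4), rmse_z)  # 0.866 1.5 (both in cm)
```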

R^2 (R-squared)

The last metric I want to show you is $r^2$, also known as the coefficient of determination. Along with RMSE it is probably the most used metric when it comes to regression problems. While MSE, RMSE and MAE can assume different values depending on the scale of the input, $r^2$ usually falls between 0 and 1, and the higher the value, the better the model. (It can drop below 0 when a model fits worse than simply predicting the mean, for example on data it was not trained on.) $r^2$ is not an error metric although it is often (mis)used as one. I won’t delve too deep into this one as you need some solid basics in statistics, but most of the time you’ll see $r^2$ described as the ratio of explained variance to the total variance.

$$r^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

$$ESS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$RSS = \sum_{i=1}^{n} (y_i -\hat{y}_i)^2$$

In short:

$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Now it may seem daunting at first, but it really isn’t. Keep in mind:

  • $y_i$ is the real value,
  • $\bar{y}$ is the mean,
  • $\hat{y}_i$ is the predicted value.

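If you look at the first formula (ESS/TSS) you can see that the numerator adds up the squared distances between each predicted value $\hat{y}_i$ and the mean $\bar{y}$, while the denominator adds up the squared distances between each real value $y_i$ and the same mean. In other words, it compares how far the predictions spread around the mean with how far the real values spread around it. The two forms of the formula agree when the model is a least-squares linear regression with an intercept; otherwise $1 - RSS/TSS$ is the form normally used. It is not an easy concept to grasp without solid statistics foundations so don’t worry for now.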
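To make the sums concrete, here is a minimal Python sketch. It assumes a simple least-squares line with an intercept, which is the setting where ESS/TSS and 1 - RSS/TSS give the same number. The data, the helper `fit_line` and the variable names are all made up for illustration.

```python
def fit_line(features, targets):
    """Least-squares slope and intercept for a single input variable."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(targets) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, targets))
    sxx = sum((x - mean_x) ** 2 for x in features)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: one input variable and the real values to predict.
features = [1, 2, 3, 4, 5]
targets = [2, 4, 5, 4, 5]

slope, intercept = fit_line(features, targets)
predictions = [intercept + slope * x for x in features]
mean_y = sum(targets) / len(targets)

ess = sum((p - mean_y) ** 2 for p in predictions)             # explained
tss = sum((y - mean_y) ** 2 for y in targets)                 # total
rss = sum((y - p) ** 2 for y, p in zip(targets, predictions)) # residual

r2_from_ess = ess / tss
r2_from_rss = 1 - rss / tss
print(round(r2_from_ess, 4), round(r2_from_rss, 4))  # 0.6 0.6
```

Here ESS is about 3.6, RSS about 2.4 and TSS is 6, so both routes lead to roughly 0.6.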

$r^2$ is the only metric presented in this article that is unitless and normally sits on a scale between 0 and 1. Because it doesn’t represent an error it is not an error metric. $r^2$ is often used as one, but the truth is it only says “how much of the variance in y your model explains”, and it can be very good (~1) or very bad (~0).


You now know the most used error metrics (plus $r^2$) in machine learning and data science for regression problems. While there are more error metrics out there, the ones presented in this article are by far the most used. When it comes to regression problems, you’ve built an understanding of what MAE, MSE and RMSE represent, what they are used for and how to calculate them. Lastly, you’re now aware of $r^2$, which is not an error metric but is often used to evaluate regression models.

Image courtesy of mark | marksei
