Accuracy Masterclass Part 1 - Choosing the “Right” Metric for Success

Finding the “right” optimization metric for a data science problem may not be a straightforward task. One may expect that a “good” model would achieve superior results on all available data science metrics, but this is a misconception: quite often it is not the case. Therefore, finding the metric most appropriate for the given objective is critical to building a successful data science model.

3 Main Learning Points

  • Get familiar with different popular regression and classification metrics
  • Find out about their strengths and weaknesses
  • See examples of when one metric may be preferred over the others

Read Transcript

All right. We have a very exciting clinic, at least for me; I hope it is for you too. Our topic today: we'll talk about metrics. This is something I have had my fair share of exposure to. Within my Kaggle tenure I have participated in over 200 challenges, so I have seen firsthand that there are different types of problems and different types of metrics used to optimize for them. I have also had the great opportunity to work with many different clients and affiliates of H2O. And it is always an interesting and very important question: which metric should we choose for different problems? Sometimes it's critical to our success that we find the right one, because the target or other elements within the data might have properties that make some metrics better than others.

But let's move on.

So regarding metrics, as I said, people struggle to find the right metric. And rightfully so: it's not very easy, it's not very straightforward. Every metric has pros and cons, and every metric gives emphasis to different properties; we will see more about this later. It's really a myth that a good model will be able to give you superior results in all available metrics. I would say it's generally true that a good model will give you fairly good results in most metrics. But if you really want to optimize for one, you should expect that performance may deteriorate in others, and that's fine. You should be able to find the right one for your problem. Sometimes it could be a combination of metrics, but on other occasions not every metric will work for you. So you really need to pick the metric to optimize for that is most linked with the problem you're trying to solve and with what's really important for you.

We start with regression metrics. Regression metrics are generally the ones that have given me more trouble in my experience. I think that's because quite often errors in this context are not bounded; you can have something very large. In that case, they're very vulnerable to outliers and other problems. So this is just my experience: I have seen more difficulty and more challenges, generally, with regression metrics. Maybe it's just me, but that's what I have seen.

We'll now dive into individual metrics and have a deeper look into them. They're always in this format: some general tips about the metric, the formula if it helps, and when to use it. The focus really will be on when to use them. Let's start with a very, very popular metric: RMSE, and MSE, which, if you just remove the square root, is the mean squared error. It is extremely popular. In fact, many algorithms optimize it by default.
The general property of this type of error is that it penalizes bigger errors because of that square. The result when you subtract the actual value from your prediction gets squared. So a bigger error will become much, much bigger, while a small error of, let's say, less than one is going to be smaller after being squared. This is the basic property of that error. Having that square also makes it more easily optimizable, more easily differentiable, so a lot of different types of algorithms and packages have an optimizer for it. It is vulnerable to outliers, because you can easily have one error that is bigger than the rest, and it can have a huge weight, especially after applying that square, so a big percentage of the total error is accumulated in a few cases. So this can get tricky. Having said that, when is RMSE the right choice for you? It is really when you cannot tolerate very big errors. You can live with a small error, even a somewhat medium one, on a number of observations, but you really want to avoid those very big ones, the very big outliers, essentially. This is how I have it in my mind: picture one error of 200 that, to you, seems more important than two errors of 100.

Then RMSE feels like a better choice. So it's really that: I think you need to consider what the error means to you and to the business you're in, the problem you're trying to solve. It's not just a modeling and a scientific question. It's also a risk and utility question: what does the error mean to you? It might be that this notion only applies when we're talking about larger errors, but when 200 feels more than twice as bad as 100, RMSE or MSE might be the error for you.

Moving on, another very popular metric, and you will see the difference from MSE: basically you don't have that square. It is the mean absolute error, MAE. It is very popular, though not as easily optimizable, because its derivative is not defined at zero. However, there are proxies that can optimize it. Still, I have seen cases where selecting RMSE as the objective can actually give you better results, even for mean absolute error. That's because a lot of these solvers are approximations; they're not perfect. It is also simple to explain: you just subtract the prediction from the target and take the absolute value. Many algorithms and packages, you name it, neural networks, gradient boosting machines, have an implementation for it. In this case, all errors are proportionally important. So when an error of 200 feels like two times 100, which is what it is mathematically, then this is the right metric for you, or could be. However, this is very often not the case, and I have seen this when interacting with people. I can easily change it if I change the background: if I told you I have infinite money in my bank, I can live with it; a 200 error truly is twice a 100 error.
Now, if I tell you that I have only $150,000 in my bank, I cannot really live with the 200 error. So I don't want my model to be giving me errors that are that high. In that case, I actually want that square to be applied, given the context, given my situation, and this applies to any business context. That's because I may have a different appetite for risk, so MAE might not be the right metric for me. That is what I want people to think about as we go through this. When an error of 200 feels like what it is, twice 100, MAE is the right metric for you. If that is not the case, for whatever reason, say you become inflexible after a certain amount or as the amount becomes bigger, then maybe MAE is not the right metric for you, and you want to switch to RMSE or something else. However, it is very popular, and you can see why: it's so simple and straightforward as well.

Now we go to something a little bit more tricky. It's a metric that is very popular among business owners. It's similar to MAE: it is the mean absolute percentage error, MAPE, only you divide by the actual value as well. Business owners really like it because it expresses the error as a percentage, and who doesn't like that? It makes errors more easily comparable across different models.
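
To make the "one error of 200 versus two errors of 100" argument about RMSE and MAE concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are illustrative assumptions, not anything shown in the talk.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: squaring makes large errors dominate."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

actual = [1000.0, 1000.0]      # hypothetical targets
two_medium = [900.0, 900.0]    # two errors of 100
one_large = [1000.0, 800.0]    # one perfect prediction, one error of 200

# MAE cannot tell the two situations apart.
print(mae(actual, two_medium), mae(actual, one_large))
# RMSE scores the single error of 200 as clearly worse.
print(rmse(actual, two_medium), rmse(actual, one_large))
```

If the single large miss is what hurts the business, the second line is the behaviour you want from your metric; if 200 really is just twice 100, the first is.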

In this case, all errors are weighted proportionally by the percentage of the error, and you can immediately see that here we really ignore the volume. If my error is 1,000 and the actual is 10,000, I have a 10% error. But if my error is 1 million and my actual is 10 million, again I have a 10% error. So it's the same contribution. Whereas if I have a prediction of, let's say, one and an actual of two, I have a 50% error. So with very small volume, I can have a much, much bigger error. You can see how this may not be the right choice for certain problems. Essentially, it is not a good fit when your target variable takes zeros: you can immediately see you would be dividing by zero, which cannot happen. There are different ways to go about it. Quite often we add a constant value, and different optimizers have approximations to try to deal with it. You may even exclude all the cases where the actual is zero from the optimization. But if you have negative values, zero values, and also a very wide range and high standard deviation, it really becomes very difficult to optimize for this metric. You can easily see that if my target for tomorrow can be anything, it could be one, or 1 million, or 10 million, and let's say I'm predicting 1 million but the actual is one, which could happen in the stock market, I can have a really huge MAPE. I can have one observation that is basically a huge, huge outlier, and it can really affect my whole model. So it's not really ideal in these kinds of situations.
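
The arithmetic in the passage above can be checked with a small sketch. The `mape` helper below is a hypothetical plain-Python implementation that reports the error in percent and refuses zero actuals; real packages handle zeros in different ways, so treat the `ValueError` as one possible convention.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        # Dividing by a zero actual is undefined; raising is one possible policy.
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Same percentage error regardless of volume.
print(mape([10_000], [9_000]))
print(mape([10_000_000], [9_000_000]))
# Small volume: predicting 1 when the actual is 2 is already a large percentage.
print(mape([2], [1]))
# One extreme case (actual 1, predicted 1 million) dwarfs everything else.
print(mape([1], [1_000_000]))
```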

So, to use MAPE, you really need to look at the target variable. I understand it's easy to explain. But if you have lots of zeros, or a very big range with a high standard deviation, it's really a very difficult metric to make work. Ideally, you want positive values that are away from zero, and ideally not a very wide range; you want your range to be centered on certain positive values. However, if you want to explain the results to stakeholders, or even to compare models in different industries or areas or levels, the percentage makes the result more easily consumable, I'd say. I think that's why this metric is also very popular in practice.

Where MAPE may fail you, because of some of the reasons that I just mentioned, SMAPE may help you. Basically the difference is that you add the prediction to the denominator. Now you get a cap on the maximum, so you cannot get that situation where one single case has a huge contribution. I think the maximum is 200%. And you're basically safe from a huge outlier ruining your model. The issue you face this time is that by always adding the prediction, you're making it too easy for the optimizer; I'm not sure if I can use this term, but hopefully it makes sense in this context. Any model you build may become too insensitive to target fluctuations. You're really helping it a lot. And sometimes you want your machine learning algorithms to go deeper, to be forced to deal with outliers and learn how to identify them, if possible. So SMAPE has this negative side, but on the positive side, you cannot have a few observations ruining your model.
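
A minimal sketch of the cap described above, assuming the common SMAPE variant whose denominator is the average of the absolute actual and the absolute prediction (other variants exist, and the talk does not show which one is meant). The helper name and the 0-versus-0 convention are assumptions.

```python
def smape(actual, predicted):
    """Symmetric MAPE in percent; each observation contributes at most 200."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Convention assumed here: actual 0 and prediction 0 count as no error.
        total += 0.0 if denom == 0 else abs(a - p) / denom
    return 100.0 * total / len(actual)

# The case that blew up MAPE is now capped just under 200%.
print(smape([1], [1_000_000]))
# A zero actual no longer breaks the metric, but it always scores the maximum.
print(smape([0], [5]))
```
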
So when can we use SMAPE? Basically when you cannot make MAPE work, and you really want to produce something that can be expressed as a percentage, to be explained and used by stakeholders.

R-squared is an interesting metric. Where I think it's very nicely applicable, and it's the same context where MAPE is applicable, is when you want to compare different models. If you look at the formula, what you're basically doing is comparing the error of the model that uses your predictions against the error of a very simple model that only uses the mean. Using the mean is the simplest prediction we can make, right? If I try to predict what the age of a student will be, maybe I just take the average age of students in the school, and that's my prediction. In many cases that could be a credible prediction. Or I can use my machine learning model. So how much better is my model than that very basic model that just uses the mean? This is essentially the R-squared metric: how much better my model is than a baseline. That way we can compare models across different industries or different departments. For instance, if I tell you the average error is 10, this doesn't mean anything without context. If the target is how many children somebody is going to have and I get an average error of 10, that's a very big error, right? That would not be good.

Or if the target is your annual salary, then an error of $10 is very good. If I can predict the annual income of somebody to within plus or minus $10, that's a very good prediction. So it really depends on the context how good an error value is, and this is what R-squared offers: a way to compare models on different levels. How much better is my model than an average prediction? This is what this metric essentially says. Obviously, the pitfall is that it does not tell you what the average error is. You don't have these $10 or these 10 children. You get a value that is at most one and can actually go to minus infinity, which is not very nice. Hopefully, most of the time models are better than an average prediction, so this doesn't happen often, but still. So you get an idea of how good the model is. If I tell you R-squared is 0.7, in any context it feels like this is a good model. If I tell you it's 0.99, it feels like a very good model. Again, context matters. But you have a way to compare and understand, generally, if a model is good or not; not necessarily if it is useful, but if it is much better than a baseline. As you can see, because it uses the squared error, it is optimizable by MSE and RMSE solvers, which makes it convenient. So when you want to compare models across different industries or departments or levels, and you want a general idea of whether your model is good against the baseline, this is the metric that you use. Often you have to accompany it with something else that measures the volume of the error as well.
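
The "how much better than the mean" reading of R-squared can be sketched in a few lines. This is an illustrative plain-Python version with made-up student ages; it assumes the targets are not all identical, since the baseline error would then be zero.

```python
def r_squared(actual, predicted):
    """1 minus (model squared error / squared error of always predicting the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_model = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_baseline = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_model / ss_baseline

ages = [10.0, 12.0, 14.0, 16.0]            # hypothetical student ages
mean_only = [13.0, 13.0, 13.0, 13.0]       # the baseline itself
decent_model = [10.5, 12.0, 13.5, 16.0]
poor_model = [30.0, 30.0, 30.0, 30.0]

print(r_squared(ages, mean_only))     # no better than the baseline
print(r_squared(ages, decent_model))  # close to one
print(r_squared(ages, poor_model))    # far below zero: worse than the baseline
```

Note that none of the three numbers says how many years off the predictions are, which is the pitfall mentioned above.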

Another case: sometimes you can use the Pearson correlation, and you can even square it as a proxy for R-squared; in some cases they can be the same. It is slightly different because it measures the degree to which the predicted value and the actual value move in unison. It offers an alternative to the infinite negative value the previous metric can give you, as it is bounded between minus one and one. At minus one you have a perfect negative correlation, so your prediction and the actual move in opposite directions proportionally; at one, both move in the same direction proportionally. It ignores the size of the error, and I'm giving you an example here. In one case, I have a prediction that follows the actual very closely. In another case it doesn't, but proportionally it does: it's always almost twice the actual. So if one increases, the other increases as well by a proportional amount. In both cases the correlation is the same. That is the pitfall of this metric. As you can see so far, there is no metric that can give us everything. That is why you have to make a selection, and quite often you might need multiple metrics. So it can be an alternative to R-squared, especially when you don't want to deal with that infinite negative value. But really what it offers is a way to understand the correlation between your prediction and the actual, again a different way to see if your model is working in general.

RMSLE is also popular in certain industries, especially when we model counts and positive values. The way I describe this metric is as a compressed version of RMSE. Again, as we move higher, the errors become more important, but not at the same scale.
That is, not at the scale we had with RMSE. RMSLE can be a good choice because the logarithm offers that compression, and that's the best way to describe it. So it becomes a version of RMSE that is less vulnerable to outliers. You can see here how the target transforms. Here you can see counts of visitors over time, for instance, and what this distribution looks like. If you have an explosion at some point, RMSE will most definitely give a very, very high weight here. The errors here are probably going to be higher, and the focus of the error is likely going to be here, depending on the prediction. However, look how the target transforms if we take the logarithm: more importance is given to the lower values as well, so everything gets compressed. Again, this region is going to be more important if the error is higher, but everything gets compressed. So it's a version that can give us more protection against outliers. It can also make the model more insensitive, so it can suffer from the same thing that makes SMAPE less desirable than MAPE: it can make your algorithms less sensitive to fluctuations of the target and probably less capable of capturing outliers. But at the same time, your model is probably less likely to fail. Quite often, the way we optimize this is to change the target: we add one to the target, take the logarithm, and then we actually optimize for RMSE. This is what happens internally very often. So it becomes easily optimizable with a proxy. So we saw some metrics; obviously these are not all, these are some popular ones.
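
Two points from the passage above lend themselves to a quick sketch: Pearson correlation is blind to a prediction that is consistently twice the actual, and RMSLE is simply RMSE computed after adding one and taking the logarithm. The helpers and the visitor counts below are hypothetical, written in plain Python for illustration.

```python
import math

def rmse(actual, predicted):
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def rmsle(actual, predicted):
    """RMSE on log(1 + value); assumes non-negative values such as counts."""
    squared = sum((math.log1p(a) - math.log1p(p)) ** 2
                  for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

actual = [10.0, 20.0, 30.0, 40.0]
doubled = [2.0 * a for a in actual]           # always twice the actual
print(pearson(actual, doubled), rmse(actual, doubled))  # perfect correlation, large error

visitors = [10.0, 12.0, 15.0, 5000.0]         # made-up counts with one spike
forecast = [12.0, 10.0, 18.0, 2500.0]
print(rmse(visitors, forecast))               # dominated by the spike
print(rmsle(visitors, forecast))              # compressed by the logarithm

# The proxy trick: transform the values, then use ordinary RMSE.
log_visitors = [math.log1p(v) for v in visitors]
log_forecast = [math.log1p(f) for f in forecast]
print(rmse(log_visitors, log_forecast))       # same value as rmsle above
```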


I think this is a good time to ask this question again. You saw how we challenged some of these metrics and how some could be more suitable than others, for instance based on risk appetite, or on the application, or on your boss, or on your business problem or type. So do you feel more confident now, after seeing this, in finding which metric will be more suitable for you? Maybe I should pause here a little bit and wait for people to answer. Okay, that's encouraging, I can see results. We still have two people out of 13 that don't feel very confident. It will be interesting to see more questions from these people later, as this is what I want people to take away from this: I want to help people pick the right metric. Let us move on for now, and hopefully we can gather questions later on.

Let's move on to classification metrics. I have found classification metrics, from my experience, a bit more straightforward. That's quite often because predictions are bounded, so they could be between zero and one. Or they could be metrics like accuracy that, again, are more straightforward. They're certainly less prone to outliers and more easily understood. There are obviously exceptions, but that's my experience: classification metrics are a bit easier and more straightforward to deal with, and also quite popular in practice.

A type of error that often appears is the binary cross entropy. What this metric tries to answer is: how close is my prediction to the actual? In a sense, we try to measure the distance between our prediction and the actual. We try to get a prediction close to the actual, and in this case that prediction is quite often a probability, so we are looking for a value that is between zero and one.
So we try to make that prediction as close as possible to the actual. In that sense, a model optimized for it quite often gives you predictions that can be interpreted as probabilities. It is very popular and it is optimizable; most packages will have an implementation for it. We should really use it when we care about getting an estimate of a probability, when we want to get as close as possible to the probability of something happening. We want to know whether there is a 55% chance of you buying something that I'm going to offer or advertise, or 60%, or 65%. When I want to know that, this is when I want to use this metric in practice. Why would that be important? Imagine I work in the loan default industry, and I know that if I have a bad customer who does not pay me back, I'm going to incur some loss, and I need at least two good customers who will pay me back to cover that loss. I have already estimated this. So I know that, if I want to make a profit, I want my customers to have more than a 60% chance of paying back, because I need two good customers to cover one bad customer. If I want profit, I need at least about a 66% probability for somebody to pay back. In that case the probability becomes important, and this is the type of loss that I want, based on my business problem.
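
The break-even figure in the lending example follows from simple expected-value arithmetic, sketched below together with a plain-Python binary cross entropy. The function names, the clipping constant `eps` and the unit gain and loss amounts are assumptions made for illustration.

```python
import math

def binary_cross_entropy(actual, prob, eps=1e-15):
    """Average log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(actual, prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(actual)

def break_even_probability(gain_per_good, loss_per_bad):
    """Smallest repayment probability p with p * gain - (1 - p) * loss >= 0."""
    return loss_per_bad / (gain_per_good + loss_per_bad)

# One bad customer costs as much as two good customers earn.
print(break_even_probability(gain_per_good=1.0, loss_per_bad=2.0))

# Predictions closer to the actual outcome get a lower loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.6, 0.4]))
```

With these assumed amounts the threshold is two in three, which is why a well-calibrated probability, rather than just a ranking, matters for this decision.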

Next we go to a series of metrics, accuracy, precision, recall, and others that use different combinations of the elements of the confusion matrix. Even today that confusion matrix sometimes confuses me as well, as I can mix up the entries. Let's look at accuracy. These metrics can be simple, but at the same time I think that by themselves, if you look at one at a time, they are not telling you the whole picture, and you can easily see why. Accuracy, obviously, is very important, very easily understood, and expressed as a percentage. It basically answers: out of the total predictions we made, how many did we get right? It's not very easily optimizable; the formula is not easy to optimize directly, though there are different approximations for it. Most of the time you optimize for a metric like the binary cross entropy that you saw before (which, I forgot to mention, can also be adapted to multi-class problems with small modifications). Then you have a second step to find the right cutoff to optimize for accuracy. Say, if my probability or my score is higher than x, then I'm going to deem it a positive case, and I find the threshold that maximizes accuracy. This is how it is used in practice. So we have this two-step approach: one step that may optimize for a different metric, and then a second step to find the threshold that optimizes accuracy.

Where accuracy can be problematic, where it is not very insightful, is when you have a very imbalanced problem. Consider winning the lottery. Very, very few people are going to win the lottery, right? So if I predict that nobody is going to win the lottery, I'm going to get 99.99% accuracy or even more, because almost nobody wins: only a few people out of all the people that actually play.
So you can imagine, you see 99.9% accuracy and you say, whoa, that's a pretty good model; but here, no, it's not. It really depends on the distribution of your target and how imbalanced your problem is. You immediately saw a case where accuracy may not be a good metric for you: when the target is very skewed, heavily imbalanced towards one class, it might not be a good choice. On the positive side, it can be expressed as a percentage. You saw the question it answers: out of the total predictions, how many did we get right? That is easily consumable, very easily understood, and that's why it is quite often sought.

Now we'll go to a more specialized case, precision, which gives focus to the positive case. When we try to optimize for precision, we try to answer how accurate we are specifically when we say that an event is going to happen. Again, this is not easily optimizable. We take an approach similar to accuracy: we optimize for something else and then we find a threshold. Where might this not be a great choice? Again, in the lottery prediction, precision is zero, because we never actually predict that the event will happen. If we want high accuracy in the previous problem, we logically say that nobody is going to win the lottery. Because we never made the prediction that someone is going to win, our precision is zero in this case. So this metric essentially ignores all the times that we did not predict that the event will happen; it doesn't give any focus there.
So it is, in that sense, narrowly focused. Something else you need to note is that you can start getting higher precision if you move that threshold higher up, but that comes at the cost of missing positives. Because you want to be very certain that when you say something is positive it is actually positive, you're going to put that threshold so high that you miss a lot of cases. This is why precision, again, by itself might not give you what you want.
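
The two-step approach described above (fit scores with some other loss, then search for a cutoff) can be sketched as follows. The scores, labels and helper names are made up for illustration; the search simply tries every observed score as a threshold.

```python
def confusion_counts(actual, scores, threshold):
    """Return (tp, fp, tn, fn) when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(actual, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy_at(actual, scores, threshold):
    tp, fp, tn, fn = confusion_counts(actual, scores, threshold)
    return (tp + tn) / len(actual)

def best_accuracy_threshold(actual, scores):
    """Second step: pick the observed score that maximizes accuracy."""
    return max(sorted(set(scores)), key=lambda t: accuracy_at(actual, scores, t))

# Hypothetical model scores (step one is assumed to have happened already).
actual = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.05, 0.2, 0.3, 0.45, 0.55, 0.6, 0.8, 0.95]

best = best_accuracy_threshold(actual, scores)
print(best, accuracy_at(actual, scores, best))

# Raising the cutoff improves precision but finds fewer of the positives.
for threshold in (best, 0.8):
    tp, fp, tn, fn = confusion_counts(actual, scores, threshold)
    print(threshold, "precision:", tp / (tp + fp), "positives found:", tp, "missed:", fn)
```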

Quite often we might compute precision at different ranks. It is very often used in recommenders, because we might offer people a list of items, and we want to see how good our model is at getting a high average precision on, let's say, the first five items we present, or the first 10, or the first 20. So we're interested in a whole list, or in that case in different ranks. So when do we use precision? When we are mostly interested in how accurate we are when we predict that an event will happen, so on the positive side. For example, say we are building a system for credit that decides if we should increase the credit limit on a particular account. We want to be very, very sure in that case in order to avoid fraud, again depending on the problem. This is a typical case where we might look at precision, and, as I mentioned, it is very popular in recommenders as well: given that I predicted that customers will buy an item, how accurate am I on that subset that I said is going to buy the item?

Now let's look at a different case, recall. Whereas precision looks at the cases that we predicted are going to be one, that are going to take a positive value, recall looks across all positive values: out of all the actual positives, how many did we find? This complements what I mentioned before: we can get high precision by increasing the threshold above which we accept that something is positive, but we can lose many of the positives, and recall comes to tell us how many we actually lose. So it complements precision in that sense. Obviously, it ignores all the times the target was actually zero. In the lottery example, we never predicted a true positive, so again, recall is zero.
And check this now: although accuracy was over 99% when we predicted that no one was going to win the lottery, precision was zero and recall was zero. So that's not good. You see, by looking at all three metrics, we now know that this 99.99% accuracy was not great, because precision and recall were zero. That's why it's nice to look at these metrics in tandem. Sometimes you want to look at all of these metrics, or at the trade-off between precision and recall. As I mentioned, you can increase the threshold and get high precision, but you will be lowering recall, and vice versa. Sometimes you want to find the right threshold that works for you, and it could be based on the type of business you operate and the cost of positive versus negative. I think that's the most important question when you're working with precision and recall: the cost of positive versus negative. In banking, if you know what margin of profit you want, and how much a good customer gives you versus how much you lose from a bad customer, then you know which ranges you want to target.
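
The lottery example can be reproduced in a few lines. This sketch assumes 10,000 players with a single winner and a "model" that predicts nobody wins; precision with no positive predictions is strictly 0/0, and reporting it as zero is a convention adopted here to match the talk.

```python
def accuracy_precision_recall(actual, predicted):
    pairs = list(zip(actual, predicted))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # 0/0 reported as 0 by convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

players = 10_000                        # assumed number of players
actual = [1] + [0] * (players - 1)      # exactly one winner
nobody_wins = [0] * players             # always predict "no win"

accuracy, precision, recall = accuracy_precision_recall(actual, nobody_wins)
print(accuracy, precision, recall)
```

Looked at alone, the accuracy figure is impressive; the other two numbers show that the model never finds a winner.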

I see that I already moved to the F1 score. The F1 score looks at both precision and recall. It is for when we want to meet somewhere in the middle: we don't want to focus only on precision, we don't want to focus only on recall, we want something that gives us good results when we look at both metrics. The F1 score is essentially the harmonic mean of precision and recall. It assumes equal weights. Do note that equal weights might not be what we want; we may want to put more focus on precision, or more focus on recall, and that's why there are different versions of the F score that can do that. Also, if we look at the lottery example, the F score there is again going to be zero. So this is another way to know that this 99% accuracy was basically just because it's an extremely skewed problem, not because our model is really that good. So when to use this metric is when we care about both precision and recall at almost equal levels. Imagine you're a police inspector and you want to catch criminals. You want to be as sure as possible that a person you say is a criminal is indeed a criminal, so you want high precision. But at the same time, you want to capture as many criminals as possible. Ideally you want to capture them all, but that might not be feasible. You are okay to make some wrong predictions, but you try to get as many as possible, right? So you care about recall as well. This could be a case where the F1 score is what we're looking to optimize.

AUC is another metric, one that offers a way to look at performance across multiple possible cutoffs. When we look at precision, recall and accuracy, we look at one specific cutoff; we normally try to find the cutoff that optimizes these metrics for us.
AUC, by contrast, basically plots recall, or sensitivity, against one minus specificity for all possible cutoffs that can come out of our model, so for all the possible predictions that we've made. It looks at the area under that curve. It looks at how well our model discriminates in general: how well a higher score is indeed associated with a positive value, and how well a lower score is indeed associated with a negative value. So it measures how good that discrimination is if we use all possible cutoffs as our threshold point, where I say a prediction will be deemed positive or negative.
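
Two sketches for the metrics just described, both in plain Python with hypothetical helper names. `f_beta` is the weighted harmonic mean behind the F1 score (beta above one leans toward recall, below one toward precision). `auc` does not draw the curve; it uses the equivalent pairwise reading of the area, namely the fraction of positive/negative pairs in which the positive case gets the higher score, with ties counted as half.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(actual, scores) if y == 1]
    negatives = [s for y, s in zip(actual, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

print(f_beta(0.8, 0.5))             # F1: between the two, pulled toward the lower one
print(f_beta(0.8, 0.5, beta=2.0))   # leans toward recall
print(f_beta(0.8, 0.5, beta=0.5))   # leans toward precision
print(f_beta(0.0, 0.0))             # the lottery model

labels = [0, 0, 1, 1]
print(auc(labels, [0.1, 0.2, 0.8, 0.9]))   # every positive outranks every negative
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))   # scores carry no ranking information
print(auc(labels, [0.9, 0.8, 0.2, 0.1]))   # ranking is exactly backwards
```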


Because it does that, similar to binary cross entropy, it doesn't require a threshold. Therefore, it makes it easy to compare different models. An AUC of one means that we can perfectly discriminate: there is a cutoff above which all values will be positive and below which all values will be negative. So it offers perfect discrimination. Whereas if we get a value of 0.5, we're talking about a random model; whether you use the model or you make a random guess is one and the same. So this is really a ranking metric, and it is extremely popular because, as I explained, it's easy to compare. The value is bounded, most commonly between 0.5 and one; it can take values down to zero, but that means the model moves in the negative direction. So when you are interested in the ranking, this is when you should pick this metric. For example, imagine you have a debt collection agency, and you know that no matter what happens, you are going to call 1,000 people and try to collect money, but you receive, let's say, 3,000 cases every day, so you cannot call them all.

In that case you are going to make these 1,000 phone calls anyway, so the probability does not really matter. If you have a way to rank these 3,000 cases from the most likely to pay after a phone call to the least likely, that is basically what you want. You're going to take the top 1,000; you don't care about the probability or anything else. And if these 1,000 indeed have a higher chance of paying back than the remaining 2,000, you already have a very successful model that gives you profit. So in that case we're interested in the ranking. We are not as much interested in the probability, and we're not interested in finding a threshold where we maximize precision or recall; we just want to rank and take those cases that are more likely to pay back.

So this is what I had. Obviously, this doesn't cover everything. It covers some popular metrics that I have encountered, both in modeling and data science, but mostly in industry, and I tried to give you some insight into where you use one versus the other, and what strengths and weaknesses they come with. Quite often the answer might not be one metric; it could be multiple. It's very possible that one metric cannot give you everything. However, there could be a series or types of metrics that will be more suitable for your use case than others. So thank you very much for your time.
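
The debt collection example comes down to sorting by score and keeping the top 1,000. The sketch below uses synthetic, deterministic scores and a hypothetical `top_k_cases` helper; it also shows that rescaling the scores, so that they no longer look like probabilities, leaves the call list unchanged.

```python
def top_k_cases(case_ids, scores, k):
    """Return the ids of the k highest-scoring cases, best first."""
    ranked = sorted(zip(case_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [case_id for case_id, _ in ranked[:k]]

case_ids = list(range(3000))                          # 3,000 incoming cases
scores = [(i * 37 % 3000) / 3000 for i in case_ids]   # synthetic, all distinct

call_list = top_k_cases(case_ids, scores, 1000)

# Any order-preserving change to the scores yields the same list.
rescaled = [2.0 * s for s in scores]
print(len(call_list), top_k_cases(case_ids, rescaled, 1000) == call_list)
```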