(Fig.

Results of a multi-class classification test (our example). In this section, two examples are introduced. Powers, Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation 2 (1) (2011) 3763. features are chosen carefully (and if they are weighted carefully in training set is retained in its entirety as a description of the object [11]H. He, E.A. Assume we have two classes, i.e., binary classification, P for positive class and N for negative class. From the figure, the values of TPA,TPB, and TBC are 80, 70, and 90, respectively, which represent the diagonal in Figure 4. The point A, in the lower left corner (0,0) represents a classifier where there is no positive classification, while all negative samples are correctly classified and hence TPR=0 and FPR=0. These input patterns are called training data which are used for training the model. In other words, instead of generating ROC points in Algorithm 1, Algorithm 2 adds areas of trapezoids5 of the ROC curve [4]. [19]S. Shaikh, Measures derived from a 2 x 2 table for an accuracy of a diagnostic test, J. Biometr. This is because (1) the recall increases by increasing the threshold value and at the end point the recall reaches to the maximum recall, (2) increasing the threshold value increases both TP and FP. As shown in the table in Figure 6, the initial step to plot the ROC curve is to sort the samples according to their scores. Powers introduced an excellent discussion of the precision, Recall, F-score, ROC, Informedness, Markedness and Correlation assessment methods with details explanations [16]. objects is called the training set because it is used by the Therefore, if the data are balanced, the precision of the end point is PP+N=12. In this section, the AUC algorithm with detailed steps is explained. From Eq. This means that more positive samples have the chance to be correctly classified; on the other hand, some negative samples are misclassified. More details about SVM can be found in [24]. Positive likelihood (LR+) measures how much the odds of the disease increases when a diagnostic test is positive, and it is calculated as in Eq. approach, which has great intuitive appeal and appears to be a Similarly, any metric can be checked to know if it is sensitive to the imbalanced data or not. 233240. Senior, Guide to biometrics, Springer Science & Business Media, 2013. (Faculty of Computer Science and Engineering, Uci repository of machine learning databases, Optimal classifier for imbalanced data using matthews correlation coefficient metric, The use of the area under the roc curve in the evaluation of machine learning algorithms, The relationship between precision-recall and roc curves, Proceedings of the 23rd International Conference on Machine Learning, Theoretical analysis of a performance measure for imbalanced data, 20th International Conference on Pattern Recognition (ICPR), A simple generalisation of the area under the roc curve for multiple class classification problems, An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Adjusted f-measure and kernel scaling for imbalanced data learning, Comparison of the predicted and observed secondary structure of t4 phage lysozyme, Evaluation: from precision, recall and f-measure to roc, The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets, Measures derived from a 2 x 2 table for an accuracy of a diagnostic test, Beyond accuracy, f-score and roc: a family of discriminant measures for performance evaluation, Australasian Joint Conference on Artificial Intelligence, A systematic analysis of performance measures for classification tasks, Note on the location of optimal classifiers in n-dimensional roc space, Technical Report PRG-TR-2-99, Oxford University Computing Laboratory, Oxford, England, Chaotic antlion algorithm for parameter optimization of support vector machine, Classification of toxicity effects of biotransformed hepatic drugs using whale optimized support vector machines, Receiver operating characteristic (roc) literature research, https://doi.org/10.1016/j.aci.2018.08.003, http://creativecommons.org/licences/by/4.0/legalcode, http://www.ics.uci.edu/mlearn/MLRepository.html. The training error measures how well the trained model fits the training data. relevant to the task at hand. (9) [20]. Figure 5 shows an example of the ROC curve. [25]A. Tharwat, Y.S. not used. The point D in the lower right corner (1,0) represents a classifier where all positive and negative samples are misclassified. As shown, this curve starts from the point (0,1) vertically to (0,0) and then horizontally to (1,0), where (1) the point (0,1) represents a classifier that achieves 100% FAR and 0% FRR, (2) the point (0,0) represents a classifier that obtains 0% FAR and FRR, and (3) the point (1,0) represents a classifier that indicates 0% FAR and 100% FRR. For example, with two disadvantage of neural networks is that they are notoriously slow, reported that the AGM metric is suitable with the imbalanced data [12]. On the contrary, Negative predictive value (NPV), inverse precision, or true negative accuracy (TNA) measures the proportion of negative samples that were correctly classified to the total number of negative predicted samples as indicated in Eq. classified. Hassanien, Chaotic antlion algorithm for parameter optimization of support vector machine, Appl. The FAR is also called false match rate (FMR) and it is the ratio between the number of false acceptance to the total number of imposters attempts. Both FPR and FNR are not sensitive to changes in data distributions and hence both metrics can be used with imbalanced data [9]. disadvantages of the nearest-neighbor methods. Lett. This fact is partially true because there are some metrics such as Geometric Mean (GM) and Youdens index (YI)2 use values from both columns and these metrics can be used with balanced and imbalanced data. the best features is an important part of developing a good classifier, The values of TPR and FPR of each point/threshold are calculated in Table 1. space difficult to visualize, but there are so many different used are important and useful for classification and which are object belongs that class. 2-parameter, 2-class distribution of points with parameters x,y that This is the reason why the validation phase is used for evaluating the performance of the trained model. Since the ROC curve depends mainly on changing the threshold value, comparing classifiers with different score ranges will be meaningless. (15) [3]. Not only is the resulting high-dimensional Generally, we can consider sensitivity and specificity as two kinds of accuracy, where the first for actual positive samples and the second for actual negative samples. Moreover, this paper introduces (1) the relations between different assessment methods, (2) numerical examples to show how to calculate these assessment methods, (3) the robustness of each method against imbalanced data which is one of the most important problems in real-time applications, and (4) explanations of different curves in a step-by-step approach. Finally, concluding remarks will be given in Section 8. This is the reason why the GM metric is suitable for the imbalanced data. There are many different metrics that can be calculated from the previous metrics. Bolle, J.H. If a problem has only a few (two or three) important parameters, then For example, FPA=EBA+ECA=15+0=15, and similarly FPB=EAB+ECB=15+10=25 and FPC=EAC+EBC=5+15=20. Confusion matrices of the three classes in our experiments. Green regions indicate the correctly classified regions and the red regions indicate the misclassified regions. Theoretically, scores of clients (persons known by the biometric system) should always be higher than the scores of imposters (persons who are not known by the system). especially in the training phase but also in the application phase. In the validation phase, the validation data provide an unbiased evaluation of the trained model while tuning the models hyperparameters. Any classification method uses a set of features or one can construct a number of different decision trees and let them combinations of parameters that techniques based on exhaustive searches Vol. Moreover, based on the confusion matrix, different measures are introduced with detailed explanations. In difficult to determine how the net is making its decision. are labeled as belonging to a particular class.

networks or nearest neighbor methods. Figure 6 shows an example of the ROC curve. There are several serious data is the square's diagonal. It is possible for a lower AUC classifier to outperform a higher AUC classifier in a specific region. In multi-class classification problems, Provost and Domingos calculated the total AUC of all classes by generating a ROC curve for each class and calculate the AUC value for each ROC curve [10]. This example reflects that the accuracy, precision, NPV, F-measure, and Jaccard metrics are sensitive to imbalanced data. axis-parallel trees because they correspond to partitioning the Due to the precision axis in the PR curve; hence, the PR curve is sensitive to the imbalanced data.

17 No. reported some metrics which are used in medical diagnosis [20]. There are In other words, if the similarity score exceeds a pre-defined threshold; hence, the corresponding sample is said to be matched; otherwise, the sample is not matched. Consequently, This can be interpreted as that the metrics which use values from one column cancel the changes in the class distribution. The accuracy can also be defined in terms of precision and inverse precision as follows [16]: The likelihood ratio combines both sensitivity and specificity, and it is used in diagnostic tests. Process. [8]T. Fawcett, An introduction to roc analysis, Pattern Recogn. It is used to make a balance between the benefits, i.e., true positives, and costs, i.e., false positives. According to the number of classes, there are two types of classification problems, namely, binary classification where there are only two classes, and multi-class classification where the number of classes is higher than two. The DOR metric represents the ratio between the positive likelihood ratio to the negative likelihood ratio as in Eq. In this section, an experiment was conducted to evaluate the classification performance using different assessment methods. Hence, the Geometric Mean (GM) metric aggregates both sensitivity and specificity measures according to Eq. On the other hand, in continuous output classifiers such as the Naive Bayes classifier, the output is represented by a numeric value, i.e., score, which represents the degree to which a sample belongs to a specific class. Thus, the ratio of positives and negatives defines the baseline. Lopez et al. star-galaxy classification problem on digitized photographic plates. Connell, S. Pankanti, N.K. points is high, very many steps may be required. Published by Emerald Publishing Limited. Most of our work has concentrated on oblique decision trees using the Odewahn et The classification performance is represented by scalar values as in different metrics such as accuracy, sensitivity, and specificity. Moreover, the gray shaded area is common in both classifiers, while the red shaded area represents the area where the B classifier outperforms the A classifier. In the first case which is the optimistic case, all positive samples end up at the beginning of the sequence, and this case represents the upper L segment of the rectangle in Figure 5. If the search algorithm used to construct the decision (2) [20]. All of the above methods can be modified to give a probabilistic realistic problems. It is the green curve which rises vertically from (0,0) to (0,1) and then horizontally to (1,1). This can be achieved by identifying an unknown sample by matching it with all the other known samples. Both LR+ and LR are combined into one measure which summarizes the performance of the test, this measure is called Diagnostic odds ratio (DOR).

The already know what the parameters mean? The values of FRR and FAR of each point/threshold are calculated in Table 1. Sci. The AUC score is always bounded between zero and one, and there is no realistic classifier has an AUC lower than 0.5 [4,15]. The discrete output that is generated from a classification model represents the predicted discrete class label of the unknown/test sample, while continuous output represents the estimation of the samples class membership probability. Predictive values (positive and negative) reflect the performance of the prediction. Sokolova et al. In this example, a test set consists of 20 samples from two classes; each class has ten samples, i.e., ten positive and ten negative samples. Two of the most commonly used measures in biometrics are the False acceptance rate (FAR) and False rejection/recognition rate (FRR). Adapting Classification Methods to Noise in Data. of the parameter space rapidly become computationally infeasible. However, some metrics which use values from both columns are not sensitive to the imbalanced data because the changes in the class distribution cancel each other. single parameter is compared to some constant. In that tests, not all positive results are true positives and also the same for negative results; hence, the positive and negative results change the probability/likelihood of diseases. Visit emeraldpublishing.com/platformupdate to discover the latest news and updates, Answers to the most commonly asked questions here. t3: The threshold value decreased as shown in Figure 8b) and as shown there are two positive samples are correctly classified. An illustrative example of the ROC curve. Biometrics matching is slightly different than the other classification problems and hence it is sometimes called two-instance problem. [HKS93, MKS94]. Optimization precision (OP): This metric is defined as follows: Jaccard: This metric is also called Tanimoto similarity coefficient. should not be classified as either stars or galaxies, but should be [4]A.P. If the This paper introduces a detailed overview of the classification assessment measures with the aim of providing the basics of these measures and to show how it works to serve as a comprehensive source for researchers who are interested in this field. The number of true positives of the second point is zero: In this case, since the second point is (0,0), the first point is also (0,0). It is also suitable with imbalanced data. In this experiment, we used Iris dataset which is one of the standard classification datasets and it is obtained from the University of California at Irvin (UCI) Machine Learning Repository [1]. On the contrary, more negative samples are misclassified and this increases FP and reduces TN. homogeneous regions where the objects are of the same classes. The true dividing line for the simulated

Precision and recall metrics are widely used for evaluating the classification performance. 168-192. https://doi.org/10.1016/j.aci.2018.08.003, Published in Applied Computing and Informatics. Tharwat, A.

In Section 7, results in terms of different assessment methods of a simple experiment are presented. Hence, this curve shows the relation between FAR and FRR. The values of different classification metrics are as follows, Acc=70+8070+80+20+30=0.75,TPR=7070+30=0.7,TNR=8080+20=0.8,PPV=7070+200.78,NPV=8080+300.73,Err=1Acc=0.25,BCR=12(0.7+0.8)=0.75,FPR=10.8=0.2,FNR=10.7=0.3,Fmeasure=270(270+20+30)=0.74,OP=Acc|TPRTNR|TPR+TNR=0.75|0.70.8|0.7+0.80.683,LR+=0.710.8=3.5,LR=10.70.8=0.375,DOR=3.50.3759.33,YI=0.7+0.81=0.5, and Jaccard=7070+20+300.583. Hence, changing the ratio between the positive and negative classes changes that line and hence changes the classification performance. From this figure, the following remarks can be drawn. In that paper, only eight measures were introduced. For example, if FAR=10% this means that for one hundred attempts to access the system by imposters, only ten will be succeeded and hence increasing FAR decreases the accuracy of the model. [3]S. Boughorbel, F. Jarray, M. El-Anbari, Optimal classifier for imbalanced data using matthews correlation coefficient metric, PLoS One 12 (6) (2017) e0177678. This variant represents the weighted harmonic mean between precision and recall as in Eq. are able to classify objects well even when the distribution of objects (a) ROC curve, (b) Precision-Recall curve. It is The paper aimed to give a detailed overview of the classification assessment measures. Classification techniques have been applied to many applications in various fields of sciences. As shown in Algorithm 2 the AUC score can be calculated by adding the areas of trapezoids of the AUC measure. that object. Mathematically, this means that at each node a In other words, the PR curves and their AUC values are different between balanced and imbalanced data. Consequently, it is hard to determine which of the image features being that has a random value for all objects (so that it does not separate An illustrative example is introduced to show (1) how to calculate these measures in binary and multi-class classification problems, and (2) the robustness of some measures against balanced and imbalanced data. Similarly, the results of the other two classes can be calculated. Likelihood ratio measures the influence of a result on the probability. parameters to characterize each object, where these features should be This means that the given learning model identifies positive samples better than negative samples. Figure: Sample classification problem with only two features. The value of F-measure is ranged from zero to one, and high values of F-measure indicate high classification performance. The samples are classified into the positive class if their scores are higher than or equal the threshold; otherwise, it is estimated as negative [8]. Section 3 introduces the basics of the ROC curve, which are required for understanding how to plot and interpret it. However, the aims of sensitivity and specificity are often conflicting, which may not work well, especially when the dataset is imbalanced. Hence, if the ratio of positive to negative samples changes in a test set, the ROC curve will not change. [24]A. Tharwat, A.E. Knowledge Data Eng. Figures 7 and 8 shows how changing the threshold value changes the TPR and FPR. The same approach can be applied to They obtained good results for objects in a limited brightness range. t20: As shown in Figure 8(f), decreasing the threshold value hides the FN area. provided a set of sample objects with known classes. (4) [21]. After a series of These are called Further, in a step-by-step approach, different numerical examples are demonstrated to explain the preprocessing steps of plotting ROC and PR curves in Sections 3 and 5. 30 (7) (1997) 11451159. Hence, the same point (0.2, 0.5) means that the classifier obtained 50% sensitivity (500 positive samples are correctly classified from 1000 positive samples) and 80% specificity (8000 negative samples are correctly classified from 1000 negative samples). Sensitivity depends on TP and FN which are in the same column of the confusion matrix, and similarly, the specificity metric depends on TN and FP which are in the same column; hence, both sensitivity and specificity can be used for evaluating the classification performance with imbalanced data [9]. Additionally, from the figure, it is clear that many assessment methods depend on the TPR and TNR metrics, and all assessment methods can be estimated from the confusion matrix. 2 (2011) 14. Classification techniques have been applied to many applications in various fields of sciences. information, though -- this approach asks the classifier to discover For example, consider a simple The horizontal line which passes through PP+N represents a classifier with the random performance level. The analysis of such metrics and its significance must be interpreted correctly for evaluating different learning algorithms. rather slow if the training set has many examples. From Figure 7 it is clear that the ROC curve is a step function. order to determine what their classes are likely to be. Values of TP,FN,TN,FP,TPR,FPR,FNR, precision (PPV), and accuracy (Acc in %) of our ROC example when changes the threshold value. A trapezoid is a 4-sided shape with two parallel sides. As the threshold is further reduced to be 0.8, the TPR is increased to 0.2 and the FPR remains zero. Another problem with the accuracy is that two classifiers can yield the same accuracy but perform differently with respect to the types of correct and incorrect decisions they provide [9]. Biophys. This metric is defined as follows: Markedness (MK): this is defined based on PPV and NPV metrics as follows, MK=PPV+NPV1 [16]. These In this example, assume we have two classes (A and B), i.e., binary classification, and each class has 100 samples. This method of calculating the AUC score is simple and fast but it is sensitive to class distributions and error costs. In the training phase, the the value is smaller, the left branch is followed. There are two cases for estimating the first point depending on the value of TP of the second point. On the contrary, some clients are falsely rejected (see Figure 11 (top panel)). As a consequence, the values of TP,TN,FP, and FN are 70, 800, 200, and 30, respectively. Jaccard metric explicitly ignores the correct classification of negative samples as follows, Jaccard=TPTP+FP+FN. In this curve, as in the ROC and PR curves, the threshold value is changed and the values of FAR and FRR are calculated at each threshold. Moreover, a good investigation of some measures and the robustness of these measures against different changes in the confusion matrix are introduced in [21]. The point (0.2,0.5) on the ROC curve means that the classifier obtained 50% sensitivity (500 positive samples are correctly classified from 1000 positive samples) and 80% specificity (800 negative samples are correctly classified from 1000 negative samples). An unknown sample is classified to P or N. The classification model that was trained in the training phase is used to predict the true classes of unknown samples. The value of true negative for the class A (TNA) can be calculated by adding all columns and rows excluding the row and column of class A; this is similar to the TN in the 22 confusion matrix. 329 (7458) (2004) 168169. 3 (3) (2016) 197240. Section 5 presents the basics of the Precision-Recall curve and how to interpret it. [17]T. Saito, M. Rehmsmeier, The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets, PLoS One 10 (3) (2015) e0118432. 45 (2) (2001) 171186.

[1]C. Blake, Uci repository of machine learning databases, 1998. http://www.ics.uci.edu/mlearn/MLRepository.html. This article is published under the Creative Commons Attribution (CC BY 4.0) license. probabilities, multiplied down to the leaf nodes and summed over all The ROC curve is generated by changing the threshold on the confidence score; hence, each threshold generates only one point in the ROC curve [8]. The A class represents the positive class while the B class represents the negative class. In the PR curve, a curve above the other has a better classification performance. construction (training) phase than neural network methods, and they In opposition to this, lowering the threshold value accepts all clients and also some imposters are falsely accepted. a set of feature weights specific to that node) and the sum is compared Adding a single parameter Additionally, in a step-by-step approach, different numerical examples are demonstrated to explain the preprocessing steps of plotting ROC, PR, and DET curves. As shown, the AUC of B classifier is greater than A; hence, it achieves better performance. As shown in Figure 10, the PR curve is often zigzag curve; hence, PR curves tend to cross each other much more frequently than ROC curves. In this paper, the definition, mathematics, and visualizations of the most well-known classification assessment methods were presented and explained.

If the feature value is Thus, in the PR curve, the x-axis is the recall and the y-axis is the precision, i.e., the x-axis of ROC curve is the y-axis of PR curve [8].

The model is trained using input patterns and this phase is called the training phase. Also, from the figure, it is clear that the FP area is much larger than the area of TN. Stork, et al., Pattern Classification, vol. This metric represents the number of misclassified samples from both positive and negative classes, and it is calculated as follows, EER=1Acc=(FP+FN)/(TP+TN+FP+FN) [4]. Pattern Recogn. Acta 405 (2) (1975) 442451. methods: they often produce very simple structures that use only a few Figure 13 shows the ROC and Precision-Recall curves. The TPR increased to 0.1, while the FPR remains zero. Hart, D.G. The size of this rectangle is pnPN, and the number of errors in both optimistic and pessimistic cases can be calculated as follows, pn2PN. Visualization of different metrics and the relations between these metrics. neural networks) do not simplify the distribution of objects in The outputs of classification models can be discrete as in the decision tree classifier or continuous as the Naive Bayes classifier [7]. However, the outputs of learning algorithms need to be assessed and analyzed carefully and this analysis must be interpreted correctly, so as to evaluate different learning algorithms. Using TP,TN,FP, and FN we can calculate all classification measures.

ROC curves are robust against any changes to class distributions. [18]A. Shaffi, Measures derived from a 2 x 2 table for an accuracy of a diagnostic test, J. Biometr. Next, the values of TPR and FPR are calculated and pushed into the ROC stack (see step 6). [13]A. Maratea, A. Petrosino, M. Manzo, Adjusted f-measure and kernel scaling for imbalanced data learning, Inf. For example, the false positive in class A (FPA) is calculated as follows, FPA=EBA+ECA. al. Based on the data that can be extracted from the confusion matrix, many classification metrics can be calculated. Oblique decision trees attempt to overcome the disadvantage of It evaluates the discriminative power of the test. In other words, it is the proportion of the negative samples that were incorrectly classified. Eq. 2).

We used (1) the Principal component analysis (PCA) [23] for reducing the features to two features and (2) Support vector machine (SVM)6 for classification.

An illustrative example of the 22 confusion matrix. The accuracy can also be defined in terms of sensitivity and specificity as follows [20]: False positive rate (FPR) is also called false alarm rate (FAR), or Fallout, and it represents the ratio between the incorrectly classified negative samples to the total number of negative samples [16]. As shown in Figure 7, increasing the TPR moves the ROC curve up while increasing the FPR moves the ROC curve to the right as in t4. [7]R.O. Since the PR curve depends only on the precision and recall measures, it ignores the performance of correctly handling negative examples (TN) [16].

68 (2017) 132149. There are several ways of evaluating classification algorithms.