The following post is in collaboration with Hamed Namavari, Data Scientist at Unifund and Recovery Decision Science and guest blogger for Data Plus Science.
Visualizing a Confusion Matrix
by Hamed Namavari and Jeffrey Shaffer
This post is about types of analysis that aim to model a Boolean outcome using a continuous score and a cut-off point. Modeled scores can be converted to Boolean values based on a chosen cut-off point, i.e. scores above the cut-off point map to one value and scores below it to the other. For instance, the modeled probabilities produced by logistic regression, SVM, or deep learning algorithms are continuous scores.
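As a minimal sketch of that conversion (in Python for illustration; the scores and the 0.5 cut-off are made-up values, not from the post):

```python
# Convert continuous model scores to Boolean predictions with a cut-off.
# The scores and the 0.5 cut-off below are illustrative assumptions.
scores = [0.12, 0.48, 0.51, 0.87, 0.95]

cutoff = 0.5
predicted = [s >= cutoff for s in scores]
print(predicted)  # [False, False, True, True, True]
```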
In practice, after choosing the optimal cut-off point, the modeled Boolean outcome is typically compared to the actual Boolean outcome using a confusion matrix. For example, suppose the actual Boolean outcome is denoted by X, which has an outcome of either True or False. Similarly, the modeled Boolean outcome is denoted by X_m, with a prediction of either True or False. In this example, the confusion matrix would have the following structure.
Before visualizing the matrix, we will define its components.
True Negatives (TN)
is the number of observations to which the model correctly assigns False values.
True Positives (TP)
is the number of observations to which the model correctly assigns True values.
False Positives (FP)
is the number of observations to which the model incorrectly assigns True values since their actual values are False. This is also known as Type I error.
False Negatives (FN)
is the number of observations to which the model incorrectly assigns False values since their actual values are True. This is also known as Type II error.
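The four components above can be tallied directly from the actual and predicted vectors. Here is a minimal sketch (in Python for illustration; both vectors are made-up data):

```python
# Tally the four confusion-matrix cells from actual vs. predicted Booleans.
# Both vectors below are made-up illustrative data.
actual    = [True, True, True, False, False, False, False, False]
predicted = [True, False, True, True, False, False, False, False]

tp = sum(a and p for a, p in zip(actual, predicted))          # True Positives
tn = sum(not a and not p for a, p in zip(actual, predicted))  # True Negatives
fp = sum(not a and p for a, p in zip(actual, predicted))      # False Positives
fn = sum(a and not p for a, p in zip(actual, predicted))      # False Negatives

print(f"TP={tp} FP={fp}")  # TP=2 FP=1
print(f"FN={fn} TN={tn}")  # FN=1 TN=4
```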
Using the components introduced above, several rates can also be defined. Some of the most important ones are explained as follows:
In simple words, the accuracy rate is the ratio of the number of observations whose values are correctly assigned by the model to the total number of observations, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN).
Although reaching high accuracy is one of the main goals in data analysis, one should not evaluate the performance of a model solely based on its accuracy rate. One of the famous examples of misleadingly high accuracy is in fraud detection. Usually, in such analyses the modeler deals with an imbalanced dataset that is heavily populated by non-fraudulent observations, which will be False. Hence, the fraudulent observations, the Trues, are rare in the data set. For instance, a data set might only have 3% fraudulent records. Thus, an algorithm that models every record as non-fraudulent would end up with an accuracy rate of 97%, but would fail to identify any of the fraudulent records, which could be a very costly practice. Based on this simple example, we can see that there is a need to use other performance criteria along with the accuracy rate.
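The 97% example above can be checked numerically. A quick sketch (in Python for illustration; the dataset sizes are assumptions matching the 3% figure):

```python
# An all-"non-fraud" classifier on a 3%-fraud dataset (sizes are assumptions).
n_total, n_fraud = 1000, 30              # 3% fraudulent records
actual = [True] * n_fraud + [False] * (n_total - n_fraud)
predicted = [False] * n_total            # model every record as non-fraudulent

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / n_total
found_frauds = sum(a and p for a, p in zip(actual, predicted))

print(accuracy)      # 0.97 -- looks great...
print(found_frauds)  # 0 -- ...but no fraud is ever caught
```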
Another rate that can be extracted from the confusion matrix and used along with the accuracy rate is sensitivity. Sensitivity is defined as: Sensitivity = TP / (TP + FN).
Another rate that can be calculated is specificity. Specificity is defined as: Specificity = TN / (TN + FP).
Higher specificity and higher sensitivity mean lower Type I and Type II errors, respectively.
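Both rates fall out of the four cell counts. A small sketch (in Python for illustration; the counts are hypothetical):

```python
# Sensitivity and specificity from hypothetical cell counts.
tp, tn, fp, fn = 80, 900, 100, 20

sensitivity = tp / (tp + fn)   # true-positive rate; higher => fewer Type II errors
specificity = tn / (tn + fp)   # true-negative rate; higher => fewer Type I errors

print(sensitivity)  # 0.8
print(specificity)  # 0.9
```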
If incorrectly modeling True outcomes is highly costly, then precision is the performance criterion to look at when evaluating different algorithms. Precision is defined as: Precision = TP / (TP + FP).
For instance, in the case of identifying fraudulent records in the financial industry, if the cost of incorrectly flagging records as fraudulent is much higher than the opportunity cost of missing some fraud, then precision is the go-to rate. In this example, fraudulent records are True, and non-fraudulent records are False. Given the imbalanced cost of misidentification, higher precision is more informative than higher accuracy.
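The accuracy-versus-precision gap on imbalanced data can be illustrated with hypothetical counts (in Python for illustration; these numbers are assumptions, not from the post):

```python
# Precision vs. accuracy on imbalanced hypothetical counts.
tp, fp, tn, fn = 20, 30, 940, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)

print(accuracy)   # 0.96 -- dominated by the many easy Falses
print(precision)  # 0.4  -- most "fraud" flags are actually wrong
```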
So, is the confusion matrix confusing yet? Many might think so because:
1. When it comes to comparing the modeled outcomes versus the actual states, the confusion matrix compresses all the information into four cells of data.
2. The four cells of data can be used to create the different rates discussed above, but each of those rates is only a scalar, and again, a compressed version of reality.
3. In some cases these rates can be very misleading; remember the 97% accuracy rate example!
This is where the power of visualization can really help. Let's generate a sample dataset and visualize the confusion matrix. The following R code loads a .csv file from the Desktop that contains the output of a simulation.
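As a rough sketch of what such a simulation might produce (this is not the authors' R code; it is an illustrative Python stand-in, and the distributions, sample sizes, and file name are all assumptions):

```python
import csv
import random

random.seed(42)

# Simulate continuous scores for two classes (all parameters are assumptions):
# actual Trues score high on average, actual Falses score low.
rows = [(min(max(random.gauss(0.7, 0.15), 0.0), 1.0), True) for _ in range(300)] + \
       [(min(max(random.gauss(0.3, 0.15), 0.0), 1.0), False) for _ in range(700)]

# Write the simulated output to a .csv file, ready to load for visualization.
with open("confusion_sim.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["score", "actual"])
    writer.writerows(rows)
```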
This graph is not very informative, as it does not provide much insight into the confusion matrix. We could fix this in R, but instead, let's bring the data into Tableau to visualize. First, some quick code to export the data to CSV.
After importing this into Tableau we built the visualization below. Notice that the colors in the confusion matrix align with the colors on the histogram to help visualize the records in each segment. The dark orange are the True Positives and the dark blue are the False Positives. The light orange are the False Negatives and the light blue are the True Negatives. The light and dark orange together show the shape of the Trues, for example the fraud records.
Click on the image for the interactive version on Tableau Public, where you can set your own cut-off rate, or download the Tableau workbook here. You'll notice that as the cut-off value decreases, the False Positive rate increases and the False Negative rate decreases. We found that visualizing the confusion matrix in this manner was very helpful.
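That cut-off trade-off can also be checked numerically. A small sketch (in Python for illustration; the scores and labels are made-up data):

```python
# Lowering the cut-off raises the False Positive rate and lowers the
# False Negative rate. Scores/labels below are made-up illustrative data.
data = [(0.9, True), (0.7, True), (0.55, True), (0.45, True),
        (0.6, False), (0.4, False), (0.3, False), (0.1, False)]

def fp_fn_rates(cutoff):
    fp = sum(not a and s >= cutoff for s, a in data)   # Falses predicted True
    fn = sum(a and s < cutoff for s, a in data)        # Trues predicted False
    falses = sum(not a for _, a in data)
    trues = sum(a for _, a in data)
    return fp / falses, fn / trues

print(fp_fn_rates(0.5))   # (0.25, 0.25)
print(fp_fn_rates(0.35))  # (0.5, 0.0) -- lower cut-off: more FP, fewer FN
```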
I hope you find this information helpful. If you have any questions feel free to email me at Jeff@DataPlusScience.com
Jeffrey A. Shaffer
Follow on Twitter @HighVizAbility
Connect on LinkedIn