Many people have asked us how we assess WriteLab's accuracy. I'd like to explain the different measures we use to test our algorithms.
WriteLab identifies Writing Features that correspond to the Comments it generates. When we evaluate the effectiveness of our algorithms, we focus on two measurements: precision and recall.
Precision measures how often WriteLab is correct when it identifies a Writing Feature. An algorithm with high precision will have few false positives (instances when it identifies a Feature that is not actually present).
Recall measures the proportion of the Features present in a text that WriteLab identifies. An algorithm with high recall will have few false negatives (instances when the Feature is present but not identified).
We will evaluate the effectiveness of ineffective_nominalization, an algorithm that detects nominalizations (the noun forms of verbs, adjectives, or adverbs) that we consider ineffective. We will use simple rounded numbers in this example to illustrate our concepts.
I. The "Gold Standard"
We need some standard against which to measure our algorithm. We can either write a test text from scratch or annotate an existing text to establish our gold standard. In this test text, we have identified all the instances of the Writing Feature we are looking for, and all the instances that are not the Feature. We call these cases condition positive and condition negative.
We write a test of 100 sentences. 10 of those sentences contain ineffective nominalizations. 90 of them do not, having either what we consider effective nominalizations or no nominalizations at all. Condition positive is 10, and condition negative is 90. (Actual tests vary in length and are often several thousand sentences.)
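As a sketch, we can represent this gold standard as one boolean label per sentence (True when the sentence contains an ineffective nominalization). The labels below are hypothetical and simply reproduce the counts above:

```python
# Gold standard for the toy example: 100 sentences, of which the first 10
# contain an ineffective nominalization (condition positive) and the
# remaining 90 do not (condition negative).
gold_labels = [True] * 10 + [False] * 90

condition_positive = sum(gold_labels)
condition_negative = len(gold_labels) - condition_positive

print(condition_positive, condition_negative)  # 10 90
```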
II. Test Outcomes
After running our test set through the algorithm, we receive our results: the test outcome. WriteLab identified ineffective nominalizations in 40 of the sentences and no ineffective nominalizations in 60 of the sentences. This means that test outcome positive is 40, and test outcome negative is 60. We also need to know how these results matched the condition, or the actual occurrence of ineffective nominalizations in the test set.
Of the 40 cases where our algorithm identified ineffective nominalizations, only 10 were actually ineffective nominalizations. These are our true positives, or cases where WriteLab correctly identified the Writing Feature. The other 30 outcomes were not ineffective nominalizations, and are called false positives. WriteLab identified a Writing Feature, but it was not actually present.
All 60 of our negative test outcomes (cases where WriteLab did not detect a Feature) were condition negative (actual negative cases), so we have no false negatives and 60 true negatives.
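To make the four outcome types concrete, here is a sketch that pairs each gold label with the algorithm's output and tallies them. The predicted labels are hypothetical, arranged so that the 10 real positives fall among the 40 positive outcomes, matching the example:

```python
# Gold standard: first 10 sentences are condition positive.
gold = [True] * 10 + [False] * 90
# Algorithm output: 40 positive outcomes (the 10 real ones plus 30 spurious),
# 60 negative outcomes.
predicted = [True] * 40 + [False] * 60

true_pos = sum(g and p for g, p in zip(gold, predicted))          # Feature present, detected
false_pos = sum((not g) and p for g, p in zip(gold, predicted))   # Feature absent, detected anyway
false_neg = sum(g and (not p) for g, p in zip(gold, predicted))   # Feature present, missed
true_neg = sum((not g) and (not p) for g, p in zip(gold, predicted))  # Feature absent, not detected

print(true_pos, false_pos, false_neg, true_neg)  # 10 30 0 60
```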
A summary of our test outcomes:

Condition positive: 10, condition negative: 90
Test outcome positive: 40 (10 true positives, 30 false positives)
Test outcome negative: 60 (60 true negatives, 0 false negatives)
III. Measuring Effectiveness
With these values we can now calculate precision, recall, and other measures.
Precision is true positives / test outcome positives. ( 10 / 40 = 25% )
Our precision is 25%: only 25% of the ineffective nominalizations we identified were true ineffective nominalizations (according to our test set). Algorithms with high precision are likely to be correct when they detect a Writing Feature. At only 25% precision, our example algorithm is not very precise, and quite likely to give us an erroneous response (false positive). We only deploy algorithms with at least 90% precision, so we would not release this example algorithm until we improved it to meet our standards.
Recall is true positives / condition positives. ( 10 / 10 = 100% )
Our recall is 100%: we have correctly identified 100% of the ineffective nominalizations in the test text. With 100% recall, our example algorithm did not miss a single ineffective nominalization in our test set (no false negatives). High recall algorithms like our example are good at identifying true examples of Writing Features.
Accuracy is (true positives + true negatives) / total population. ( (10 + 60) / 100 = 70% )
Our accuracy is 70%: we gave the correct response (positive or negative) for 70% of the test set. Our algorithm identified Writing Features when they were present and didn't identify Features when they were not present most of the time.
It may be convenient to use a single measure like accuracy, but doing so can distort how we evaluate our algorithms. Accuracy doesn't capture the details of our example function in the same way precision and recall do.
Being correct 70% of the time suggests that our example algorithm is reasonably accurate. The actual situation is more nuanced. With 100% recall, our algorithm always finds all the ineffective nominalizations in a text. It performs perfectly in this regard. However, with only 25% precision, the majority (three out of four) of the Comments that a user would receive would actually be useless false positives. This is why we need a more nuanced measure than accuracy to evaluate our algorithms.
Calculating a combined measure for the algorithm: F-score
To establish a more useful combined measure of recall and precision, we can use the harmonic mean of the two, called the Fβ score.
Fβ = (1 + β²) * (precision * recall) / ((β² * precision) + recall)
A higher β value emphasizes recall and a lower β emphasizes precision. Because we privilege precision over recall, we can set β = 0.5. Our F0.5 score would be approximately 29%. Since we can adjust the β value, F-scores are flexible measurements that we can modify for different scenarios.
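A small sketch of the Fβ calculation, applied to our example's precision (25%) and recall (100%):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 weights recall.
    """
    b2 = beta ** 2
    return (1 + b2) * (precision * recall) / ((b2 * precision) + recall)

print(round(f_beta(0.25, 1.0, 0.5), 3))  # 0.294 (our example's F0.5 score)
print(round(f_beta(0.25, 1.0, 1.0), 3))  # 0.4   (the balanced F1 score, for comparison)
```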
IV. What it means
Even if an algorithm has 100% recall and detects every instance of the feature we are looking for, if its precision is low (as in our example), the algorithm will be of limited use because it spits out so many false positives. Alternatively, if we had an algorithm with high precision and low recall, we could be certain it wasn't generating erroneous Comments, but it would not detect many instances of the Writing Feature.
Precision often comes at the expense of recall, and vice versa. To increase recall, we cast a wider net to detect more cases. To increase precision, we refine our strategy to prevent false positives, often detecting fewer cases in total. We've privileged precision so far, with the thresholds currently set at 90% for precision and 25% for recall. These values might vary for different Writing Features. For example, a user might more readily forgive a false positive suggestion about Logic than a false positive suggestion about Concision. Our high threshold for precision (90%) and lower threshold for recall (25%) mean that we would rather miss a few true instances of a Writing Feature than erroneously identify false instances.
We continuously refine our algorithms to improve their performance and add to our test sets to make them more comprehensive.