How to build automated essay scoring engines, and why they don’t work

In this post, you will find everything you need to build an automated essay scoring (AES) engine. You can do this at home. I’m writing this so you can understand the assumptions these systems make and how those assumptions hurt students. That way, the next time a rep tries to sell you on the magic of machine scoring, you will know exactly how that ‘magic’ works.

All AES systems use some form of Natural Language Processing (NLP, a fancy term for computer operations on a piece of text), but more recent ones rely heavily on machine learning to predict what score an essay should receive. The assumption here is that some essays are better than others, that we can condense that betterness into a single number, and that we can predict that number for new essays. Machine learning engineers call these predictions ‘classifications,’ and to perform them, they build classifiers. You’ll need three things to build your own classifier: tagged data (student essays with scores), features (quantifiable properties of a text that the algorithm considers when assigning a score), and machine learning algorithms.
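To make those three ingredients concrete, here is a minimal sketch in Python using scikit-learn. The two features (average sentence length and word count) and the scores are invented for illustration; this is not the model any particular vendor uses.

```python
# Minimal sketch of tagged data + features + algorithm, with scikit-learn.
# The feature values and scores below are made up for illustration.
from sklearn.linear_model import LogisticRegression

# Tagged data: each row is one essay reduced to two hypothetical features
# (average sentence length, word count), paired with a human-assigned score.
features = [
    [12.0, 250],
    [18.5, 410],
    [22.0, 620],
    [15.0, 380],
]
scores = [2, 3, 4, 3]  # the "tags"

# The algorithm: an off-the-shelf classifier, used exactly as it comes.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(features, scores)

# "Scoring" a new essay is just predicting a label from its feature vector.
print(classifier.predict([[16.0, 400]]))
```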

No matter the classification task, you’ll need to think hard about how to get the most out of these three factors. Unless you’re a researcher in machine learning, you’re most likely going to use algorithms out of the box, like the ones available here: http://scikit-learn.org/stable/. Tagged data is the limiting factor: for essay scoring, you’ll need a lot of essays that are reliably scored. When Google trains its text and image classifiers, its algorithms look at tens to hundreds of millions of samples in order to obtain accurate results. AES systems are trained on several hundred to a few thousand samples. This leads to overfitting.

Overfitting occurs when you have too little training data to make accurate predictions about data you haven’t seen. Consider the following example. You see two men wearing gray suits and one man wearing a white suit (data). You are asked to keep in mind the color of their suits (feature). Then you are told that the two men in gray suits are convicted felons, and the man in white is an anointed saint (tags for data). You are then shown a man wearing a gray suit and asked to determine whether he is a convicted felon or an anointed saint. Given the training data and the feature you are asked to consider, you’d probably say ‘convicted felon.’ This is the way machine learning algorithms work, and if the training data is too small or unrepresentative, the model will overfit to the training set and make bad predictions about men in suits. With AES and high-stakes writing assessment, this is catastrophic.
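Here is the suit example written as code, again only a sketch: one feature (suit color), three labeled samples, and an off-the-shelf decision tree. Notice how confidently the model generalizes from almost nothing.

```python
# The suit-color example as a classifier: three training samples, one feature.
from sklearn.tree import DecisionTreeClassifier

# Feature: suit color encoded as a number (0 = gray, 1 = white).
suits = [[0], [0], [1]]
labels = ["felon", "felon", "saint"]

model = DecisionTreeClassifier()
model.fit(suits, labels)

# A new man in a gray suit: the model declares him a felon with total
# certainty, because gray-suit-means-felon is the only pattern it has seen.
print(model.predict([[0]]))        # ['felon']
print(model.predict_proba([[0]]))  # [[1. 0.]]
```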

But suppose we really have to build this system, and we don’t have much data. What can we do? We can improve the features. Building the features is called feature engineering, and it’s something most people gloss over. Most people working on these systems lack domain expertise specific to the task, which in this case is teaching writing. In fact, most of these engineers throw the same kinds of features at all NLP tasks: sentence length, word frequency, n-gram frequency, part of speech, and lots of other syntactic and lexical elements that you can pull directly out of the Stanford parser. But good feature engineering requires domain expertise, which few in AES have. Instead, those in AES do the following: 1) throw every feature they can find (number of nouns, number of verbs, ad infinitum) into a model; 2) use an algorithm to find the most important concepts among the features by looking at the data; and 3) use those concepts to make decisions about unseen data. You may have noticed something here. The algorithm is doing the work that a domain expert would normally do, but it needs to look at the data to do so. This workflow means that any improvement in the features (aside from simply adding more of them) requires that the data be reliable. It multiplies the margins of error and reinforces the illusion of accuracy that overfitting produces.
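Here is what that feature-dumping workflow looks like in practice, sketched with spaCy and scikit-learn. The essays and scores are placeholders, and the en_core_web_sm model is an assumption (install it with python -m spacy download en_core_web_sm); the point is that the selection step can only decide what ‘matters’ by looking at the same small, scored dataset.

```python
# A sketch of "throw every feature in and let the algorithm sort it out."
# The essays and scores are placeholders; en_core_web_sm is assumed installed.
import spacy
from sklearn.feature_selection import SelectKBest, f_classif

nlp = spacy.load("en_core_web_sm")

essays = [
    "The cat sat. It was bored.",
    "The committee deliberated for hours before reaching a difficult decision.",
    "Students who revise their drafts carefully tend to write stronger conclusions.",
    "Rain fell all night, and the river rose quickly over its muddy banks.",
]
scores = [1, 2, 3, 2]  # placeholder human scores

def shallow_features(text):
    """Count surface properties; nothing here models argument, evidence, or audience."""
    doc = nlp(text)
    return [
        len(doc),                                 # token count
        len(list(doc.sents)),                     # sentence count
        sum(1 for t in doc if t.pos_ == "NOUN"),  # number of nouns
        sum(1 for t in doc if t.pos_ == "VERB"),  # number of verbs
        sum(1 for t in doc if t.pos_ == "ADJ"),   # number of adjectives
    ]

X = [shallow_features(e) for e in essays]

# The algorithm, not a writing teacher, decides which features "matter,"
# and it can only decide by looking at the small, scored dataset itself.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, scores)
print(selector.get_support())  # which shallow features survived
```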

So now, fully aware of the flaws inherent to such systems, you can go home and brew your own AES system. Remember to follow these steps (a minimal end-to-end sketch follows the list):

     (1) Get a parser:
           a.  http://spacy.io/
           b.  http://nlp.stanford.edu/software/lex-parser.shtml
     (2) Get a bunch of essays and grade them.
     (3) Write a script to import the parser and the essays.
     (4) Use the same script to parse the text, and then convert the parsed text into numbers representing language features (number of nouns in a sentence, etc.).
     (5) Add these numbers to an array.
     (6) Import machine learning algorithms from here:  http://scikit-learn.org/stable/
     (7) Feed arrays into machine learning algorithms to train them.
     (8) You’re done; now you can feed unseen data into your model and start scoring!
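Here is the minimal end-to-end sketch promised above the list. The file name essays.csv and its two columns, text and score, are assumptions; substitute whatever format your graded essays are actually in, and note that spaCy’s en_core_web_sm model must be downloaded first.

```python
# Minimal end-to-end sketch of the recipe: parse, featurize, train, score.
import csv
import spacy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

nlp = spacy.load("en_core_web_sm")  # step 1: the parser

def featurize(text):
    """Step 4: turn parsed text into numbers representing language features."""
    doc = nlp(text)
    sentences = list(doc.sents)
    return [
        len(doc),                                 # token count
        len(sentences),                           # sentence count
        len(doc) / max(len(sentences), 1),        # average sentence length
        sum(1 for t in doc if t.pos_ == "NOUN"),  # number of nouns
        sum(1 for t in doc if t.pos_ == "VERB"),  # number of verbs
    ]

# Steps 2-3: load the graded essays (assumed columns: "text" and "score").
with open("essays.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

X = [featurize(r["text"]) for r in rows]  # step 5: the feature array
y = [int(r["score"]) for r in rows]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Steps 6-7: train an off-the-shelf algorithm on the feature arrays.
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Step 8: "score" unseen essays, and notice how little the model has seen.
print("held-out accuracy:", model.score(X_test, y_test))
```

With only a few hundred scored essays, expect the held-out accuracy to swing noticeably depending on which essays land in the test split; that instability is the overfitting problem described earlier.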

If you have questions, or would like more details to fuel your home brew AES system, email me at matthew@writelab.com.
