The proliferation of AI has resulted in a demand for tools to detect and identify AI-generated content. However, it’s important that we evaluate these tools to better understand their impact on students and the ethical concerns that surround their use.
AI detectors are algorithms that attempt to “classify,” or label, inputs as either human- or AI-generated. This type of algorithm is also known as a classifier, and it can be evaluated by its training data and its performance metrics, including accuracy, precision, and recall.
To label inputs as either human or AI, a classifier needs to be trained on both human- and AI-generated content. However, human-generated content is not a monolith. A reliable classifier should be trained on a diverse sample of human writing that reflects differences in genre, linguistic background, educational experience, and more.
Consider an AI detector for student writing. If you're trying to detect AI-generated content in a student paper, would you expect an AI detector trained only on web-published news articles to perform well? Probably not. To detect AI-generated content in a student paper, the training data should include human-generated student papers.
Generally, a classifier will have a positive and a negative class. For an AI detector, the positive class might be AI-generated and the negative class might be human-generated. When given an input, the classifier will make either a positive or a negative prediction. Knowing which label is positive and which is negative is helpful for understanding what it means for a classifier to report a false positive. If AI-generated is the positive class, then a false positive can be costly because it means the classifier has wrongly labeled human-generated content as AI.
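With AI-generated as the positive class, every prediction the detector makes falls into one of four outcomes:

- True positive: AI-generated content correctly labeled as AI
- False positive: human-generated content wrongly labeled as AI
- False negative: AI-generated content wrongly labeled as human
- True negative: human-generated content correctly labeled as human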
With positive and negative classes defined, we can focus on three metrics (accuracy, precision, and recall) to help evaluate the overall reliability of a classifier. For the following definitions, assume that AI is the positive class and human is the negative class.
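Using the four outcomes above, the three metrics are defined as:

- Accuracy = (true positives + true negatives) / all predictions
- Precision = true positives / (true positives + false positives)
- Recall = true positives / (true positives + false negatives)

In plain terms, precision asks, “Of the papers flagged as AI, how many actually were AI-generated?” while recall asks, “Of the papers that actually were AI-generated, how many did the detector catch?”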
Marketing for AI detectors will often report a model’s accuracy, but not its precision and recall. It’s important to evaluate a classifier on more than accuracy alone, because accuracy can be misleading by itself. Consider a sample of 100 student papers, where 10 are written by humans and 90 are written by AI. A classifier that simply predicts “AI” for every paper achieves 90% accuracy, yet it wrongly flags every one of the 10 human-written papers. The class imbalance in the sample (10 to 90) makes the accuracy look reassuring while hiding those false positives, and knowing the precision and recall of a model can help to surface such issues. Further, because many AI detectors rely on proprietary machine-learning methods and publicize only their accuracy, there is little incentive for vendors to share the information needed to evaluate the true reliability of their results.
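To see how the metrics behave on that hypothetical sample, here is a minimal sketch in Python. It assumes the true labels and the detector’s predictions are available as simple lists; the names and counts are illustrative, mirroring the example above.

```python
# Hypothetical sample: 10 human-written papers and 90 AI-generated papers.
# The detector predicts "AI" for every paper.
y_true = ["human"] * 10 + ["AI"] * 90
y_pred = ["AI"] * 100

# Count the four outcomes, with "AI" as the positive class.
tp = sum(t == "AI" and p == "AI" for t, p in zip(y_true, y_pred))        # 90
fp = sum(t == "human" and p == "AI" for t, p in zip(y_true, y_pred))     # 10
fn = sum(t == "AI" and p == "human" for t, p in zip(y_true, y_pred))     # 0
tn = sum(t == "human" and p == "human" for t, p in zip(y_true, y_pred))  # 0

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.90
precision = tp / (tp + fp)                  # 0.90
recall = tp / (tp + fn)                     # 1.00

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

Accuracy and recall both look strong here, but the precision below 100% is the only hint that some flagged papers were actually written by humans; in this sample, all 10 human-written papers are false positives.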
Want to learn more about classification and machine learning? Check out Google's Machine Learning Crash Course.
If you are interested in using an AI detector, consider broadening your evaluation to include independent research that evaluates detectors' results.
Again, consider the sample used to train an AI detector. If the sample over- or under-represents certain styles of writing, such as writing by second-language learners or writers with strong regional dialects, the detector may disproportionately flag those writers' work as AI-generated.
When uploading student work to an AI detector, you can't always be certain how that work will be stored or used. Be aware that uploading student assignments containing personally identifiable information may be a violation of FERPA.
Consider why you're using an AI detector. Are you only checking papers that you already suspect of containing AI-generated content? How do your personal and implicit biases play into decisions around the use and application of AI detectors? Instead of relying on detection, consider designing your assignments to be resistant to AI, or to incorporate explicit uses of AI that encourage responsible use.