About eighteen months ago, I started to notice machine-generated text cropping up in student work. As a composition teacher, I immediately wanted to ban it. Text generators have little role in the composition classroom; however, composition teachers had few options for accurately identifying machine-generated text. The basic concerns were that detectors were inaccurate and could produce false positives. In other words, they might flag human writing as machine generated, especially writing by non-native speakers. My colleagues and I put considerable effort into redesigning courses and disincentivizing the use of AI tools such as ChatGPT or Bard to complete assignments. I think these changes have improved our pedagogies. Having survived a school year with AI, however, I was curious how things had changed in the world of detecting machine-generated text. As of mid-July 2024, here is what I’ve found.

Neither humans nor AI-detection systems can identify machine-generated text flawlessly. However, it’s worth noting that detectors are reaching a high level of accuracy, and they are performing better than humans. Looking at research abstracts, J. Elliott Casal and Matt Kessler found that reviewers had “an overall positive identification rate of only 38.9%” (1). Oana Ignat and colleagues found that humans could accurately identify only 71.5% of machine-generated hotel reviews (7). Their AI detector, however, correctly identified roughly 81% of machine-generated hotel reviews (8). Writing in 2023, Deborah Weber-Wulff et al. found similar results when testing twelve different AI-detection programs. The highest performers, Turnitin and Compilatio, approached 80% accuracy (15). In a study published this year, Mike Perkins and colleagues found that Turnitin detected 91% of machine-generated texts (103-104), while human reviewers in the study successfully identified only 54.5% (1). Designing a custom AI detector to find machine-generated app reviews, Seung-Cheol Lee et al. achieved 90% accuracy with their best model (20).

For longer texts, the accuracy of both human reviewers and AI detectors increases. Comparing full-length medical articles, Jae Q. J. Liu et al. found that both professors and ZeroGPT correctly identified 96% of machine-generated texts (1). (Note that GPTZero, a different AI detector, performed considerably worse.) However, the professors also misclassified 12% of human-written content as having been rephrased by AI (8).
Notably, Weber-Wulff mentions that AI detectors tend to have few false positives. In other words, if the software is unsure whether a text was written by a human or a machine, it is more likely to classify it as human written (17). Turnitin, in fact, had zero false positives (26). Perkins, too, noted that Turnitin was reluctant to label text as machine generated. While it correctly identified 91% of papers as machine generated, it flagged only 54.8% of the content in those papers as machine generated, even though the papers were entirely (100%) machine generated (103-104). While this means a certain percentage of machine-generated writing will evade detectors, it should give professors some confidence that text flagged as machine generated very likely is machine generated. In another encouraging finding, Liu found that “No human-written articles were misclassified by both AI-content detectors and the professorial reviewers simultaneously” (11).
There is one caveat, however. AI detectors may flag translated or proofread text as machine generated (Weber-Wulff 26). Once machines are introduced into the composition process, they likely leave artifacts that AI detectors may notice. Strictly speaking, the detectors would not be wrong: machines were introduced into the composition process. However, most professors would find the use of machines for translation or proofreading acceptable.
The studies I have mentioned to this point attempted to consistently identify machine-generated content, but a team of researchers led by Mohammad Kutbi took a different approach. Their goal was to establish consistent human authorship of texts by looking for a “linguistic fingerprint.” In addition to detecting the use of machine writing, this method would also detect contract plagiarism (i.e., someone hiring another person to write an essay for them). This system achieved 98% accuracy (1). While not mentioned in Kutbi’s study, other scholars have found that certain linguistic markers maintain consistency across contexts (Litvinova et al.). For these and other reasons, I believe that linguistic fingerprinting holds the most promise for detecting the use of AI in the composition process.
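To make the idea of a “linguistic fingerprint” a bit more concrete, here is a toy sketch of my own (not the system from Kutbi’s study): it compares how often two texts use a handful of common function words, one simple stylistic signal that tends to stay stable across a writer’s work. The file names are placeholders.

```python
# Toy illustration of a "linguistic fingerprint" (not Kutbi's method):
# compare the relative frequency of common function words in two texts.
from collections import Counter
import math

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "was", "for", "with"]

def fingerprint(text):
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    # Relative frequency of each function word in the text
    return [counts[w] / total for w in FUNCTION_WORDS]

def similarity(a, b):
    # Cosine similarity between two fingerprints (1.0 = identical profile)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Placeholder file names: a known sample of a student's writing and a new submission
known = fingerprint(open("student_previous_essay.txt").read())
submitted = fingerprint(open("submitted_essay.txt").read())
print(f"Stylistic similarity: {similarity(known, submitted):.2f}")
```

A real system would draw on far more features (sentence length, punctuation habits, syntactic patterns) and a trained classifier, but the basic intuition is the same: a submission that departs sharply from a student’s established profile warrants a closer look.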
It’s also worth mentioning that participants in Liu’s study took between four and nine minutes to determine whether an article was written by a human (8). Here, AI detectors may actually aid professors by reducing the time they need to make that determination and by increasing their confidence in the result.
To briefly summarize:
- Both humans and AI detectors are prone to error
- AI detectors are generally better, and in some cases significantly better, than humans at identifying machine-generated text
- AI detectors are fairly conservative in their classification of text as machine generated
Considering these points, I believe that, at the current time, instructors should use AI detectors as a tool to help them determine the authorship of a text. According to Liu and colleagues, Originality.ai is the best overall AI detector and ZeroGPT is the best free AI detector (10). While not as accurate as the preceding tools, Turnitin deserves mention because it did not have any false positives in multiple studies (Liu 6; Weber-Wulff 26). Of course, as with any tool, these detectors need to be used with discretion and with consideration of the larger context of a work. I plan to write another post considering some common flags of machine-generated text.