Academic Integrity
Why AI Plagiarism Detection Is Failing Higher Education, and What to Do Instead
AI detection tools have a 15–30% false positive rate. Pen-and-paper retreats are narrowing what universities actually teach. Eduface's new whitepaper argues there is a better framework, one built on genuine intellectual ownership rather than surface detection.
Eduface · 11 min read · Learning Technologists, DVCs & Policy Leads
This article summarises Eduface's 2025 whitepaper: Beyond the Plagiarism Checker: A New Framework for Academic Integrity in the Age of AI.
Over the winter break of 2022–23, a Princeton student built an AI detection tool.
Within weeks, GPTZero had been used millions of times. That adoption speed
tells you everything about the anxiety gripping higher education at that moment:
ChatGPT had arrived, institutions had no clear answer, and the first response was
detection. For most of the sector, it still is. That response is not working.
Is AI plagiarism detection reliable enough to use in higher education?
No. Commercial AI detection tools report false positive rates of 15–30%, with
particular bias against multilingual and non-native English-speaking students.
Turnitin's own guidance states that its AI writing detection should not be used as
the sole basis for adverse actions against a student. A detection system unreliable
enough that its creator warns against acting on it is not a quality assurance
mechanism. Eduface's 2025 whitepaper argues that the sector needs a different
framework entirely — one built on verifiable intellectual ownership rather than
surface-level text analysis.
Why is AI plagiarism detection failing in higher education?
The detection approach rests on a premise that has not aged well: that AI-generated
text is reliably distinguishable from human-written text. A 2025 evidence synthesis in
MDPI's Information journal, examining peer-reviewed literature from 2021 to 2024,
found that commercial AI detectors frequently produce false positives and lack
transparency, particularly for multilingual and non-native English speakers, whose
more formulaic writing is disproportionately flagged. [1] The RAID Benchmark (ACL
2024) found that detector performance shifts substantially across AI models, domains,
and adversarial edits, including edits that resemble normal student revision. [2] A
Brock University study found that human participants could identify AI-generated text
at a true positive rate of just over 24%: barely above chance. [3] Researchers at the
University of Maryland have concluded that, as the statistical distance between
AI-generated and human-written text continues to narrow, even optimally calibrated
detectors will approach random guessing. This is a structural limit, not a solvable
technical problem.
15–30%
False positive rate in commercial AI detection
tools — meaning a student who wrote their
own work can be flagged for misconduct they
did not commit. Multilingual students are
disproportionately affected.
MDPI Information, Vol. 16 (2025); Brandeis University AI
Literacy guidance.
The equity dimension is significant. Detection tools calibrated on majority-language
writing patterns systematically disadvantage the students least likely to have any
recourse against a misconduct accusation. A regime with a 15–30% false positive rate
does not enforce fairness. It undermines it.
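To make that range concrete, here is a rough back-of-envelope sketch. It assumes a hypothetical cohort of 300 students who all wrote their own work, each checked once, with the quoted 15–30% false positive rate applying uniformly. The function name and cohort size are illustrative assumptions, not figures from the whitepaper.

```python
def expected_false_flags(honest_students: int, false_positive_percent: int) -> float:
    """Expected number of honest students wrongly flagged by a detector."""
    if honest_students < 0:
        raise ValueError("honest_students must be non-negative")
    if not 0 <= false_positive_percent <= 100:
        raise ValueError("false_positive_percent must be between 0 and 100")
    return honest_students * false_positive_percent / 100


# Hypothetical cohort size, chosen only for illustration.
COHORT_SIZE = 300

low = expected_false_flags(COHORT_SIZE, 15)   # lower bound of the quoted range
high = expected_false_flags(COHORT_SIZE, 30)  # upper bound of the quoted range
print(f"Expected wrongful flags in a cohort of {COHORT_SIZE}: {low:.0f} to {high:.0f}")
```

Under those assumptions, between 45 and 90 students in a single module could face an accusation for work they wrote themselves. Real rates will vary by tool, threshold, and student population.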
The detection cycle: a loop with no exit
Essay assignment → Student engages AI → Detection deployed → Detector fails → Return to paper? → Essay assignment
Cycle repeats. No learning outcome improved.
Each institutional response generates the next problem. The arms race has produced no clear winner and
considerable collateral damage to innocent students.
Is going back to pen-and-paper exams the answer to AI cheating?
When detection failed to reassure, many institutions reached for a more drastic
solution: remove the digital environment entirely. Blue book sales at UC Berkeley rose
80% in the 2024–25 academic year. At the University of Florida, the figure was 50%.
Several Russell Group universities reintroduced in-person written exams for courses
that had migrated online during the pandemic, citing AI concerns explicitly. [4]
The logic is understandable. A student writing by hand in a supervised room cannot
paste from a language model. But this response treats a symptom rather than a cause,
and it comes with real educational costs. Timed handwritten examinations under
pressure assess a narrow band of competencies: recall, speed, handwriting legibility.
They are a poor proxy for the analytical writing, structured argumentation, and research
synthesis that most degree programmes claim to develop.
"In retreating to supervised handwriting, universities are not
upholding educational standards. They are quietly abandoning some of the most important ones."
There is also a workforce readiness problem. AI use by law firm professionals
increased 315% between 2023 and 2024, with 79% now incorporating AI tools into
their daily work. [5] In medicine, engineering, finance, and education, the trajectory
is comparable. A graduate who has never worked with AI tools because their university
banned them is not better prepared. They are three years behind before they have
started.
What does academic integrity actually mean in the age of AI?
The word integrity comes from the Latin integritas, meaning wholeness, the state of being
undivided. Applied to academic work, it means that a piece of writing is a genuine
expression of the student who produced it: that the ideas, reasoning, and conclusions
belong to the person who signed their name.
Plagiarism detection software, most notably Turnitin after its commercial launch in
1998, attempted to close the gap between what a student submits and what they
actually authored, measuring surface similarity as a proxy for intellectual
authenticity. What happened next is a familiar institutional story:
the proxy gradually became the thing being assessed. Integrity came to mean "not
flagged."
Compliance definition (what we measure) | Integrity definition (what we should measure)
Not flagged by detection software | Understands and owns the work
Low similarity score | Can explain and defend arguments
Submitted on time | Engaged genuinely with sources
No institutional record | Demonstrates actual learning
Generative AI exposed the absurdity of the compliance definition. If a student submits
an essay written entirely by a language model with no similarity to any indexed source,
Turnitin returns a low similarity score. Under the proxy definition, the work is "not
plagiarised." But does this student understand what they submitted? Can they explain
the argument? Under any meaningful definition of integrity, it is a complete failure.
The proxy and the underlying value had quietly parted ways.
What is the evidence-based alternative to AI detection for academic integrity?
The answer is not to abandon written assignments. The written argument is one of the
most cognitively demanding things a student can do, and research indicates that
argumentative writing assignments are particularly effective for developing critical
thinking, evidence evaluation, and the construction of defensible positions. [6] The
essay is not the problem. The problem is that we have been treating submission as
proof of understanding, when it has only ever been a proxy for it.
The solution, as Eduface's whitepaper argues, is to enrich the written assignment with
an oral layer, and to do so at scale. This is not a new idea. In the sixteenth century,
all examinations at Oxford and Cambridge were oral. The doctoral viva still
functions on exactly this principle. At doctoral level, no university considers submission
alone to be sufficient evidence of intellectual ownership. The conversation is the
verification. The question worth asking is why this principle stops at doctoral level.
The oral enrichment process: integrity by design
01 Student submits paper
02 AI reads the paper
03 Questions generated
04 Student responds
05 Report to lecturer
A student who did not write their paper cannot answer questions designed from within it.
Eduface's oral examination tool reads the submitted paper and generates deep, paper-specific questions. These are
not generic comprehension tests but questions that emerge from that student's own argument, evidence, and
structure.
How does AI oral assessment work at scale in higher education?
The traditional oral examination is time-intensive and difficult to standardise at scale, which is precisely why it has been confined to doctoral defences. A viva requires a
trained examiner, a scheduled session, and significant time per student. In a cohort of
three hundred, that is prohibitive.
Eduface's oral examination tool removes that constraint. The tool reads through the
student's submitted paper and generates deep, contextually specific questions about
it: not generic comprehension questions, but questions derived from the content,
structure, and argumentation of that particular paper. If a student's law essay argues
that promissory estoppel was widened excessively in a particular case, the tool might
ask: why do you consider that widening excessive rather than necessary? What would
a counter-argument look like? How does your position relate to the subsequent case
law you cited in paragraph four? A student who wrote the paper can answer these
questions. A student who did not is immediately exposed.
In an internal evaluation conducted with 200 students across pilot institutions, 89%
of participants rated Eduface's AI-generated oral assessment as preferable to or
equal to prior human-only feedback. The tool integrates with Blackboard,
Brightspace, Moodle, and Canvas through standard LTI protocols. A cohort of 800 is
handled identically to a cohort of 80. Eduface holds approved supplier status on the
Jisc/CHEST framework in the UK and the HEAnet framework in Ireland.
How effective is oral enrichment compared with detection-only approaches?
The whitepaper presents a direct comparison of three assessment design approaches
on two dimensions: learning value and integrity assurance. The results are
unambiguous.
Dimension | Detection only | Written assignment alone | Written + oral layer
Combined effectiveness | 25% | 65% | 95%
Learning incentive | None | Strong | Strongest
Integrity assurance | Unreliable | Limited | Structural
Equity risk | High (ESL bias) | Low | Low
NSS feedback impact | None | Moderate | Significant
The learning value dimension deserves emphasis. Research by Roediger and Karpicke
at Washington University and Purdue University demonstrated that the act of retrieving
information, that is, being tested on it, is itself a more powerful learning mechanism
than restudying the same material. [7] When students know they will be asked to
explain and defend their work in a follow-up oral component, they approach the writing
differently. They cannot afford to submit something they do not understand. This is
integrity by design: the oral layer creates an incentive for genuine engagement from
the moment the assignment is set, not a sanction after the fact.
Frequently asked questions
Q: Does oral assessment work for large cohorts in higher education?
Yes. This is the central design principle. The Eduface oral examination tool integrates
with existing LMS platforms and runs asynchronously within a student-chosen
window. A module of 800 is handled identically to a module of 30. There is no
additional staffing requirement; the resource constraint that made traditional viva
examination unscalable is removed entirely.
Q: Is AI oral assessment fair and legally defensible as an integrity mechanism?
The oral layer generates paper-specific questions from the student's own submitted
text. Because questions target that student's specific arguments, sources, and
structure, they are resistant to preparation in advance. A 2024 systematic review in
IIER found that oral assessments show strong validity and reliability as integrity
mechanisms. Unlike AI detection, the outcome is a positive demonstration of
understanding rather than a probabilistic inference about text origin.
Q: Does this approach comply with the EU AI Act?
Eduface processes all student data on proprietary GPU infrastructure in the
Netherlands and does not pass submissions to external AI providers. The oral
assessment tool operates as a support tool for lecturer decision-making; the lecturer
retains full control over grading. Both the AI assessment and oral tools satisfy the
human oversight requirements of Article 14 of the EU AI Act (Regulation 2024/1689).
Q: Will students resist an oral component on top of a written assignment?
Experience from pilot institutions suggests the opposite. When students know an oral
component follows, engagement with the written work improves. Students write more
carefully, engage more honestly with sources, and arrive at the oral component better
prepared to discuss their own ideas. The anticipation effect, knowing you will need
to account for what you wrote, produces more genuine intellectual engagement than
any detection tool has ever achieved.
Q: How does this affect NSS scores for assessment and feedback?
Assessment and feedback has been the weakest-performing theme in the NSS since
the survey's introduction, with particular concerns about whether feedback is specific
and acted upon. The oral layer provides every student with individualised, substantive
engagement with their specific work, the kind of feedback that research
consistently identifies as most useful for learning. It targets the area where NSS
data show student dissatisfaction is most acute and most persistent.
The integrity crisis in higher education is real, but it has been misdiagnosed. The
problem is not that students use AI. It is that institutions have built assessment systems
that treat submission as proof of learning, and have responded to the challenge of
generative AI by trying, and failing, to police the gap between what students submit
and what they understand. That gap has always existed. AI made it visible. The answer
is assessment design that makes genuine intellectual ownership a structural
requirement, not an assumption. The oral enrichment of written assignments achieves
this, at scale, and with a learning benefit that no detection tool can provide.
Free whitepaper
Beyond the Plagiarism Checker
A New Framework for Academic Integrity in the Age of AI
34 pages. 30 academic references. A practical implementation
pathway for institutions ready to move beyond detection.
References
[1] MDPI Information, Vol. 16 (2025). Evaluating AI detection tools in higher education: qualitative evidence
synthesis of peer-reviewed literature, 2021–2024.
[2] RAID Benchmark (2024). ACL 2024. Detector performance variability across AI models, domains, and
adversarial edits.
[3] Kumar, S., & Mindzak, M. (2024). Brock University study on human identification of AI-generated text;
Brandeis University AI Literacy guidance on detection tool limitations.
[4] Wall Street Journal / The Daily Cardinal (2025). Blue book sales data, University of Florida and UC
Berkeley. PaperSurvey.io, Universities Returning to Paper Exams (2024–25).
[5] NetDocuments / AttorneyJournals (2025). AI-Driven Legal Tech Trends. AI use in law firms up 315%,
2023–2024; 79% of law firm professionals using AI daily. JDJournal (Sept 2025), U.S. Law Schools
Make AI Training Mandatory.
[6] Frontiers in Education (2022). The Challenge of Position-Taking in Argumentative Writing.
Argumentation valuable for critical thinking and evidence evaluation.
[7] Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves
long-term retention. Psychological Science, 17(3), 249–255.
[8] Fenton, A. (2025). Reconsidering the use of oral exams and assessments. Educational Researcher
(SAGE). Nallaya et al. (2024). Validity, reliability and academic integrity of oral assessments: a
systematic review. IIER.
[9] Office for Students NSS 2024 results. Assessment & Feedback consistently lowest-scoring theme
across UK sector. Advance HE (2024) NSS 2024 analysis.