Academic Integrity

Why AI Plagiarism Detection Is Failing Higher Education, and What to Do Instead

AI detection tools have a 15–30% false positive rate. Pen-and-paper retreats are narrowing what universities actually teach. Eduface's new whitepaper argues there is a better framework, one built on genuine intellectual ownership rather than surface detection.

Eduface · 11 min read · Learning Technologists, DVCs & Policy Leads

This article summarises Eduface's 2025 whitepaper: Beyond the Plagiarism Checker: A New Framework for Academic Integrity in the Age of AI.

Over the winter break of 2022–23, a Princeton student built an AI detection tool. Within weeks, GPTZero had been used millions of times. That adoption speed tells you everything about the anxiety gripping higher education at that moment: ChatGPT had arrived, institutions had no clear answer, and the first response was detection. For most of the sector, it still is. That response is not working.

Is AI plagiarism detection reliable enough to use in higher education?

No. Commercial AI detection tools have reported false positive rates of 15–30%, with particular bias against multilingual and non-native English-speaking students. Turnitin's own guidance states that its AI writing detection should not be used as the sole basis for adverse actions against a student. A detection system unreliable enough that its creator warns against acting on it alone is not a quality assurance mechanism. Eduface's 2025 whitepaper argues that the sector needs a different framework entirely, one built on verifiable intellectual ownership rather than surface-level text analysis.

Why is AI plagiarism detection failing in higher education?

The detection approach rests on a premise that has not aged well: that AI-generated text is reliably distinguishable from human-written text. A 2025 evidence synthesis in MDPI's Information journal, examining peer-reviewed literature from 2021 to 2024, found that commercial AI detectors frequently produce false positives and lack transparency, particularly for multilingual and non-native English speakers, whose more formulaic writing is disproportionately flagged.

The RAID Benchmark (ACL 2024) found that detector performance shifts substantially across AI models, domains, and adversarial edits, including edits that resemble normal student revision. A Brock University study found that human participants could identify AI-generated text at a true positive rate of just over 24%: barely above chance. Researchers at the University of Maryland have concluded that, as the statistical distance between AI-generated and human-written text continues to narrow, even optimally calibrated detectors will approach random guessing. This is a structural limit, not a solvable technical problem.
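The structural limit can be made concrete with a short calculation. The sketch below assumes the upper bound published by the Maryland group (Sadasivan et al., 2023), AUROC ≤ 1/2 + TV − TV²/2, where TV is the total variation distance between the distributions of AI-generated and human-written text. The function name and the sample TV values are illustrative assumptions, not figures from the whitepaper.

```python
def best_possible_auroc(tv: float) -> float:
    """Upper bound on any detector's AUROC, given total variation distance tv.

    Assumes the bound AUROC <= 1/2 + TV - TV^2 / 2 (Sadasivan et al., 2023).
    An AUROC of 0.5 is equivalent to random guessing.
    """
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv * tv / 2


if __name__ == "__main__":
    # Illustrative distances only: 1.0 means fully separable text,
    # 0.0 means AI and human text are statistically indistinguishable.
    for tv in (1.0, 0.5, 0.2, 0.1, 0.0):
        print(f"TV = {tv:.1f} -> AUROC can be at most {best_possible_auroc(tv):.3f}")
```

Under this assumption the ceiling falls towards 0.5 as the two distributions converge, regardless of how the detector is tuned.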

15–30%

False positive rate in commercial AI detection tools, meaning a student who wrote their own work can be flagged for misconduct they did not commit. Multilingual students are disproportionately affected.

MDPI Information, Vol. 16 (2025); Brandeis University AI Literacy guidance.

The equity dimension is significant. Detection tools calibrated on majority-language writing patterns systematically disadvantage the students least likely to have any recourse against a misconduct accusation. A regime with a 15–30% false positive rate does not enforce fairness. It undermines it.
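To see what that range means in practice, the sketch below works through the arithmetic for a single assignment. It assumes a hypothetical cohort of 300 students who all wrote their own work, and that the detector flags each honest submission at the stated rate; the function name and cohort size are illustrative only.

```python
def expected_false_flags(honest_students: int, false_positive_rate: float) -> float:
    """Expected number of honest submissions wrongly flagged as AI-written."""
    if honest_students < 0 or not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("need a non-negative cohort and a rate in [0, 1]")
    return honest_students * false_positive_rate


if __name__ == "__main__":
    cohort = 300  # hypothetical: every student in the cohort wrote their own work
    for rate in (0.15, 0.30):  # the 15-30% range cited in the article
        flagged = expected_false_flags(cohort, rate)
        print(f"At a {rate:.0%} false positive rate: about {flagged:.0f} of {cohort} wrongly flagged")
```

On these assumptions, between 45 and 90 students in a cohort of 300 would face an accusation for work they wrote themselves, on every assignment the detector screens.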

The detection cycle: a loop with no exit

[Figure: Essay assignment → Student engages AI → Detection deployed → Detector fails → Return to paper? → Cycle repeats. No learning outcome improved.]

Each institutional response generates the next problem. The arms race has produced no clear winner and considerable collateral damage to innocent students.

Is going back to pen-and-paper exams the answer to AI cheating?

When detection failed to reassure, many institutions reached for a more drastic solution: remove the digital environment entirely. Blue book sales at UC Berkeley rose 80% in the 2024–25 academic year. At the University of Florida, the figure was 50%. Several Russell Group universities reintroduced in-person written exams for courses that had migrated online during the pandemic, citing AI concerns explicitly.

The logic is understandable. A student writing by hand in a supervised room cannot paste from a language model. But this response treats a symptom rather than a cause, and it comes with real educational costs. Timed handwritten examinations under pressure assess a narrow band of competencies: recall, speed, and handwriting legibility. They are a poor proxy for the analytical writing, structured argumentation, and research synthesis that most degree programmes claim to develop.

"In retreating to supervised handwriting, universities are not upholding educational standards. They are quietly abandoning some of the most important ones."

There is also a workforce readiness problem. AI use by law firm professionals increased 315% between 2023 and 2024, with 79% of law firm professionals now incorporating AI tools into their daily work. In medicine, engineering, finance, and education, the trajectory is comparable. A graduate who has never worked with AI tools because their university banned them is not better prepared. They are three years behind before they have started.

What does academic integrity actually mean in the age of AI?

The word integrity comes from the Latin integritas: wholeness, the state of being undivided. Applied to academic work, it means that a piece of writing is a genuine expression of the student who produced it: that the ideas, reasoning, and conclusions belong to the person who signed their name.

Plagiarism detection software, most notably Turnitin after its commercial launch in 1998, attempted to close the gap between submission and ownership technologically, measuring surface similarity as a proxy for intellectual authenticity. What happened next is a familiar institutional story: the proxy gradually became the thing being assessed. Integrity came to mean "not flagged."

| Compliance definition (what we measure) | Integrity definition (what we should measure) |
| --- | --- |
| Not flagged by detection software | Understands and owns the work |
| Low similarity score | Can explain and defend arguments |
| Submitted on time | Engaged genuinely with sources |
| No institutional record | Demonstrates actual learning |

Generative AI exposed the absurdity of the compliance definition. If a student submits an essay written entirely by a language model with no similarity to any indexed source, Turnitin returns a low similarity score. Under the proxy definition, the work is "not plagiarised." Now ask what any meaningful definition of integrity would ask: does this student understand what they submitted? Can they explain the argument? On those terms, it is a complete failure. The proxy and the underlying value had quietly parted ways.

What is the evidence-based alternative to AI detection for academic integrity?

The answer is not to abandon written assignments. The written argument is one of the most cognitively demanding things a student can do, and research indicates that argumentative writing assignments are particularly effective for developing critical thinking, evidence evaluation, and the construction of defensible positions. The essay is not the problem. The problem is that we have been treating submission as proof of understanding, when it has only ever been a proxy for it.

The solution, as Eduface's whitepaper argues, is to enrich the written assignment with an oral layer, and to do so at scale. This is not a new idea. In the sixteenth century, all examinations at Oxford and Cambridge were oral examinations. The doctoral viva still functions on exactly this principle. At doctoral level, no university considers submission alone to be sufficient evidence of intellectual ownership. The conversation is the verification. The question worth asking is why this principle stops at doctoral level.

The oral enrichment process: integrity by design

[Figure: Student submits paper → AI reads the paper → Questions generated → Student responds → Report to lecturer]

A student who did not write their paper cannot answer questions designed from within it. Eduface's oral examination tool reads the submitted paper and generates deep, paper-specific questions. These are not generic comprehension tests but questions that emerge from that student's own argument, evidence, and structure.

How does AI oral assessment work at scale in higher education?

The traditional oral examination is time-intensive and difficult to standardise at scale, which is precisely why it has been confined to doctoral defences. A viva requires a trained examiner, a scheduled session, and significant time per student. In a cohort of three hundred, that is prohibitive.

Eduface's oral examination tool removes that constraint. The tool reads through the student's submitted paper and generates deep, contextually specific questions about it: not generic comprehension questions, but questions derived from the content, structure, and argumentation of that particular paper. If a student's law essay argues that promissory estoppel was widened excessively in a particular case, the tool might ask: why do you consider that widening excessive rather than necessary? What would a counter-argument look like? How does your position relate to the subsequent case law you cited in paragraph four? A student who wrote the paper can answer these questions. A student who did not is immediately exposed.

In an internal evaluation conducted with 200 students across pilot institutions, 89% of participants rated Eduface's AI-generated oral assessment as preferable to or equal to prior human-only feedback. The tool integrates with Blackboard, Brightspace, Moodle, and Canvas through standard LTI protocols. A cohort of 800 is handled identically to a cohort of 80. Eduface holds approved supplier status on the Jisc/CHEST framework in the UK and the HEAnet framework in Ireland.

How effective is oral enrichment compared with detection-only approaches?

The whitepaper presents a direct comparison of three assessment design approaches on two dimensions: learning value and integrity assurance. The results are unambiguous.

| | Detection only | Written assignment alone | Written + oral layer |
| --- | --- | --- | --- |
| Combined effectiveness | 25% | 65% | 95% |
| Learning incentive | None | Strong | Strongest |
| Integrity assurance | Unreliable | Limited | Structural |
| Equity risk | High (ESL bias) | | Low |
| NSS feedback impact | None | Moderate | Significant |

The learning value dimension deserves emphasis. Research by Roediger and Karpicke at Washington University and Purdue University demonstrated that the act of retrieving information (being tested on it) is itself a more powerful learning mechanism than restudying the same material. When students know they will be asked to explain and defend their work in a follow-up oral component, they approach the writing differently. They cannot afford to submit something they do not understand. This is integrity by design: the oral layer creates an incentive for genuine engagement from the moment the assignment is set, not a sanction after the fact.

Frequently asked questions

Does oral assessment work for large cohorts in higher education?

Yes. This is the central design principle. The Eduface oral examination tool integrates with existing LMS platforms and runs asynchronously within a student-chosen window. A module of 800 is handled identically to a module of 30. There is no additional staffing requirement; the resource constraint that made traditional viva examination unscalable is removed entirely.

Is AI oral assessment fair and legally defensible as an integrity mechanism?

The oral layer generates paper-specific questions from the student's own submitted text. Because questions target that student's specific arguments, sources, and structure, they are resistant to preparation in advance. A 2024 systematic review in IIER found that oral assessments show strong validity and reliability as integrity mechanisms. Unlike AI detection, the outcome is a positive demonstration of understanding rather than a probabilistic inference about text origin.

Does this approach comply with the EU AI Act?

Eduface processes all student data on proprietary GPU infrastructure in the Netherlands and does not pass submissions to external AI providers. The oral assessment tool operates as a support tool for lecturer decision-making; the lecturer retains full control over grading. Both the AI assessment and oral tools satisfy the human oversight requirements of Article 14 of the EU AI Act (Regulation 2024/1689).

Will students resist an oral component on top of a written assignment?

Experience from pilot institutions suggests the opposite. When students know an oral component follows, engagement with the written work improves. Students write more carefully, engage more honestly with sources, and arrive at the oral component better prepared to discuss their own ideas. The anticipation effect, knowing you will need to account for what you wrote, produces more genuine intellectual engagement than any detection tool has ever achieved.

How does this affect NSS scores for assessment and feedback?

Assessment and feedback has been the weakest-performing theme in the NSS since the survey's introduction, with particular concerns about whether feedback is specific and acted upon. The oral layer provides every student with individualised, substantive engagement with their specific work, the kind of feedback that research consistently identifies as most useful for learning. NSS data suggests this addresses the area where student dissatisfaction is most acute and most persistent.

The integrity crisis in higher education is real, but it has been misdiagnosed. The problem is not that students use AI. It is that institutions have built assessment systems that treat submission as proof of learning, and have responded to the challenge of generative AI by trying, and failing, to police the gap between what students submit and what they understand. That gap has always existed. AI made it visible. The answer is assessment design that makes genuine intellectual ownership a structural requirement, not an assumption. The oral enrichment of written assignments achieves this, at scale, and with a learning benefit that no detection tool can provide.

Free whitepaper

Beyond the Plagiarism Checker: A New Framework for Academic Integrity in the Age of AI

34 pages. 30 academic references. A practical implementation pathway for institutions ready to move beyond detection.

References

MDPI Information, Vol. 16 (2025). Evaluating AI detection tools in higher education: qualitative evidence synthesis of peer-reviewed literature, 2021–2024.

RAID Benchmark. (2024). ACL 2024. Detector performance variability across AI models, domains, and adversarial edits.

Kumar, S., & Mindzak, M. (2024). Brock University study on human identification of AI-generated text; Brandeis University AI Literacy guidance on detection tool limitations.

Wall Street Journal / The Daily Cardinal (2025). Blue book sales data, University of Florida and UC Berkeley. PaperSurvey.io. Universities Returning to Paper Exams (2024–25).

NetDocuments / AttorneyJournals (2025). AI-Driven Legal Tech Trends. AI use in law firms up 315%, 2023–2024; 79% of law firm professionals using AI daily. JDJournal (Sept 2025). U.S. Law Schools Make AI Training Mandatory.

Frontiers in Education (2022). The Challenge of Position-Taking in Argumentative Writing. Argumentation valuable for critical thinking and evidence evaluation.

Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17(3), 249–255.

Fenton, A. (2025). Reconsidering the use of oral exams and assessments. Educational Researcher (SAGE). Nallaya et al. (2024). Validity, reliability and academic integrity of oral assessments: a systematic review. IIER.

Office for Students NSS 2024 results. Assessment & Feedback consistently lowest-scoring theme across UK sector. Advance HE (2024) NSS 2024 analysis.
