Assessment & Feedback

Can AI Mark Papers Fairly? What the Research and Our Pilots Actually Show

AI essay marking is compared against a human gold standard that is shakier than it looks. Here is what the research and live pilot data actually show about reliability, consistency, and fairness at scale.

Eduface · 9 min read · Written for lecturers & academic leads

You returned 80 essays last semester. Somewhere in the back of your mind, a question lingers: would you score them the same way if you marked them again next week? Research suggests the honest answer is: probably not quite. If marking variability is a problem for experienced humans, what does that mean for AI essay marking? And can an automated system actually do this fairly?

Can AI mark essays fairly?

AI essay marking, when implemented with rubric-based methodology and mandatory human oversight, can match or exceed the consistency of human inter-rater agreement. In Eduface's UK pilot programmes, AI marks aligned with lecturer marks 95% of the time. Fairness depends not on whether a human or an algorithm marks first, but on whether the criteria are clear, the process is transparent, and the final grade remains under human control.

How reliable is human essay marking, really?

This is where the conversation needs to start. AI essay marking is routinely compared against a gold standard that, on examination, is far from solid. Research on marking reliability in UK higher education reveals a picture that should give every institution pause.

Bloxham (2009) reviewed marking and moderation practices across British universities and concluded that current moderation procedures create substantial administrative burden while adding little to actual mark accuracy.[1] The assumption that double-marking or moderation produces reliable grades is, in her analysis, built on false premises rather than evidence.

Sadler (2009) identified a related problem: grade integrity requires not only consistent criteria but also that markers interpret those criteria in comparable ways.[2] In practice, two experienced academics marking the same essay against the same rubric will frequently arrive at different grades. Studies on general impression marking consistently show low inter-rater correlation, with rubric-based assessment producing meaningfully higher but still imperfect agreement.

Figure 1: Marking reliability, human vs AI. Agreement with the reference mark across marking approaches: general impression marking 40%; rubric-based human marking 69%; Eduface AI alignment (UK pilots) 95%. General impression marking shows low inter-rater reliability; rubric-based human marking improves consistency; Eduface AI achieves 95% alignment with lecturer marks in UK pilots. Sources: Bloxham (2009); Eduface UK pilot data (2024–2025).

This does not mean human judgement is wrong. It means that marking is harder and more variable than institutions typically acknowledge. Any serious conversation about AI essay marking has to start with this context rather than treating human marking as a fixed, reliable benchmark.
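To make the reliability discussion concrete, here is a minimal sketch of two common ways to compare a pair of markers: the share of scripts placed in the same grade band, and the Pearson correlation between their marks. The function names, the 10-mark band width, and the sample marks are all illustrative assumptions; none of them come from the studies cited above or from Eduface's pilots.

```python
from math import sqrt

def band_agreement(marks_a, marks_b, band_width=10):
    """Share of scripts where both markers land in the same grade band.

    band_width=10 is an assumption (bands of 40-49, 50-59, 60-69, ...).
    """
    same = sum(1 for a, b in zip(marks_a, marks_b)
               if a // band_width == b // band_width)
    return same / len(marks_a)

def pearson(xs, ys):
    """Pearson correlation coefficient between two lists of marks."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical marks from two markers on the same five essays.
marker_a = [62, 58, 71, 45, 66]
marker_b = [65, 52, 68, 48, 74]

agreement = band_agreement(marker_a, marker_b)  # 3 of 5 essays share a band
correlation = pearson(marker_a, marker_b)
```

The two measures answer different questions. Two markers can correlate strongly while still placing many essays in different bands, which is why a headline agreement figure should always state which measure it uses.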

What does "fair" AI marking actually mean?

Fairness in assessment has several distinct dimensions. It is not simply a question of whether an AI or a human holds the pen.

Hattie and Timperley (2007) identified that effective feedback must be accurate, timely, and tied to clear learning goals.[3] Fairness, in that framework, requires that every student receives feedback that genuinely helps them improve. When marking is delayed by three to six weeks because of workload constraints, or when feedback varies sharply depending on which marker a student happens to receive, the system is already failing the fairness test before AI enters the room.

A fairer approach would look like this:

• Clear, pre-defined rubric criteria shared with students before submission.
• Consistent application of those criteria across all submissions in a cohort.
• Timely, specific feedback that identifies what the student did well and what needs to develop.
• Human review and sign-off before grades are released.

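AI essay marking, when properly implemented, is designed to deliver on all four points. The same rubric is applied in the same way to every submission, which removes the variation that comes from which marker a student happens to receive. Turnaround drops from weeks to days. The lecturer retains final control over every grade.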

How does AI essay marking work in practice?

The term "AI essay marking" covers a wide range of implementations, from simple keyword detection to sophisticated language models trained on large datasets of assessed student work. What matters for fairness is not the technology itself but the methodology and governance around it.

Figure 2: Eduface's AI marking workflow: (1) student submits via LMS; (2) AI scores against rubric criteria; (3) lecturer reviews and may edit; (4) grade released with feedback; (5) student acts on feedback. Two lecturer modes are available: blind mode (lecturer marks independently first, AI grades revealed for calibration afterwards) and AI-visible mode (AI marks shown upfront; lecturer edits or overrides before any grade is released). The lecturer holds final authority in both modes.

Eduface supports two marking modes, both of which keep the lecturer in control. In blind mode, the lecturer completes their own marking before Eduface reveals the AI grades. This allows lecturers to check their own consistency and remove bias from their initial marking. In AI-visible mode, the AI grades are shown upfront and the lecturer can edit or override any mark before it is released to students. Institutions can set which mode is available or mandatory for their staff, giving them governance over the process at an institutional level.
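The two modes can be read as a small set of rules about when the AI mark is visible and when a grade may be released. The sketch below models those rules as described in this section. It is an illustration only, not Eduface's implementation: the `Mode`, `Submission`, `visible_ai_mark`, `sign_off`, and `release` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Mode(Enum):
    BLIND = "blind"            # lecturer marks first, AI mark revealed afterwards
    AI_VISIBLE = "ai_visible"  # AI mark shown upfront, lecturer edits or overrides

@dataclass
class Submission:
    ai_mark: float
    mode: Mode
    lecturer_mark: Optional[float] = None
    released: bool = False

    def visible_ai_mark(self) -> Optional[float]:
        """AI mark as the lecturer sees it; hidden in blind mode until they have marked."""
        if self.mode is Mode.BLIND and self.lecturer_mark is None:
            return None
        return self.ai_mark

    def sign_off(self, lecturer_mark: Optional[float] = None) -> None:
        """Record the lecturer's mark, or accept the AI mark (AI-visible mode only)."""
        if lecturer_mark is None:
            if self.mode is Mode.BLIND:
                raise ValueError("blind mode requires an independent lecturer mark")
            lecturer_mark = self.ai_mark
        self.lecturer_mark = lecturer_mark

    def release(self) -> float:
        """Release the grade; refused until a lecturer has signed off."""
        if self.lecturer_mark is None:
            raise PermissionError("no grade is released without lecturer sign-off")
        self.released = True
        return self.lecturer_mark

# Hypothetical usage: a blind-mode submission with an AI mark of 62.
essay = Submission(ai_mark=62, mode=Mode.BLIND)
hidden = essay.visible_ai_mark()    # None: the lecturer has not marked yet
essay.sign_off(lecturer_mark=64)    # lecturer records an independent mark
revealed = essay.visible_ai_mark()  # 62: now available for calibration
final_grade = essay.release()       # 64: the lecturer's mark is what is released
```

The design point is that `release` depends only on lecturer sign-off, so the released grade is always one a human has confirmed, whichever mode the institution chooses.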

Does Eduface's AI marking hold up against lecturer judgement?

Pilot data from UK institutions using Eduface shows 95% alignment between AI-generated marks and lecturer marks. This figure is derived from live assessments across written assignments and exam questions, with marks verified against the same rubric by both the AI and the human marker.

The 5% of cases where the AI and the lecturer diverge are precisely the cases where human review adds the most value: edge cases, ambiguous arguments, or work that requires contextual knowledge the rubric does not fully capture. Eduface flags these cases for closer review rather than releasing them automatically.
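An alignment figure like the one above depends on how "aligned" is defined, which this article does not specify. As a sketch under a stated assumption, the snippet below treats two marks as aligned when they differ by no more than a tolerance, and lists the remaining submissions for closer review. The `alignment_report` function, the 5-mark tolerance, and the sample marks are hypothetical, not Eduface's method or data.

```python
def alignment_report(ai_marks, lecturer_marks, tolerance=5):
    """Return (alignment rate, indices of submissions to flag for review).

    A pair counts as aligned when the absolute difference between the AI mark
    and the lecturer mark is at most `tolerance` (an assumed threshold).
    """
    flagged = [i for i, (ai, lecturer) in enumerate(zip(ai_marks, lecturer_marks))
               if abs(ai - lecturer) > tolerance]
    rate = 1 - len(flagged) / len(ai_marks)
    return rate, flagged

# Hypothetical paired marks for four submissions.
ai_marks = [62, 55, 70, 48]
lecturer_marks = [64, 57, 58, 50]

rate, flagged = alignment_report(ai_marks, lecturer_marks)
# rate is 0.75 and flagged is [2]: only the third submission differs by more than 5
```

Tightening or loosening the tolerance changes the headline rate, so any reported alignment percentage is best read alongside the threshold that produced it.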

Pilot finding: Across UK pilot programmes including Bath Spa University, Eduface AI marks aligned with lecturer marks in 95% of cases. Where divergence occurred, lecturer override took an average of under three minutes per assignment.

Falchikov and Goldfinch (2000) conducted a meta-analysis comparing peer assessment marks to lecturer marks and found correlations typically in the range of 0.60 to 0.80, depending on the assessment type and training provided.[4] AI marking, trained on verified assessor data and applied against a consistent rubric, compares favourably with this benchmark in the pilots, although a percentage alignment rate and a correlation coefficient are different measures and cannot be compared like for like.

What does the EU AI Act require for AI essay marking?

The EU AI Act (Regulation 2024/1689) classifies AI systems used in educational assessment as high-risk under Annex III, point 3(b).[5] This classification covers automated exam scoring, student placement decisions, and evaluation of academic performance. High-risk classification does not prohibit use: it requires compliance.

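The two most relevant obligations for institutions deploying AI essay marking tools are Article 14 (human oversight, meaning the system must allow a qualified person to monitor it, interpret its outputs, and disregard or override them) and Article 13 (transparency, meaning the system must be transparent enough for the institution deploying it to understand how it reaches its outputs and to use them appropriately). Deploying institutions must also inform the people affected, here students, that a high-risk AI system is involved in decisions about them (Article 26).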

Eduface is designed around both requirements. Human override is built into the workflow, not added as an afterthought. Feedback generated by the AI explains the reasoning behind each mark, rather than delivering a score without justification. Institutions in the UK and EU can deploy Eduface with confidence that the governance model is aligned with regulatory expectations.

| Assessment approach | Consistency | Turnaround | Feedback quality | Human oversight | EU AI Act compliant |
|---|---|---|---|---|---|
| General impression marking (human) | Low | Variable | Variable | Yes | N/A |
| Rubric-based marking (human) | Moderate | Variable | Moderate | Yes | N/A |
| Unreviewed AI grading (no human step) | High | Fast | Moderate | No | No |
| Eduface: blind mode | High | Fast | High | Yes | Yes |
| Eduface: AI-visible mode | High | Fast | High | Yes | Yes |

Frequently asked questions

Can AI marking detect plagiarism or AI-generated student work?

AI marking tools such as Eduface assess the quality and content of a submitted piece of work against a rubric. Plagiarism and AI-generated content detection are separate functions, typically handled by dedicated tools such as Turnitin. The two systems are complementary and should be used in conjunction, not as substitutes for each other.

Will students trust an AI-marked grade?

Student trust depends on transparency. When students are told in advance that AI provides a first-pass mark, that a lecturer reviews every grade, and that they can request a human review, trust levels are comparable to existing marking processes. In NSS-focused institutions, the bigger trust issue is often delayed, generic feedback rather than who produced it.

How does Eduface handle subjectivity in essay marking?

Eduface marks against the rubric criteria the lecturer defines. Where the rubric captures the judgement (argument quality, use of evidence, structure), the AI applies it consistently. Where the rubric cannot capture nuance, Eduface flags the case for closer lecturer review. The system is designed to support, not replace, academic judgement on genuinely complex cases.

What types of written assessment can Eduface mark?

Eduface covers written assignments (essays, case studies, reflective reports, short-answer questions), written exam questions, and open-ended assessments. Eduface also has a dedicated model for oral and spoken examinations. The system operates across all major LMS platforms including Blackboard, Brightspace, Moodle, and Canvas.

Is Eduface approved for use in UK institutions?

Yes. Eduface is an approved supplier on the Jisc/CHEST framework, which means UK institutions can procure the platform without running a separate tender process. Eduface is also on the HEAnet framework in Ireland. All processing runs on proprietary GPU infrastructure in the Netherlands and does not rely on third-party AI APIs such as OpenAI.

The question is not whether AI can mark fairly

The evidence from both academic research and live pilot programmes shows that rubric-based AI assessment, with mandatory human review, produces marks that are at least as consistent as human-to-human marking. The more accurate question is whether an institution's current marking process is as consistent and fair as it believes. For most, the honest answer requires examining the evidence with the same rigour they would apply to any other quality assurance question.

See Eduface in action

Find out how Eduface fits into your institution's assessment workflow. Request a free demo or create a free lecturer account to try it with your own assignments.

References

1. Bloxham, S. (2009). Marking and moderation in the UK: false assumptions and wasted resources. Assessment & Evaluation in Higher Education, 34(2), 209–220.

2. Sadler, D. R. (2009). Grade integrity and the representation of academic achievement. Studies in Higher Education, 34(7), 807–826.

3. Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.

4. Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-analysis comparing peer and teacher marks. Review of Educational Research, 70(3), 287–322.

5. European Parliament and Council of the EU. (2024). Regulation (EU) 2024/1689 (Artificial Intelligence Act). Official Journal of the European Union.
