Technology · Execution autonomy

Content Moderation

An agent that moderates user content against policy, auto-actioning clear cases and escalating the ambiguous ones.

HAL Score: 33 / 40 · Accountable


Workflow

1. Content submitted or reported
2. Agent classifies against the policy taxonomy
3. Clear violations actioned automatically
4. Borderline cases queued for human review
5. Appeals routed to a separate human track

Governance design

✓ Auto-action limited to high-confidence, clear-cut categories.
✓ Hard limit: account-level penalties require a human.
✓ Every decision logged with the policy clause applied.
✓ Appeals never reviewed by the deciding system.
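
The first three controls above can be sketched as a small guard plus an audit record. This is a minimal illustration, not the deployed implementation: the category names, the 0.95 threshold, the action names, and the `Decision` fields are all hypothetical, since the source does not specify them.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

# Hypothetical values: the source names neither categories nor a threshold.
AUTO_ACTION_CATEGORIES = {"spam", "counterfeit"}
CONFIDENCE_THRESHOLD = 0.95
ACCOUNT_LEVEL_ACTIONS = {"suspend_account", "ban_account"}


@dataclass
class Decision:
    content_id: str
    category: str
    confidence: float
    action: str
    policy_clause: str
    decided_by: str  # "agent" or a human reviewer id


def may_auto_action(category: str, confidence: float, action: str) -> bool:
    """Return True only if the agent may take this action without a human."""
    if action in ACCOUNT_LEVEL_ACTIONS:
        return False  # hard limit: account-level penalties require a human
    return (
        category in AUTO_ACTION_CATEGORIES
        and confidence >= CONFIDENCE_THRESHOLD
    )


def log_decision(decision: Decision) -> str:
    """Serialise a decision as JSON, rejecting entries with no policy clause."""
    if not decision.policy_clause:
        raise ValueError("every decision must cite the policy clause applied")
    record = asdict(decision)
    record["logged_at"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(record, sort_keys=True)
```

Checking the account-level limit before anything else means no confidence score, however high, can unlock a suspension for the agent.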

Escalation paths

↗ Confidence below threshold → human moderator.
↗ Protected category or newsworthy context → senior review.
↗ Any appeal → independent human reviewer.
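
One way to express these three paths is a single routing function. The sketch below assumes a precedence order (appeals first, then protected or newsworthy content, then low confidence) that the source does not state. The `Case` fields, the `Route` names, and the default threshold are likewise illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_ACTION = "auto_action"
    HUMAN_MODERATOR = "human_moderator"
    SENIOR_REVIEW = "senior_review"
    INDEPENDENT_REVIEWER = "independent_reviewer"


@dataclass
class Case:
    confidence: float
    is_appeal: bool = False
    protected_category: bool = False
    newsworthy: bool = False


def route(case: Case, threshold: float = 0.95) -> Route:
    """Pick a destination for a case; the check order is an assumption."""
    if case.is_appeal:
        # Appeals never return to the deciding system.
        return Route.INDEPENDENT_REVIEWER
    if case.protected_category or case.newsworthy:
        return Route.SENIOR_REVIEW
    if case.confidence < threshold:
        return Route.HUMAN_MODERATOR
    return Route.AUTO_ACTION
```

For example, `route(Case(confidence=0.99, newsworthy=True))` returns `Route.SENIOR_REVIEW` even though the confidence is high, because context-based escalation is checked before the threshold.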

Ownership model

Head of Trust & Safety, with a policy lead as deputy.

Lessons learned

Reserving account-level penalties for humans kept the highest-impact decisions accountable.
Separating appeals from the deciding system gave users a genuine route to redress.

Worked example

Marketplace Moderation Agent

Execution → Autonomous

A marketplace deployed an agent to moderate listings and act on policy violations at scale.

Initial workflow

The agent removed listings and suspended seller accounts automatically across all categories, with appeals handled by the same system that made the decision.

Risks identified

! Account suspensions (high-impact) were fully automated.
! No confidence threshold; clear and borderline cases treated alike.
! Appeals judged by the deciding system: no real redress.
! No protection for newsworthy or protected-category content.

HAL assessment

Taking account-level action placed this agent in the Autonomous band, but it lacked the limits, escalation, and independent review that the band requires. The appeals design was a structural accountability failure.

Improvements made

✓ Limited auto-action to high-confidence listing removals.
✓ Made account suspensions a human-only decision.
✓ Routed all appeals to an independent human reviewer.
✓ Added hard protections for protected and newsworthy categories.
✓ Stood up live monitoring of reversal and appeal-success rates.
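
The two monitored rates in the last item could be computed along these lines. The definitions are assumptions, as the source does not give formulas: reversal rate is taken as the share of auto-actioned items later reversed on review, and appeal-success rate as the share of appeals that overturned the original decision. The `Outcome` record is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    auto_actioned: bool
    reversed_on_review: bool
    appealed: bool
    appeal_succeeded: bool  # True when the appeal overturned the decision


def _rate(numerator: int, denominator: int) -> float:
    """Return a ratio, or 0.0 when there is nothing to measure yet."""
    return numerator / denominator if denominator else 0.0


def monitoring_rates(outcomes: List[Outcome]) -> Dict[str, float]:
    """Compute reversal and appeal-success rates over a batch of outcomes."""
    auto = [o for o in outcomes if o.auto_actioned]
    appeals = [o for o in outcomes if o.appealed]
    return {
        "reversal_rate": _rate(
            sum(o.reversed_on_review for o in auto), len(auto)
        ),
        "appeal_success_rate": _rate(
            sum(o.appeal_succeeded for o in appeals), len(appeals)
        ),
    }
```

A rising value in either rate would suggest the auto-action threshold or category list is too permissive and needs review.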

Domain	Before	After	Change
Ownership	2	4	+2
Authority	2	4	+2
Limits	1	5	+4
Escalation	1	5	+4
Evidence	2	4	+2
Monitoring	2	5	+3
Review	1	4	+3
Liability	2	4	+2
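
The table can be totalled as a quick check. The sketch assumes the overall score is the plain sum of the eight domain scores, each out of 5 for a maximum of 40. The source does not state the scoring formula, so treat the totals as illustrative.

```python
# Domain scores copied from the table above: (before, after).
scores = {
    "Ownership": (2, 4),
    "Authority": (2, 4),
    "Limits": (1, 5),
    "Escalation": (1, 5),
    "Evidence": (2, 4),
    "Monitoring": (2, 5),
    "Review": (1, 4),
    "Liability": (2, 4),
}

# Assumption: the overall score is an unweighted sum of the domain scores.
before_total = sum(before for before, _ in scores.values())
after_total = sum(after for _, after in scores.values())
changes = {name: after - before for name, (before, after) in scores.items()}

print(before_total, after_total)  # 13 35
```

Under that assumption the redesign moves the agent from 13 to 35 out of 40, with the largest gains (+4) in Limits and Escalation.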

Deployment recommendation

Approved for a constrained autonomous role. Reserving account-level penalties for humans, and separating appeals from the deciding system, were the changes that made it defensible.
