A Suite of SEO Optimization Tasks

We benchmark AI agents on realistic SEO workflows on a simulated outdoor e-commerce site. The suite has 32 tasks spanning technical audits, content strategy, link building, and local SEO, each graded step by step against expert rubrics.

Difficulty levels

Three levels: single-fix, multi-step, and full-scope workflows.

L1 — Single-fix tasks. One SEO issue, one diagnosis, one action (~3-4 steps).

L2 — Multi-step workflows with dependencies between findings (4-6 steps).

L3 — Full-scope programs: site audits, migration planning, governance setup — the kind of work that takes an SEO specialist days (13-25 steps).

Hybrid judge

Programmatic checks for data gathering + LLM rubrics for action quality.

Programmatic checks verify the agent called the right diagnostic tools — no partial credit for guessing.

LLM-graded steps evaluate whether write actions (fixing metadata, adding schema, creating content) were done correctly and comprehensively. Each step is scored against a detailed rubric preloaded with the correct answers.
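The split between the two kinds of checks can be illustrated with a minimal sketch. It is not the benchmark's actual grader: the `Step` type, the `grade_step` function, the tool names, and the pass/fail form of the LLM judge are all assumptions made for illustration, and the judge is replaced by a stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set


@dataclass(frozen=True)
class Step:
    """One graded step of a task (hypothetical structure)."""
    name: str
    kind: str  # "programmatic" or "llm"
    required_tools: FrozenSet[str] = frozenset()
    rubric: str = ""


def grade_step(step: Step, called_tools: Set[str], agent_output: str,
               llm_judge: Callable[[str, str], bool]) -> bool:
    """Return True if the step passes."""
    if step.kind == "programmatic":
        # All-or-nothing: every required diagnostic tool must have been called.
        return step.required_tools <= called_tools
    if step.kind == "llm":
        # Delegate quality judgement to a rubric-driven judge (assumed pass/fail).
        return llm_judge(step.rubric, agent_output)
    raise ValueError(f"unknown step kind: {step.kind}")


def stub_judge(rubric: str, output: str) -> bool:
    # Stand-in for a real LLM call; only checks that the fix is mentioned.
    return "noindex" in output


steps = [
    # Tool names below are invented for this example.
    Step("gather page data", "programmatic",
         required_tools=frozenset({"crawl_site", "get_page_meta"})),
    Step("fix metadata", "llm",
         rubric="Removes the stray noindex tag from the product page."),
]

# The agent called only one of the two required diagnostic tools.
results = [grade_step(s, {"crawl_site"}, "Removed noindex from /tents", stub_judge)
           for s in steps]
```

Here the data-gathering step fails because one required tool was never called, even though the write action itself would pass the rubric.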

Realistic environment

Simulated e-commerce site with 34 SEO tools. No hints, just a goal.

Agents work on GreenLeaf Outdoors — a simulated outdoor gear e-commerce site with planted SEO issues across 15 data files. 34 tools (22 read, 12 write) mirror real SEO practitioner workflows. No multiple choice. No hints. Just a goal and an API.

Model performance overview

Percentage of tested tasks where the model scored 100% on all required steps.

Opus 4.6 (Anthropic): 55%
Sonnet 4.6 (Anthropic): 31%
GPT-5.3 (OpenAI): 25%
Gemini 3 Pro (Google): Pending
Grok 4 (xAI): Pending
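The headline metric is strict: a task counts only if every required step passed. A minimal sketch of that calculation follows, assuming per-step results are available as booleans; the function name, the input shape, and the sample data are illustrative and not taken from the benchmark.

```python
from typing import Dict, List


def fully_solved_rate(task_steps: Dict[str, List[bool]]) -> float:
    """Percentage of tasks in which every required step passed."""
    if not task_steps:
        raise ValueError("no tasks to score")
    solved = sum(1 for steps in task_steps.values() if steps and all(steps))
    return 100.0 * solved / len(task_steps)


# Made-up results for four tasks; a single failed step disqualifies the task.
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [True, True, True, True],
    "task_d": [False, True, True],
}
rate = fully_solved_rate(example)  # 2 of 4 tasks fully solved
```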

Results analysis

These results suggest some SEO tasks can be fully automated, while longer-horizon work becomes substantially easier when pairing a human with AI.

Automate: Single-fix tasks and structured multi-step workflows, such as noindex tags, broken links, missing schema, toxic link disavows, crawl budget fixes, and competitive analysis. The two strongest models tested, Opus 4.6 and Sonnet 4.6, average 91–100% step accuracy on these levels (GPT-5.3 averages 65–78%), handling the kind of work that takes a specialist 15–60 minutes per task.

Co-pilot: Full-scope programs, such as site audits, migration planning, link building, and governance setup. AI gathers the right data and drafts solid reports, but skips post-action validation and misses edge cases that require cross-referencing multiple data sources. These are the multi-day projects where AI handles 70–90% of the steps on average, but a human needs to verify the output before it ships.

Tasks fully solved (100% of required steps)

Level 1 Single-fix

Opus 4.6: 100%
Sonnet 4.6: 50%
GPT-5.3: 25%

Level 2 Multi-step

Opus 4.6: 70%
Sonnet 4.6: 75%
GPT-5.3: 25%

Level 3 Full-scope

Opus 4.6: (no value shown)
Sonnet 4.6: (no value shown)
GPT-5.3: 25%

Average step accuracy across tasks

Level 1 Single-fix

Opus 4.6: 100%
Sonnet 4.6: 91%
GPT-5.3: 65%

Level 2 Multi-step

Opus 4.6: 92%
Sonnet 4.6: 96%
GPT-5.3: 78%

Level 3 Full-scope

Opus 4.6: 90%
Sonnet 4.6: 78%
GPT-5.3: 70%
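Step accuracy is a softer measure than the fully-solved rate, which is why a model can score high here while fully solving few long tasks. The sketch below assumes the metric is computed per task and then averaged across tasks; that reading of "across tasks", the function name, and the sample data are assumptions rather than the benchmark's documented method.

```python
from typing import Dict, List


def average_step_accuracy(task_steps: Dict[str, List[bool]]) -> float:
    """Mean of per-task step pass rates, as a percentage."""
    if not task_steps:
        raise ValueError("no tasks to score")
    per_task = [sum(steps) / len(steps) for steps in task_steps.values()]
    return 100.0 * sum(per_task) / len(per_task)


# Two made-up 16-step tasks: one perfect, one with two failed steps.
long_tasks = {
    "site_audit": [True] * 16,
    "migration_plan": [True] * 14 + [False] * 2,
}
accuracy = average_step_accuracy(long_tasks)  # (1.0 + 0.875) / 2 as a percentage
```

In this example the average step accuracy is 93.75%, yet only one of the two tasks would count as fully solved.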