A Suite of SEO Optimization Tasks

We benchmark AI agents on real SEO workflows across a simulated outdoor e-commerce site. 32 tasks spanning technical audits, content strategy, link building, and local SEO — each graded step-by-step against expert rubrics.

Difficulty levels

Three levels: single-fix, multi-step, and full-scope workflows.

L1 — Single-fix tasks. One SEO issue, one diagnosis, one action (~3-4 steps).

L2 — Multi-step workflows with dependencies between findings (4-6 steps).

L3 — Full-scope programs: site audits, migration planning, governance setup — the kind of work that takes an SEO specialist days (13-25 steps).

Hybrid judge

Programmatic checks for data gathering + LLM rubrics for action quality.

Programmatic checks verify the agent called the right diagnostic tools — no partial credit for guessing.

LLM-graded steps evaluate whether write actions (fixing metadata, adding schema, creating content) were done correctly and comprehensively. Each uses a detailed rubric with the preloaded correct answers.
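The split between programmatic and LLM-graded steps can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual judge: the `ToolCall` and `Step` types, the tool names, and the `llm_judge` callable are all hypothetical, and the rubric scoring is reduced to a function that returns a number between 0 and 1.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class ToolCall:
    name: str                      # tool the agent invoked (hypothetical names)
    arguments: dict = field(default_factory=dict)


@dataclass
class Step:
    step_id: str
    kind: str                      # "programmatic" or "llm"
    required_tools: Set[str] = field(default_factory=set)  # programmatic steps
    rubric: str = ""               # LLM-graded steps


# Assumed interface: takes a rubric and the agent's trace, returns a score in [0, 1].
LlmJudge = Callable[[str, List[ToolCall]], float]


def grade_programmatic(step: Step, trace: List[ToolCall]) -> float:
    """All-or-nothing: every required diagnostic tool must appear in the trace."""
    called = {call.name for call in trace}
    return 1.0 if step.required_tools <= called else 0.0


def grade_step(step: Step, trace: List[ToolCall], llm_judge: LlmJudge) -> float:
    if step.kind == "programmatic":
        return grade_programmatic(step, trace)
    # Write actions are scored against the rubric; clamp to [0, 1] defensively.
    return min(1.0, max(0.0, llm_judge(step.rubric, trace)))


def grade_task(steps: List[Step], trace: List[ToolCall], llm_judge: LlmJudge) -> Dict[str, float]:
    """Return a per-step score for one task."""
    return {step.step_id: grade_step(step, trace, llm_judge) for step in steps}
```

In this sketch a programmatic step scores either 0 or 1, which mirrors the "no partial credit" rule, while an LLM-graded step may receive any score the rubric judge assigns.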

Realistic environment

Simulated e-commerce site with 34 SEO tools. No hints, just a goal.

Agents work on GreenLeaf Outdoors — a simulated outdoor gear e-commerce site with planted SEO issues across 15 data files. 34 tools (22 read, 12 write) mirror real SEO practitioner workflows. No multiple choice. No hints. Just a goal and an API.

Model performance overview

Percentage of tested tasks where the model scored 100% on all required steps.

Opus 4.6 (Anthropic): 55%
Sonnet 4.6 (Anthropic): 31%
GPT-5.3 (OpenAI): 25%
Gemini 3 Pro (Google): Pending
Grok 4 (xAI): Pending

Results analysis

These results suggest some SEO tasks can be fully automated, while longer-horizon work becomes substantially easier when pairing a human with AI.

Automate: Single-fix tasks and structured multi-step workflows, such as noindex tags, broken links, missing schema, toxic link disavows, crawl budget fixes, and competitive analysis. The two strongest models tested (Opus 4.6 and Sonnet 4.6) average 90%+ step accuracy on these, while GPT-5.3 trails at 65–78%. This is the kind of work that takes a specialist 15–60 minutes per task.

Co-pilot: Full-scope programs, such as site audits, migration planning, link building, and governance setup. AI gathers the right data and drafts solid reports, but skips post-action validation and misses edge cases that require cross-referencing multiple data sources. On these multi-day projects the models complete 70–90% of the steps on average, yet none fully solves more than 25% of the tasks, so a human needs to verify the output before it ships.

Tasks fully solved (100% of required steps)

Model         Level 1 Single-fix    Level 2 Multi-step    Level 3 Full-scope
Opus 4.6      100%                  70%                   0%
Sonnet 4.6    50%                   75%                   0%
GPT-5.3       25%                   25%                   25%

Average step accuracy across tasks

Model         Level 1 Single-fix    Level 2 Multi-step    Level 3 Full-scope
Opus 4.6      100%                  92%                   90%
Sonnet 4.6    91%                   96%                   78%
GPT-5.3       65%                   78%                   70%
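The two metrics can diverge sharply on long tasks: a model can reach 90% average step accuracy at Level 3 and still fully solve 0% of tasks if it misses a step or two on each one. The sketch below shows one way the two numbers could be computed from per-step scores. The function names and the sample data are hypothetical (not the benchmark's actual per-task results), and it assumes every step in a task's list is a required step.

```python
from typing import Dict, List


def step_accuracy(step_scores: List[float]) -> float:
    """Mean score across the steps of one task."""
    return sum(step_scores) / len(step_scores)


def fully_solved(step_scores: List[float]) -> bool:
    """A task counts as solved only if every required step scored 100%."""
    return all(score == 1.0 for score in step_scores)


def summarize(tasks: Dict[str, List[float]]) -> Dict[str, float]:
    """Aggregate both metrics over a set of tasks, as percentages."""
    n = len(tasks)
    solved = sum(fully_solved(scores) for scores in tasks.values())
    mean_acc = sum(step_accuracy(scores) for scores in tasks.values()) / n
    return {
        "fully_solved_pct": round(100 * solved / n, 1),
        "avg_step_accuracy_pct": round(100 * mean_acc, 1),
    }


# Hypothetical data: four 20-step tasks, each with 18 perfect steps and 2 misses.
sample = {f"task_{i}": [1.0] * 18 + [0.0] * 2 for i in range(4)}
print(summarize(sample))
# → {'fully_solved_pct': 0.0, 'avg_step_accuracy_pct': 90.0}
```

With 13-25 steps per Level 3 task, a single missed step is enough to drop a task out of the fully-solved count, which is why the two tables read so differently at that level.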