A Suite of SEO Optimization Tasks
We benchmark AI agents on real SEO workflows across a simulated outdoor e-commerce site. 32 tasks spanning technical audits, content strategy, link building, and local SEO — each graded step-by-step against expert rubrics.
Difficulty levels
Three levels: single-fix, multi-step, and full-scope workflows.
L1 — Single-fix tasks. One SEO issue, one diagnosis, one action (~3-4 steps).
L2 — Multi-step workflows with dependencies between findings (4-6 steps).
L3 — Full-scope programs: site audits, migration planning, governance setup — the kind of work that takes an SEO specialist days (13-25 steps).
Hybrid judge
Programmatic checks for data gathering + LLM rubrics for action quality.
Programmatic checks verify the agent called the right diagnostic tools — no partial credit for guessing.
LLM-graded steps evaluate whether write actions (fixing metadata, adding schema, creating content) were done correctly and comprehensively. Each LLM-graded step is scored against a detailed rubric preloaded with the correct answers.
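The split between the two kinds of grading can be sketched roughly as follows. This is an illustrative sketch only, not the benchmark's actual judge: the `Step` structure, the tool names (`get_crawl_report`, `update_meta_tags`), and the `llm_judge` callable are all hypothetical stand-ins, and the real rubric format is not shown on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    kind: str                            # "programmatic" or "llm"
    required_tool: Optional[str] = None  # hypothetical: tool a data-gathering step must call
    rubric: Optional[str] = None         # hypothetical: rubric text for an LLM-graded step


def grade_task(
    steps: List[Step],
    tool_calls: List[dict],
    llm_judge: Callable[[str, List[dict]], float],
) -> Dict[str, float]:
    """Return a score in [0, 1] for each step of one task."""
    called = {call["tool"] for call in tool_calls}
    scores = {}
    for step in steps:
        if step.kind == "programmatic":
            # Binary check: the diagnostic tool was called or it was not.
            scores[step.name] = 1.0 if step.required_tool in called else 0.0
        else:
            # Delegate write-action quality to the (assumed) LLM judge.
            scores[step.name] = llm_judge(step.rubric, tool_calls)
    return scores


# Usage with a stub judge standing in for a real model call.
steps = [
    Step("diagnose", "programmatic", required_tool="get_crawl_report"),
    Step("fix_metadata", "llm", rubric="Title and description match the expected fix."),
]
calls = [
    {"tool": "get_crawl_report", "args": {}},
    {"tool": "update_meta_tags", "args": {"title": "Hiking Boots"}},
]
example_scores = grade_task(steps, calls, lambda rubric, actions: 0.5)
```

The binary programmatic branch reflects the "no partial credit" rule stated above; any partial scoring happens only on the LLM-graded side in this sketch.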
Realistic environment
Simulated e-commerce site with 34 SEO tools. No hints, just a goal.
Agents work on GreenLeaf Outdoors — a simulated outdoor gear e-commerce site with planted SEO issues across 15 data files. 34 tools (22 read, 12 write) mirror real SEO practitioner workflows. No multiple choice. No hints. Just a goal and an API.
Model performance overview
Percentage of tested tasks where the model scored 100% on all required steps.
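As a rough illustration of how this headline metric differs from average step accuracy, the sketch below computes both from per-step scores. The data layout (a mapping from task ID to a list of required-step scores in [0, 1]) and the sample values are assumptions made for the example, and averaging per task before averaging across tasks is one plausible reading of "average step accuracy", not a confirmed definition.

```python
from typing import Dict, List


def fully_solved_rate(results: Dict[str, List[float]]) -> float:
    """Percentage of tasks where every required step scored 100%."""
    if not results:
        return 0.0
    solved = sum(
        1 for scores in results.values()
        if scores and all(s == 1.0 for s in scores)
    )
    return 100.0 * solved / len(results)


def average_step_accuracy(results: Dict[str, List[float]]) -> float:
    """Mean over tasks of each task's mean step score, as a percentage."""
    per_task = [sum(s) / len(s) for s in results.values() if s]
    if not per_task:
        return 0.0
    return 100.0 * sum(per_task) / len(per_task)


# Made-up scores for three tasks (not benchmark data).
sample = {
    "task_a": [1.0, 1.0, 1.0],
    "task_b": [1.0, 0.5, 1.0, 1.0],  # one imperfect step, so not fully solved
    "task_c": [1.0, 1.0],
}
solved_pct = fully_solved_rate(sample)
step_pct = average_step_accuracy(sample)
```

A single imperfect step removes a task from the fully-solved count while barely moving its step accuracy, which is why the two numbers can diverge sharply on long tasks.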
Results analysis
These results suggest some SEO tasks can be fully automated, while longer-horizon work becomes substantially easier when pairing a human with AI.
Automate: Single-fix tasks and structured multi-step workflows — noindex tags, broken links, missing schema, toxic link disavows, crawl budget fixes, competitive analysis. AI completes these reliably with 90%+ accuracy, handling the kind of work that takes a specialist 15–60 minutes per task.
Co-pilot: Full-scope programs — site audits, migration planning, link building, governance setup. AI gathers the right data and drafts solid reports, but skips post-action validation and misses edge cases that require cross-referencing multiple data sources. These are the multi-day projects where AI handles 70–80% of the steps but a human needs to verify the output before it ships.
[Chart: Tasks fully solved (100% of required steps), by level: Level 1 Single-fix, Level 2 Multi-step, Level 3 Full-scope]
[Chart: Average step accuracy across tasks]