A Suite of Influencer Marketing Tasks

We benchmark AI agents on real influencer marketing workflows — creator discovery, audience vetting, competitive research, shortlist building, and content brief writing. 14 tasks across 3 difficulty levels, each graded step-by-step against expert rubrics.

Difficulty levels

Three levels: single-skill, multi-phase, and full campaign pipelines.

L1 — Single-skill tasks. One focus area — discovery, vetting, pricing, or research (~3-5 steps).

L2 — Multi-phase workflows that chain skills together — compare creators, build optimized shortlists, write grounded briefs (5-9 steps).

L3 — Full campaign pipelines: competitive research → discovery → vetting → shortlist → content briefs. The kind of work that takes a strategist days (19-22 steps).

Hybrid judge

Programmatic checks for tool usage + LLM rubrics for strategic quality.

Programmatic checks verify the agent called the right tools — demographics checked, videos watched, shortlist built within budget.

LLM-graded steps evaluate strategic quality — did the report identify fake engagement? Are content briefs grounded in research and tailored per creator? Each uses ground-truth archetypes hidden from the agent.
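The two judging layers described above can be sketched as follows. This is an illustrative sketch only: names like `programmatic_check`, `grade_task`, and the trace format are assumptions for the example, not the benchmark's actual API.

```python
# Hypothetical hybrid judge: deterministic tool-usage checks plus
# LLM rubric scores, combined into a single step-accuracy number.

def programmatic_check(trace, required_tools):
    """Pass/fail per required tool, based on calls found in the agent's trace."""
    called = {step["tool"] for step in trace}
    return {tool: tool in called for tool in required_tools}

def grade_task(trace, required_tools, rubric_scores, threshold=0.5):
    """Treat each tool check and each rubric score as one graded step."""
    checks = programmatic_check(trace, required_tools)
    steps = list(checks.values()) + [s >= threshold for s in rubric_scores]
    return sum(steps) / len(steps)

trace = [{"tool": "get_demographics"}, {"tool": "watch_video"}]
acc = grade_task(
    trace,
    ["get_demographics", "watch_video", "build_shortlist"],
    rubric_scores=[1.0, 0.0],  # e.g. "spotted fake engagement", "brief grounded"
)
# 2 of 3 tool checks pass, 1 of 2 rubric steps pass -> 3/5 = 0.6
```

In this sketch the programmatic layer is binary and cheap, while the rubric scores stand in for LLM-graded judgments against the hidden ground-truth archetypes.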

Realistic environment

Simulated TikTok ecosystem with 120 creators, 1,200 videos, and 21 tools.

Agents work on ForgeFit — a home workout app launching its first influencer campaign. 120 creators with hidden archetypes (good fit, bought followers, bot comments, wrong demographics), 1,200 videos, 250 competitor ads, and an adaptive feed that improves with relevant behavior. No hints. Just a brief and tools.
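A minimal sketch of what "hidden archetypes" means in practice. The field names and archetype labels here are assumptions based on the description above, not the environment's real schema.

```python
# Illustrative only: the agent queries public creator stats; the
# ground-truth archetype exists for grading but is never exposed.
from dataclasses import dataclass

@dataclass
class Creator:
    handle: str
    followers: int
    engagement_rate: float
    archetype: str  # hidden ground truth, used only by the judge

def agent_view(creator):
    """What the agent sees: public stats only, no archetype label."""
    return {
        "handle": creator.handle,
        "followers": creator.followers,
        "engagement_rate": creator.engagement_rate,
    }

c = Creator("@fitwithmaya", 220_000, 0.084, archetype="bought_followers")
assert "archetype" not in agent_view(c)
```

The agent has to infer fraud signals like bought followers from the visible stats and tool outputs, exactly because the label is withheld.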

Model performance overview

Average step accuracy across all tested tasks.

Sonnet 4.6 (Anthropic): 88%
GPT-5.3 (OpenAI): 70%
Opus 4.6 (Anthropic): Pending
Gemini 3 Pro (Google): Pending
Grok 4 (xAI): Pending

Results analysis

Influencer marketing sits at the intersection of data gathering and strategic judgment. The benchmark reveals which parts of this workflow are ready for automation and which still need a human in the loop.

Automate: Creator discovery, pricing checks, audience demographics, and competitor ad research. AI reliably searches, filters, and compiles structured data across dozens of creators in minutes — the kind of data-pull work that takes a strategist hours of manual platform browsing. Single-skill tasks score 80-100% across models.

Co-pilot: Vetting for authenticity, building budget-optimized shortlists, and writing creator-specific content briefs. AI handles 70-85% of multi-phase workflows but consistently misses fake engagement signals — bought followers, bot comments, and engagement pods slip through even when the agent is explicitly asked to look for them. Both models let at least one fraudulent creator onto their L3 campaign shortlists. These are the decisions where human review catches what AI misses before money is committed.

Not yet: End-to-end campaign execution without oversight. Full pipeline tasks (L3) require thorough vetting of 15-20+ creators, staying within budget, and producing research-grounded briefs — a level of sustained thoroughness and cross-referencing where current models drop to 44-75% accuracy. The gap between "finds the right creators" and "builds a campaign you'd actually ship" is where human strategists still earn their keep.

Average step accuracy by level

Level 1 (Single-skill)

Sonnet 4.6: 100%
GPT-5.3: 80%

Level 2 (Multi-phase)

Sonnet 4.6: 86%
GPT-5.3: 76%

Level 3 (Full pipeline)

Sonnet 4.6: 75%
GPT-5.3: 44%
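The per-level numbers above roll up from per-step grades. A hedged sketch of that aggregation, with hypothetical grade data:

```python
# Each task is a list of per-step pass/fail grades; a level's score is
# the mean step accuracy across its tasks.

def step_accuracy(grades):
    """Average of per-step grades for one task."""
    return sum(grades) / len(grades)

def level_average(tasks):
    """Mean step accuracy across all tasks at a difficulty level."""
    return sum(step_accuracy(g) for g in tasks) / len(tasks)

# Hypothetical: one task fully passed, one with a missed step.
l2_tasks = [[1, 1], [1, 0]]
# step accuracies 1.0 and 0.5 -> level average 0.75
```

Step-level grading is why a single missed vetting check (say, one fraudulent creator on a shortlist) dents the score without zeroing out an otherwise solid run.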