A Suite of Google Ads Optimization Tasks
We benchmark AI agents on real Google Ads optimization workflows. 21 tasks across campaign builds, account audits, bid strategy, and performance diagnosis — each graded step-by-step against expert-sourced rubrics.
Difficulty levels
Three levels: single-fix, multi-step, and full workflows.
L1 — Single-fix tasks. One problem, one diagnosis, one action (~5 steps).
L2 — Multi-step workflows with dependencies between findings (5–9 steps).
L3 — Full-scope work: account audits, campaign builds, the kind of task that takes a marketer hours (25+ steps).
Judge
Autonomous LLM grading: each step verified against preloaded correct answers.
Grading is fully autonomous — an LLM with the preloaded correct answer verifies each step against the agent's transcript.
Every step is either required (objectively necessary — broken tracking, wrong settings) or bonus (expert insight — dayparting, naming conventions). Models aren't penalized for taking a different valid approach on bonus steps.
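The required/bonus split can be sketched in code. This is a minimal illustration, not the benchmark's grader: the `Step` type, its field names, and both helper functions are hypothetical. It assumes a task's displayed score counts every step's points, while the "fully solved" check looks only at required steps. The sample steps are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One rubric step (hypothetical structure for illustration)."""
    description: str
    points: int
    required: bool  # False means the step is a bonus step
    passed: bool    # verdict from the LLM judge


def points_score(steps):
    """Return (points earned, points available) over all steps."""
    earned = sum(s.points for s in steps if s.passed)
    total = sum(s.points for s in steps)
    return earned, total


def required_fully_solved(steps):
    """True only if every required step passed; bonus steps are ignored."""
    return all(s.passed for s in steps if s.required)


# Made-up outcomes: both required steps pass, the bonus step is missed.
steps = [
    Step("Inspect audience configuration", 1, True, True),
    Step("Identify the observation-mode problem", 2, True, True),
    Step("Explain the bid modifier interaction", 2, False, False),
]
earned, total = points_score(steps)
solved = required_fully_solved(steps)
```

Under these assumptions a missed bonus step lowers the point total but does not stop the task from counting as fully solved.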
Realistic environment
Simulated Google Ads account with 28 tools. No hints, just a goal.
Agents interact with a simulated Google Ads account through 28 tools — the same data and actions a real practitioner would use. No multiple choice. No hints. Just a goal and an API.
Model performance overview
Percentage of tasks where the model scored 100% on all required steps.
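As a rough sketch of how this headline number could be computed: a task counts as solved only if every required step passed, and the rate is taken per difficulty level. The function name, input shape, and sample data are illustrative assumptions, not benchmark results.

```python
from collections import defaultdict


def fully_solved_rate(results):
    """results: iterable of (level, [pass/fail per required step]).

    Returns {level: percentage of tasks with all required steps passed}.
    """
    solved = defaultdict(int)
    count = defaultdict(int)
    for level, required_passed in results:
        count[level] += 1
        solved[level] += all(required_passed)  # True adds 1, False adds 0
    return {level: 100.0 * solved[level] / count[level] for level in count}


# Made-up data: two L1 tasks fully solved, one of two L2 tasks solved.
results = [
    ("L1", [True, True]),
    ("L1", [True, True, True]),
    ("L2", [True, False]),
    ("L2", [True, True]),
]
rates = fully_solved_rate(results)
```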
Results analysis
Models handle isolated diagnostics well — most L1 tasks are solved perfectly. But as task complexity scales, performance drops significantly. On Level 3 workflows that mirror a marketer's full job — multi-hour audits, building accounts from scratch — no model fully automates the work end-to-end. They can reliably handle a meaningful portion of the workflow, but still require human oversight on the highest-judgment steps.
[Figure: two charts broken down by level (Level 1 single-fix, Level 2 multi-step workflows, Level 3 full-scope work): tasks fully solved (100% of required steps) and average step accuracy across tasks.]
Single-fix tasks
One problem, one diagnosis, one action. The kind of thing a marketer knocks out in 15 minutes — but it still requires reading the right data and making the right call.
Fix Conversion Tracking
Sonnet: 5/5 (100%)
Must retrieve conversion data before any diagnosis is possible. (1 pt · Required)
The conversion tag is on /contact instead of the purchase confirmation page, overcounting conversions. This is the core fix. (2 pts · Required)
Current 7-day window misses conversions that take 11+ days to close, underreporting true performance. (2 pts · Required)
Fix Conversion Counting
Sonnet: 6/6 (100%)
Retrieve conversion setup to inspect counting model. (1 pt · Required)
Each page reload counts as a separate conversion, inflating numbers. (2 pts · Required)
Only count one conversion per click to get accurate purchase data. (2 pts · Required)
Must articulate how 'every' counting causes inflation, not just identify it. (1 pt · Required)
Add Negative Keywords
Sonnet: 8/8 (100%)
Need to see what queries are actually triggering ads before adding negatives. (1 pt · Required)
Must find >=3 of: informational, DIY, job-seeking, or post-purchase queries wasting spend. (2 pts · Required)
Negatives must actually be added to the account, not just recommended. (1 pt · Required)
At least 4 of 6 wasted query categories must be blocked by the negatives chosen. (2 pts · Required)
Must not block commercial terms like "furniture" or "sofa", only irrelevant traffic. (2 pts · Required)
Diagnose Bid Cap Throttling
Sonnet: 6/6 (100%)
Need impression share data to see where delivery is being lost. (1 pt · Required)
Need to see the bid cap setting that's causing the throttle. (1 pt · Required)
~75% IS lost to rank, not budget: the bid cap is preventing competitive bids. (2 pts · Required)
The $1.20 bid cap must be lifted to let the bidding strategy compete in auctions. (2 pts · Required)
Diagnose tROAS Throttling
Sonnet: 6/6 (100%)
Need to see actual ROAS vs target to diagnose the gap. (1 pt · Required)
Retrieve the tROAS target setting to compare against reality. (1 pt · Required)
Target is 1500% (15x) vs actual ~380% (3.8x), a 4x gap that throttles all delivery. (2 pts · Required)
Must bring target into realistic range so Smart Bidding can actually compete in auctions. (2 pts · Required)
Pause Poor Ads & Create RSAs
Sonnet: 8/8 (100%)
Need to review ad quality across the account, not just one group. (1 pt · Required)
ad_009 and ad_011 have "Poor" ad strength and must be specifically named. (2 pts · Required)
Both ad_009 and ad_011 must be paused, not just one. (2 pts · Required)
Can't just pause ads; replacements are needed to maintain ad coverage. (1 pt · Required)
Replacement RSAs must actually improve on the generic copy that made the originals "Poor." (2 pts · Required)
Reallocate Budget
Sonnet: 8/8 (100%)
Need comparative data across campaigns to identify the reallocation opportunity. (1 pt · Required)
54% IS lost to budget with 3.82 ROAS: the best campaign is leaving money on the table. (2 pts · Required)
2.09 ROAS with $8,600/month: the lowest-performing campaign with the most budget to give. (2 pts · Required)
Must actually move budget from camp_002 to camp_001, not just recommend it. (3 pts · Required)
Multi-step workflows
Findings depend on each other. The agent has to chain diagnostics — fixing one thing reveals the next problem, like a real account review. 5–9 steps.
tCPA Migration Readiness
Sonnet: 10/10 (100%)
Must verify conversion tracking is healthy before migrating to Smart Bidding. (1 pt · Required)
tCPA will optimize on wrong data if the tag is on the wrong page. (2 pts · Required)
Need conversion volume and CPA data to set an appropriate target. (1 pt · Required)
Sufficient volume + fixed tracking = green light for migration. (2 pts · Required)
Must actually make the change, not just recommend it. (2 pts · Required)
Setting target too far from reality causes learning failures or overspend. (2 pts · Required)
tCPA Readiness (Low Volume)
Sonnet: 4/7 (57%)
First check: is the conversion setup valid? (1 pt · Required)
Need to see conversion volume to assess tCPA readiness. (1 pt · Required)
Only 8 conversions in 90 days, far below the threshold for Smart Bidding to learn. (2 pts · Required)
Should suggest maximize_conversions or deferred migration with justification. (2 pts · Required)
The correct answer is to not migrate. Setting tCPA on low volume will destabilize the campaign. (1 pt · Required)
Fix Quality Score via Ad Relevance
Sonnet: 9/10 (90%)
Need Quality Score breakdown to find the underperforming ad groups. (1 pt · Required)
ag_010 and/or ag_012 have ad relevance dragging down Quality Score. (2 pts · Required)
Need to see the current ad copy to understand the mismatch. (1 pt · Required)
Keywords are category-specific but ads use generic copy. That's the root cause. (2 pts · Required)
Must take action, not just diagnose. (1 pt · Required)
Bedroom/kitchen-specific copy that matches the ad group's keyword theme. (3 pts · Required)
Keyword Expansion
Sonnet: 8/8 (100%)
Search terms reveal converting queries with no matching keyword. (1 pt · Required)
Converting search terms for accent chairs have no dedicated keywords. (2 pts · Required)
Confirm the gap: verify no existing accent chair keywords. (1 pt · Required)
Must add the keywords to the account, not just recommend them. (2 pts · Required)
Must be specific, not generic "furniture" keywords that would waste spend. (2 pts · Required)
Diagnose Performance Drop
Sonnet: 5/8 (63%)
Performance drops require checking what changed. This is the first diagnostic step. (1 pt · Required)
Need the performance timeline to correlate with changes. (1 pt · Required)
$18 → $28 → $18 in ~22 days: constant target changes prevent Smart Bidding from learning. (2 pts · Required)
Must explain the mechanism: target changes → learning reset → limited_learning → poor delivery. (2 pts · Required)
Fix requires committing to a stable target and letting the algorithm learn. (2 pts · Required)
Perf Drop (Budget Cut)
Sonnet: 2/7 (29%)
Need to see what changed to cause the performance drop. (1 pt · Required)
Correlate the performance timeline with the budget change. (1 pt · Required)
Budget went from $2,000 to $600. This is the cause, not a learning reset. (2 pts · Required)
Must increase budget back or reallocate from another campaign. (2 pts · Required)
Budget cut → fewer auctions → IS_lost_budget → CPA spike from lost efficiency. (1 pt · Required)
Launch New Campaign
Sonnet: 15/15 (100%)
Verify conversion tracking works before building a campaign around it. (1 pt · Required)
Find the unserved converting query theme to build around. (1 pt · Required)
Converting queries with no matching keywords = campaign opportunity. (2 pts · Required)
Must build the actual campaign structure in the account. (2 pts · Required)
Campaign needs at least one ad group to serve. (2 pts · Required)
Keywords must match the identified theme. (2 pts · Required)
Need at least one RSA to serve in the new ad group. (1 pt · Required)
manual_cpc or max_conversions, not tCPA/tROAS with no historical data. (2 pts · Required)
Ad copy must match the keyword theme, not be generic. (2 pts · Required)
Fix Attribution + Bid Cap
Sonnet: 11/12 (92%)
Need attribution settings to diagnose the window mismatch. (1 pt · Required)
Need actual conversion lag data to compare against the attribution window. (1 pt · Required)
7-day window vs 11.2-day average lag = systematically undercounting conversions. (3 pts · Required)
Must be wide enough to capture the full conversion cycle. (2 pts · Required)
Need impression share data to diagnose the bid cap issue. (1 pt · Required)
$1.20 bid cap with 75% IS lost to rank = bid cap is the bottleneck. (2 pts · Required)
Free the bidding strategy to compete in auctions. (2 pts · Required)
Fix Retargeting Audience Mode
Sonnet: 10/10 (100%)
Need to inspect audience configuration on the retargeting campaign. (1 pt · Required)
A dedicated retargeting campaign in observation mode serves to everyone, defeating the purpose. (2 pts · Required)
At least 2 of 3 retargeting audiences must be switched. (2 pts · Required)
Targeting mode changes delivery + bid_modifier=0 means no bid adjustment. (2 pts · Bonus)
Check performance data to assess impact of the misconfiguration. (1 pt · Required)
Past performance was measured on untargeted traffic, so it can't be used as a retargeting benchmark. (2 pts · Bonus)
Fix Brand Leakage
Sonnet: 10/10 (100%)
See what queries are triggering ads in the non-brand campaign. (1 pt · Required)
Brand searches appearing in non-brand campaign = leakage that skews metrics. (2 pts · Required)
Cross-campaign analysis to understand the full scope of leakage. (1 pt · Bonus)
Brand traffic inflates non-brand ROAS and distorts attribution between campaigns. (2 pts · Required)
Must block brand terms from triggering non-brand ads. (2 pts · Required)
Block "brightnest", not "furniture". Precision matters. (2 pts · Required)
Full-scope work
The tasks that take a marketer hours — full account audits, building an account from scratch, Shopping and Performance Max deep dives. 25+ graded steps each.
Shopping Campaign Audit
Sonnet: 34/46 (74%) · Opus: 30/46 (65%)
Shopping campaigns start with the feed, so Merchant Center must be checked first. (1 pt · Required)
67 price mismatches and a feed 6 days stale: products can't serve. (2 pts · Required)
Need to see product-level data quality issues. (1 pt · Required)
42% missing GTINs and 25% generic titles hurt match quality and eligibility. (2 pts · Required)
Generic titles → poor query matching → low CTR → wasted spend. (2 pts · Bonus)
Check how products are segmented for bidding. (1 pt · Required)
Undifferentiated group = same bid for high-margin and low-margin products. (3 pts · Required)
Must differentiate bids by performance tier. (2 pts · Required)
Higher-margin products get higher bids. (2 pts · Required)
Check what queries are triggering Shopping ads. (1 pt · Required)
At least 2 types of wasted query traffic in Shopping. (2 pts · Required)
At least 3 negative keywords targeting Shopping-specific wasted terms. (2 pts · Required)
Need performance metrics to assess bidding strategy. (1 pt · Required)
Target 3.0x vs actual 1.78x: the target is throttling delivery. (2 pts · Required)
48% IS lost to rank vs 7% to budget: a bidding issue, not budget. (2 pts · Bonus)
Lower tROAS or change strategy to improve delivery. (2 pts · Required)
Check if retargeting is layered on Shopping. (1 pt · Required)
No retargeting on Shopping, missing high-intent repeat visitors. (2 pts · Bonus)
Check Search campaign terms to find overlap with Shopping. (1 pt · Bonus)
Brand queries appearing in both Shopping and Search. (2 pts · Bonus)
Verify conversion setup is correct for Shopping. (1 pt · Required)
Correct ordering: fix the feed before tuning bids. (2 pts · Bonus)
Merchant Center vs Google Ads: different systems, different fix paths. (2 pts · Bonus)
Broken feed → wasted bids → bad IS → aggressive target. 3+ issues chained. (3 pts · Bonus)
After fixes, the old ROAS target is meaningless; it needs a new baseline. (2 pts · Required)
Feed-as-ad model creates failure modes that don't exist in Search. (2 pts · Bonus)
Build Account from Scratch
Sonnet: 21/46 (46%) · Opus: 41/46 (89%)
Account needs conversion tracking before anything else. (2 pts · Required)
Correct conversion settings from the start. (2 pts · Bonus)
Research keyword opportunities before building campaigns. (1 pt · Required)
Diversified keyword strategy with phrase/exact match. (2 pts · Required)
Broad match at launch wastes budget on an unproven account. (2 pts · Required)
Brand traffic needs its own campaign for attribution clarity. (2 pts · Required)
At least one campaign targeting category/product terms. (2 pts · Required)
Attribution isolation rationale: brand inflates non-brand metrics if mixed. (2 pts · Bonus)
maximize_clicks or max_conversions: no targets without conversion history. (2 pts · Required)
Campaign structure needs themed ad groups. (1 pt · Required)
Keywords added to the account with proper match types. (1 pt · Required)
Proactive negatives at launch to prevent wasted spend. (2 pts · Required)
RSA for the brand ad group. (2 pts · Required)
Both brand and non-brand need ad coverage. (2 pts · Required)
Brand RSAs have brand name; non-brand are category-specific. (2 pts · Required)
Extensions improve ad real estate and CTR. (1 pt · Required)
Additional extension types for comprehensive coverage. (1 pt · Required)
Why no bid modifier on audiences at launch. (2 pts · Bonus)
No tCPA/tROAS justified by zero conversion history. (2 pts · Bonus)
Brand vs non-brand split justified, ~$8K total. (2 pts · Required)
Recommend "2-3 conversions in 30-45 days" before switching. (2 pts · Bonus)
Double-check conversion tracking works at the end. (1 pt · Bonus)
No theme mismatch between keywords and ad copy. (2 pts · Required)
Structural hygiene: no keyword overlap between ad groups. (2 pts · Required)
Report covers campaigns, ad groups, budgets, bid strategy. (2 pts · Required)
Phased timeline for moving from launch to optimization. (2 pts · Bonus)
Performance Max Audit
Sonnet: 43/48 (90%)
Start by understanding the PMax campaign configuration. (1 pt · Required)
Verify conversion tracking for PMax optimization. (1 pt · Required)
Inspect PMax asset group configuration. (1 pt · Required)
ast_003, ast_004, ast_007: specific assets dragging performance. (2 pts · Required)
Get detailed performance data for assets. (1 pt · Required)
No video = can't serve on YouTube/Discover, losing cross-channel reach. (2 pts · Required)
Only 2 descriptions vs 4 recommended, which limits ad variation. (2 pts · Bonus)
Check audience signal configuration. (1 pt · Required)
Both first-party signal types are absent. (2 pts · Required)
Signals = starting point for PMax's targeting, not a hard filter. (3 pts · Required)
Take action to improve signal quality. (2 pts · Required)
PMax-specific reporting data. (1 pt · Required)
18% of impressions, 31% of conversions: brand inflating PMax metrics. (2 pts · Required)
PMax stealing brand conversions from Search, distorting attribution. (2 pts · Required)
Exclude brand terms from PMax to fix attribution. (2 pts · Required)
final_url_expansion=true sends traffic to wrong landing pages. (2 pts · Required)
Must actually turn it off. (2 pts · Required)
Wrong landing page → CVR drop → wasted spend. (2 pts · Required)
Check PMax campaign performance metrics. (1 pt · Required)
2.09x actual vs 3.0x target: quantify the gap. (2 pts · Required)
PMax serving on display-heavy channels without video assets. (2 pts · Bonus)
Brand cannibalization + URL expansion + weak signals interact and compound. (3 pts · Bonus)
Correct fix ordering among all the issues found. (2 pts · Bonus)
Less control = more upstream configuration required. (2 pts · Bonus)
Report covers 4+ issue categories with supporting evidence. (2 pts · Required)
30-day re-baseline period after implementing fixes. (2 pts · Bonus)
Agent should track findings as it goes, as evidence of structured work. (1 pt · Required)
Full Account Audit
Sonnet: 52/80 (65%) · Opus: 55/80 (69%)
Phase 1: Tracking (6 steps, 12 pts)
Tag on /contact overcounts purchases. Name the consequence. (2 pts · Required)
Move conversion tracking to the correct page. (2 pts · Required)
Must use >= 30 day date range to see the full picture. (2 pts · Required)
7-day window vs 11.2-day lag = missing conversions. (2 pts · Required)
Widen window to capture full conversion cycle. (2 pts · Required)
PMax/tCPA have been optimizing on wrong data for 8 months. (2 pts · Required)
Phase 2: Campaign Settings (12 steps, 24 pts)
getCampaignDetails for camp_001. (2 pts · Required)
camp_003 is in limited_learning status. (2 pts · Required)
camp_003 performance data has a quality flag, which must be acknowledged. (2 pts · Required)
Best campaign is budget-constrained. (2 pts · Required)
PMax underperforming vs Brand Search. (2 pts · Required)
Both IS loss and ROAS gap must be named as joint cause. (2 pts · Required)
Move $1K-$4K from camp_002 to camp_001. (2 pts · Required)
~9 conv/month, not enough for Smart Bidding. (2 pts · Required)
Appropriate for low-volume campaign. (2 pts · Bonus)
For dayparting analysis. (2 pts · Bonus)
Interpret hourly performance patterns. (2 pts · Bonus)
Specific, data-backed schedule change. (2 pts · Bonus)
Phase 3: Ad Group & Ad Review (4 steps, 8 pts)
Brand Search ad group has 0 RSAs. (2 pts · Required)
Check >= 2 non-brand campaigns for coverage. (2 pts · Required)
Ad copy quality issue: headlines are too generic. (2 pts · Required)
New RSA on ag_001 with brand-relevant copy. (2 pts · Required)
Phase 4: Keyword Review (4 steps, 8 pts)
Brand keywords are all broad, with no exact or phrase match. (2 pts · Required)
Broad match on brand terms bleeds into non-brand queries. (2 pts · Required)
Must add tighter match types to ag_001. (2 pts · Required)
Broad match inflates CPCs on a budget-constrained campaign. (2 pts · Required)
Phase 5: Audience Review (4 steps, 8 pts)
camp_005 retargeting audiences are in observation mode. (2 pts · Required)
Observation mode means ads serve to everyone, not just retargeting list. (2 pts · Required)
aud_001, aud_002, aud_003 must all be updated. (2 pts · Required)
Historical metrics are meaningless: they measured untargeted traffic. (2 pts · Required)
Phase 6: Extensions Review (3 steps, 6 pts)
Account-level extensions are sparse. (2 pts · Required)
Rotation mechanism and ad real estate benefits. (2 pts · Bonus)
New callout extensions at account level. (2 pts · Required)
Phase 7: Change History (2 steps, 4 pts)
Scope getChangeHistory to camp_003. (2 pts · Required)
$28→$18 caused learning reset. Must explain increment guideline. (2 pts · Required)
Phase 8: Naming Convention (2 steps, 4 pts)
Campaign names follow no convention. (2 pts · Bonus)
Apply a consistent naming convention. (2 pts · Bonus)
Phase 9: Report Quality (3 steps, 6 pts)
Completeness: nothing major missed. (2 pts · Required)
After tracking/attribution fixes, historical data needs re-baselining. (2 pts · Required)
Report must be internally consistent, with no made-up issues or fixes. (2 pts · Required)