Testing — CQFE
Before Your Agent Touches a Real Job
What You'll Learn
- ✓ The 5-Test Protocol
- ✓ Test 1: Standard (baseline)
- ✓ Test 2: Minimal (escalation)
- ✓ Test 3: Detailed (instruction-following)
- ✓ Test 4: Edge case (scope)
- ✓ Test 5: Adversarial (stress test)
- ✓ The CQFE Rubric
- ✓ The Prompt Improvement Log
- ✓ Go/No-Go decision framework
Picture This: Your Agent Just Failed Its First Real Test
Testing is not optional. It is the difference between an agent that is ready to go live and one that will embarrass you in front of a client.
A bad output reaching a client is not just an embarrassment — it is a refund request, a negative review, and lost future work. Testing is your insurance policy.
The 5-Test Protocol
Walk through the complete quality assurance protocol.
You will run five tests on your agent before going live: the Standard Brief, the Minimal Brief, the Detailed Brief, the Edge Case, and the Adversarial Brief.
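The protocol above can be sketched as a small harness that feeds each brief to your agent and collects the outputs for manual CQFE scoring. This is a minimal sketch: `run_agent` is a placeholder for however you actually invoke your agent, and the brief texts are abbreviated stand-ins for the full briefs described in this module.

```python
def run_agent(brief: str) -> str:
    """Placeholder: replace with the call that invokes your agent."""
    raise NotImplementedError

# Abbreviated stand-ins for the five briefs described in this module.
TEST_BRIEFS = {
    "standard": "A clear, complete, typical brief with standard requirements.",
    "minimal": "Can you write something about our new product launch? Thanks.",
    "detailed": "An exhaustive brief with documents, complex requirements, and edge cases.",
    "edge_case": "A brief at the boundary of the agent's scope.",
    "adversarial": "A vague brief with conflicting requirements and missing information.",
}

def run_protocol(agent=run_agent) -> dict[str, str]:
    """Run all five test briefs and return outputs keyed by test name."""
    results = {}
    for name, brief in TEST_BRIEFS.items():
        results[name] = agent(brief)  # score each output against the CQFE rubric
    return results
```

Swap in your real agent call, then score each of the five outputs by hand; the harness only automates the running, not the judging.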
Test 1: The Standard Brief
A clear, complete, typical brief. No tricks, no edge cases, no missing information. This is the kind of request that represents the core 80 percent of the work your agent will handle.
This is your baseline test. Provide a normal project brief with standard requirements, a couple of attached documents, and a one-week deadline.
Test 2: The Minimal Brief
A brief with the bare minimum of information. Vague, short, open to interpretation. This is what real clients sometimes send — two sentences and an expectation that you will figure out the rest.
What you are testing: does your agent know when to ask for clarification? Or does it guess and produce something that might be completely off-target?
What to look for: the agent should trigger its escalation behaviour. It should identify what is missing, summarise what it does understand, and ask specific clarifying questions before proceeding. If it produces a full output from a vague brief without asking anything, your escalation section needs work.
Example minimal brief: "Can you write something about our new product launch? Thanks." No product details. No audience. No tone guidance. No word count.
A well-trained agent should recognise the gaps and ask. A poorly trained one will write 500 words of generic product launch content that helps no one.
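When reviewing Test 2 outputs, a rough heuristic can flag responses that look like finished deliverables rather than clarifying questions. This is only a sketch for triage, not a substitute for reading the output; the thresholds are assumptions you should tune to your service.

```python
def looks_like_escalation(output: str) -> bool:
    """Rough check: a good response to a vague brief asks questions
    and stays short, rather than delivering a finished piece."""
    asks_questions = output.count("?") >= 2        # at least two clarifying questions
    is_short = len(output.split()) < 150           # a finished article would be longer
    return asks_questions and is_short

print(looks_like_escalation("What is the product? Who is the audience?"))  # True
```

An output that fails this check is not automatically wrong, but it deserves a closer look before you score its Escalation dimension.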
Test 3: The Detailed Brief
This test gives the agent maximum information. Provide exhaustive detail — long briefs, multiple documents, complex requirements, edge cases mentioned upfront.
Test 4: The Edge Case
A brief that sits at the boundary of your agent's scope. It is almost in range, but requires a judgment call.
Maybe it is adjacent to your service but not quite what you offer. Maybe it involves a topic that is borderline sensitive. Maybe the client is asking for something that conflicts with one of your rules.
What you are testing: Does the agent apply scope boundaries appropriately? Or does it accept everything uncritically? The edge case test reveals whether your role section and escalation triggers are doing their jobs.
What to look for: The agent should either handle the request with appropriate caution and caveats, or it should escalate — explaining why this sits outside its scope and offering an alternative. What it should not do is pretend the edge case is a standard case and barrel ahead.
Example edge case (for a content writing agent): "I need you to write a technical whitepaper on Kubernetes container orchestration for our DevOps team. It should include architecture diagrams and code examples. About 3,000 words." If your agent's role is content writer for small business owners with a non-technical audience, this brief is clearly outside scope.
The correct response is to flag that this requires deep technical expertise that falls outside the service, and to suggest the client seek a technical writer.
Test 5: The Adversarial Brief
The stress test. A brief that is vague, contains conflicting requirements, or is missing critical information in a way that could lead the agent seriously astray. This is not about being unfair to your agent — it is about finding out what happens when things go wrong, because in real work, things will go wrong.
What you are testing: does the full escalation protocol kick in? Does the agent handle confusion gracefully? Does it avoid producing confidently wrong output?
What to look for: the agent should pause, acknowledge the problems, and ask for resolution before proceeding.
The worst possible outcome on this test is an agent that produces a full, confident output based on wrong assumptions.
Example adversarial brief: "Write a blog post about our product. Make it professional but also fun and edgy. Keep it under 300 words but make sure to cover all the features comprehensively. Don't include any technical details but do explain how the technology works. Urgent — need this ASAP."
This brief combines conflicting requirements (comprehensive but short; no technical details but explain the technology; professional but edgy), missing information (no product specified, no audience, no features listed), and pressure language designed to push the agent toward rushing.
A well-trained agent will identify the contradictions and missing information and ask for clarification. A poorly trained one will produce 300 words of vague, contradictory nonsense.
The Minimum Bar: 3-3-3-3
Before going live, your agent should score 3-3-3-3 on the standard brief (Test 1). That is the minimum. If any dimension is consistently below 3 on the straightforward case, fix it before testing the harder scenarios.
For Tests 2-5, the scoring is more nuanced. A perfect score on the adversarial test is less important than a perfect score on the standard test — but if your agent scores 1 on Escalation for the minimal brief, that is a serious problem that needs fixing before it encounters a real vague client request.
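The 3-3-3-3 gate is simple enough to state as code. In this minimal sketch, the four scores are one per CQFE dimension; the list layout is an assumption for illustration.

```python
PASS_THRESHOLD = 3  # the minimum bar on every CQFE dimension for Test 1

def meets_minimum_bar(scores: list[int]) -> bool:
    """Return True only if every CQFE dimension scores at least 3."""
    return all(score >= PASS_THRESHOLD for score in scores)

print(meets_minimum_bar([3, 4, 3, 3]))  # True: ready to run Tests 2-5
print(meets_minimum_bar([3, 2, 4, 3]))  # False: fix the weak dimension first
```

Any single dimension below 3 on the standard brief fails the gate, no matter how strong the others are.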
Red Flags: What Bad Outputs Are Telling You
When test outputs are not right, the specific way they fail tells you exactly what to fix.
Generic outputs that could have been written for anyone indicate your examples are insufficient or your role section is too vague. The agent is defaulting to average because it has no clear picture of what your specific version of good looks like. Fix: strengthen your role definition and add more specific, higher-quality examples.
Inconsistent quality across similar briefs means your process instructions are not specific enough. The agent is improvising a different approach each time instead of following a reliable method. Fix: add more explicit step-by-step guidance to your Instructions section.
Outputs that miss specific requirements from the brief indicate the agent is not reading the brief carefully enough, or it is losing track of requirements when there are many. Fix: add a stronger intake step, such as "Before starting, confirm you have, and will address, every required element from the brief."
Outputs that are consistently too long or too short mean your length guidance is unclear or missing. "Appropriate length" is too vague to act on as a standard. Fix: specify exact word counts or ranges, such as 600-800 words for a standard blog post unless the brief specifies otherwise.
If the agent never escalates, even on vague briefs, your escalation triggers are missing or too vague. Fix: define specific conditions with specific responses, such as "If the brief does not include X, ask for X before proceeding."
If the agent escalates too often, even on clear briefs, your escalation triggers are too broad. The agent is treating normal requests as ambiguous. Fix: tighten your escalation conditions. Make sure each trigger is specific to genuinely problematic situations, not general uncertainty.
Outputs that sound robotic or use banned words suggest your word bans or style rules are not in the system prompt, or the agent is not weighting them. Fix: check that your Dos and Don'ts are present and specific. If you banned "delve" and it still appears, make the rule more prominent — sometimes moving rules higher in the prompt increases their weight.
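The red flags above condense into a triage table. This sketch paraphrases the fixes from this section; the key names are assumptions chosen for readability.

```python
# Failure mode -> paraphrased fix, per the red-flags section above.
RED_FLAGS = {
    "generic_output": "Strengthen the role definition; add specific, higher-quality examples.",
    "inconsistent_quality": "Add explicit step-by-step guidance to the Instructions section.",
    "missed_requirements": "Add an intake step that confirms a checklist of required elements.",
    "wrong_length": "Specify exact word counts or ranges.",
    "never_escalates": "Define specific escalation conditions with specific responses.",
    "escalates_too_often": "Tighten escalation triggers to genuinely problematic situations.",
    "robotic_or_banned_words": "Make Dos and Don'ts explicit; move key rules higher in the prompt.",
}

def suggested_fix(flag: str) -> str:
    """Look up the fix for an observed failure mode."""
    return RED_FLAGS.get(flag, "Unrecognised flag: review the rubric manually.")
```

During a test review, name the failure mode first; the fix usually follows directly from it.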
The Prompt Improvement Log
Every test teaches you something. The improvement log captures those lessons so they compound over time instead of being forgotten.
It takes about five minutes per test session to maintain. That investment pays for itself many times over.
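One way to keep entries consistent is a fixed record shape. The exact fields here are an assumption based on the workflow this module describes: what failed, why, what you changed, and whether a retest confirmed the fix.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class LogEntry:
    """One improvement-log entry per observed failure."""
    test_name: str            # e.g. "minimal brief"
    observation: str          # what the output got wrong
    diagnosis: str            # why the prompt produced that failure
    fix: str                  # the specific change made to the prompt
    verified: bool = False    # set True only after a retest confirms the fix
    logged_on: date = field(default_factory=date.today)
```

The `verified` flag matters most: a fix you never retested is a guess, not a lesson.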
Using the Log: Diagnosis and Patterns
Notice the diagnosis in that example. The problem was not that the rules were wrong — they explicitly banned the phrase. The problem was that banning something without showing an alternative left the agent guessing. The fix was not to add more rules. It was to add a better example.
This kind of precision is what separates systematic improvement from random tinkering. The log forces you to diagnose specifically, act precisely, and verify that the fix actually worked.
After ten or twenty entries, patterns emerge. Maybe your Quality score is consistently lower than your Completeness score — which tells you the agent follows instructions but produces mediocre writing. That is an examples problem, not a rules problem. Maybe your Escalation score drops on every edge case — which means your escalation triggers are too narrow and do not cover the boundary situations your service encounters. These patterns are invisible without a log. With a log, they become obvious and actionable.
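The pattern analysis above can be sketched as a per-dimension average across logged sessions. The data shape, and the dimension names beyond those this module mentions, are assumptions for illustration.

```python
from statistics import mean

def dimension_averages(sessions: list[dict[str, int]]) -> dict[str, float]:
    """Average each rubric dimension across all logged test sessions."""
    dims = sessions[0].keys()
    return {d: mean(s[d] for s in sessions) for d in dims}

# Example: Quality consistently lags Completeness across three sessions.
sessions = [
    {"Completeness": 4, "Quality": 2, "Escalation": 3},
    {"Completeness": 4, "Quality": 3, "Escalation": 3},
    {"Completeness": 3, "Quality": 2, "Escalation": 4},
]
print(dimension_averages(sessions))
```

A persistently low Quality average alongside a high Completeness average is the examples problem described above: the agent follows instructions but writes mediocre prose.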
The Go/No-Go Decision
Interactive: Readiness Assessment for Going Live