
Replacing the research pile with a briefing agent

Turning a 4-hour research scramble into an AI-native workflow

I had wanted to build this for District 3 since the commercial AI wave kicked off in mid-2022. The admissions workflow was exactly the kind of process that made me think "this wouldn't exist like this if built today," but I never had the right opening to pursue it. Wealthsimple's AI Builder prompt finally gave me one.

My Role: Designer & Developer
Stack: Claude Agent SDK, FastAPI, HTMX
Built For: Wealthsimple AI Builder Application
Cost Per Brief: ~$2.50 CAD
Time Per Brief: ~18 minutes (vs. 3–4 hours manual)
Opening page of a generated Briefbot admissions brief shown as a document screenshot

output brief

The Bottleneck

District 3 is a publicly funded startup incubator at Concordia University in Montreal, processing roughly 200 to 250 applications per month. When an application lands, an operations team member spends one to four hours on background research: scouring LinkedIn, checking founder websites, mapping the competitive landscape, looking for patents and press. They write hasty bullet-point notes and assign a program stream. Stream leads walk into panel meetings and spend their first ten minutes just orienting themselves.

The research is the bottleneck. It's tedious and inconsistent. Two people reviewing the same application can produce dramatically different notes. The quality of the research dictates the quality of the panel's decision, yet it's treated as rote admin work.

This is the kind of process that evolved before modern AI. It wouldn't exist like this if built today.

Grid of post-it notes explaining District 3's structure, mandate, and operations

understanding D3

Grid of post-it notes cataloguing issues with the current admissions research system

what's broken

Grid of five hexagons showing wants and fears for each stakeholder group: Wealthsimple, Stream and Program Leads, Startups and Founders, Ops Staff, and D3 As an Org

stakeholder mapping

Venn diagram showing what AI can address versus what must remain human, with the current system issues mapped across both zones

drawing the boundary

Tall swimlane diagram of the current pre-AI admissions research process showing handoffs, delays, and manual steps

the existing process

Before

  • Application sits for days
  • 1–4 hours of manual research
  • Hasty, inconsistent bullet-point notes
  • Stream leads walk in cold
  • Quality depends on who's on shift

After

  • Application triggers agent immediately
  • ~18 minutes, ~$2.50 CAD in API costs
  • 9-section cited brief with risk flags
  • Stream leads arrive informed
  • Consistent quality, every time

What I Built

I built an AI system that eliminates the research bottleneck. When a new application is submitted, an agent autonomously generates a research brief: founder profiles with verified sources, competitive analysis, SDG alignment assessment, stream classification with reasoning, a scored evaluation rubric, risk flags, and interview questions for both operations staff and panelists, all cited to real URLs or application fields.

The human can now open an application that's already deeply researched. Operations shifts from doing research to reviewing research. Stream leads arrive at panel meetings with context instead of spending their first ten minutes catching up.

The Brief: Nine Sections, All Cited

Every brief the agent produces contains nine sections, each with mandatory citations to URLs, application fields, or knowledge base documents:

Synthesis: what the startup does, overall confidence score, and a plain-language recommendation
Founder Profiles: per-founder background research with credibility signals and gaps
SDG Coherence: assessment of whether claimed UN Sustainable Development Goals actually match the work
Competitive Context: comparable ventures, market landscape, and differentiation analysis
Evaluation Scorecard: each rubric criterion scored with justification and confidence level
Stream Classification: best-fit D3 program stream and stage with reasoning
Key Risks: red flags, gaps, and concerns ranked by severity
Questions for Ops: gap-based questions for the operations team to investigate before the interview
Questions for Panelists: evaluation-based questions to probe during the interview itself
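The "nine sections, all cited" contract is easy to express in code. This is a minimal sketch of how that contract could be represented and enforced; the class and field names are illustrative, not the actual implementation.

```python
from dataclasses import dataclass, field

# The nine required sections, as listed above (identifiers are illustrative).
SECTIONS = [
    "synthesis", "founder_profiles", "sdg_coherence", "competitive_context",
    "evaluation_scorecard", "stream_classification", "key_risks",
    "questions_for_ops", "questions_for_panelists",
]

@dataclass
class BriefSection:
    name: str
    body: str
    citations: list[str] = field(default_factory=list)  # URLs, app fields, or KB docs

    def is_valid(self) -> bool:
        # Mandatory-citation rule: a section with no sources is not publishable.
        return bool(self.body) and len(self.citations) > 0

def brief_complete(sections: dict[str, BriefSection]) -> bool:
    """A brief counts as 'done' only when all nine sections exist and each is cited."""
    return all(name in sections and sections[name].is_valid() for name in SECTIONS)
```

Encoding "done" as data rather than prompt text is what lets the agent check its own completeness before handing the brief to a human.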

Design Philosophy: Sandbox, Not Script

Version 1 of this system was a scripted pipeline: a chain of prompts with predetermined steps and retry logic. It broke constantly. If a LinkedIn page was down or a founder didn't have a website, the whole chain derailed. Wrong output was worse than no output, because it created more work for the ops team to correct.

For version 2, I replaced choreography with agentic design. The agent isn't given a task list. It's given an environment: knowledge to reference, tools to use, a goal to achieve, and guidelines for quality. Within that environment, the agent decides what to research, in what order, and how deep to go.

Agent Sandbox as a diamond in the center with arrows pointing to four boxes: Knowledge, Tools, Guidelines, and Goal, each with detailed bullet points

the sandbox

Knowledge

A /knowledge folder the agent reads at the start of every run, containing D3's mandate, evaluation rubric, stream definitions, and SDG framework. This is what makes it D3's agent, not a generic research bot.

Tools

Thirteen tools the agent can call in any order, from web research with multi-strategy fallback to self-assessment checkpoints, brief section emitters, human review flags, mid-run human input requests, and working memory for notes and research plans.

Goal

Produce nine brief sections, all cited. The agent knows what "done" looks like but has full autonomy over how to get there.

Guidelines

A quality bar, not step-by-step instructions. Every factual claim needs a citation. Confidence thresholds determine whether to proceed, retry, or flag for human review. And critically: zero revisions is a sign of a first-pass report.
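The four parts of the sandbox can be summarised in a single configuration object. This is a hedged sketch of that shape only; the names, tool list, and threshold values are illustrative stand-ins, not the real config.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class AgentSandbox:
    """The agent gets an environment, not a task list (illustrative names)."""
    knowledge_dir: Path           # /knowledge: mandate, rubric, streams, SDG framework
    tools: list[str]              # the 13 callable tools, usable in any order
    goal: str                     # a target state, not a list of steps
    guidelines: dict[str, float]  # quality thresholds, not instructions

SANDBOX = AgentSandbox(
    knowledge_dir=Path("knowledge"),
    tools=[
        "web_research", "self_assess", "emit_section",
        "flag_for_review", "ask_human", "working_memory",
        # ...remaining tools omitted in this sketch
    ],
    goal="Produce nine brief sections, every factual claim cited.",
    guidelines={"proceed_confidence": 0.7, "retry_confidence": 0.4},
)
```

Nothing in this object tells the agent what order to work in; it only defines the walls of the sandbox and what "good" means inside them.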

How the Agent Works

Tall swimlane diagram of the agentic research flow with a web research system breakdown as an offshoot, and lanes for the human observer and human reviewer

agentic flow

The agent operates in a loop of research, assessment, and output. It loads D3's knowledge base, reads the application, creates a research plan, then executes that plan phase by phase. After each phase, it self-assesses its confidence and decides whether to proceed, retry with different sources, or flag for human review.

Brief sections are emitted incrementally as the agent completes them, not batched at the end. This means the brief builds in real time, and observers can watch the research unfold through a live server-sent event stream.
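The per-phase loop of research, self-assessment, and incremental emission can be sketched in a few lines. This is a simplified model under assumed confidence thresholds; `research`, `emit`, and `flag` are hypothetical callables standing in for the real tools.

```python
from dataclasses import dataclass

# Thresholds are illustrative; the real values live in the agent's guidelines.
PROCEED, RETRY_FLOOR = 0.7, 0.4

@dataclass
class PhaseResult:
    findings: str
    confidence: float  # the agent's own sufficiency estimate, 0..1

def run_phase(phase: str, research, emit, flag, max_retries: int = 2) -> None:
    """Research -> self-assess -> proceed, retry with different sources, or flag."""
    for attempt in range(max_retries + 1):
        result: PhaseResult = research(phase, attempt)
        if result.confidence >= PROCEED:
            emit(phase, result.findings)      # sections stream out incrementally
            return
        if result.confidence < RETRY_FLOOR:
            break                             # retrying is unlikely to help
    flag(phase, "low confidence after retries")  # honest gap, no fabrication
```

The important property is the third branch: when confidence stays low, the phase ends in a flag rather than a confident-sounding guess.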

The Backtracking Loop

The key design insight. After each self-assessment, the agent reviews all previously emitted sections and asks: "Does anything I've already published need updating in light of what I just learned?" If the answer is yes, it revises the section and logs the reason. This is how the system produces an honest, non-linear brief rather than a first-pass report.

One Real Run

The useful artifact is not hidden chain-of-thought. It's the audit trail: what the agent tried, what failed, when it asked for help, and what brief it ultimately produced.

live agent log

This is the literal run log: fetch failures, escalations, self-assessments, emitted sections, and final review flags.

Opening page of a generated Briefbot admissions brief shown as a document screenshot

brief preview

Nine cited sections, ranked flags, and a cleaner starting point for the ops team and panelists.

Download full brief PDF · Download raw log

Research Resilience: The Fetch Cascade

When the agent needs to research a URL, it doesn't just fetch and hope. It runs through a five-strategy cascade:

1. GitHub API: for github.com URLs, returns structured profile data (repos, stars, languages, bio)

2. Direct HTTP Fetch: with browser headers, HTML extraction, meta tags, Next.js SSR/RSC data extraction, and structured navigation links

3. Agent-Driven Exploration: the tool returns navigation links from the page; the agent decides which sub-pages are relevant to its research question and fetches those

4. Jina Reader: headless browser rendering for JavaScript-heavy SPAs that don't serve content in the initial HTML

5. Wayback Machine: CDX API lookup for dead or blocked sites; fetches the most recent archived snapshot

If all five strategies fail, the agent flags the gap honestly and moves on. It never fabricates. Even the research tool follows the sandbox pattern: the tool provides information (page content plus navigation links), the agent provides judgment (which links are worth following).

Real example from test run

LinkedIn returned a 429 rate-limit block for one founder. The agent fell back to the founder's personal website, ZoomInfo via web search, and GitHub API. It still assembled a comprehensive profile. Then it flagged the LinkedIn gap honestly in the brief so the ops team knew what wasn't checked.

Human-in-the-Loop: Two Touchpoints

The system has two distinct moments where humans interact with the agent's work, each designed for a different purpose.

During the Run: Phone-a-Friend

When the agent hits genuine ambiguity that would change its research direction, it can pause and ask a human observer a question in real time. The question appears in the live agent log, and the observer types a response. The agent wakes up and continues with the new context.

If no human responds within five minutes, the agent times out, flags the gap honestly, and proceeds without fabricating an answer.

After the Run: Reviewer Context

When the brief is complete, flagged items are ranked by severity at the top. A reviewer can provide corrections, missing context, or policy clarification through a form. A focused mini-agent then rewrites the affected sections incorporating the new information.

The brief evolves in place. Flags are marked "resolved," and the reviewer can continue adding context even after all flags are addressed.
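The flag lifecycle is small enough to sketch directly. Assumed shapes only: `Flag` is an illustrative record, and `rewrite` stands in for the focused mini-agent that regenerates the affected section.

```python
from dataclasses import dataclass

@dataclass
class Flag:
    section: str
    severity: int            # higher = more urgent; flags are ranked at the top
    note: str
    resolved: bool = False

def apply_reviewer_context(flags: list[Flag], section: str, context: str,
                           rewrite) -> list[Flag]:
    """A focused mini-agent rewrites only the affected section, then the
    flags on that section are marked resolved and re-ranked."""
    rewrite(section, context)                 # regenerate just this section
    for f in flags:
        if f.section == section:
            f.resolved = True
    # Unresolved flags first, then by descending severity.
    return sorted(flags, key=lambda f: (f.resolved, -f.severity))
```

Scoping the rewrite to one section keeps the rest of the brief, and its citations, untouched by each round of reviewer input.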

Real example from test run

The agent discovered that one of the applicants appeared to be a current employee of District 3, the very incubator being applied to (Can you guess who it was researching? 😉). It found this by cross-referencing the founder's portfolio site, schema.org metadata, and ZoomInfo data. Rather than guessing how to treat this, it paused and asked the human observer: "Should I treat this as a conflict of interest, as neutral context, or as something else?" The observer clarified that the founder had left D3 in January. The agent updated its analysis accordingly.

Where AI Stops

The AI handles all research, analysis, cross-referencing, and structured output. It is explicitly not responsible for any decision or communication. It cannot accept or reject an applicant, send an email, or route a startup to a stream lead.

The tools for these actions don't exist. This isn't a policy restriction that could be overridden by clever prompting. It's an architectural boundary. The agent can't cross the line because the line is a wall.
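The "line is a wall" idea reduces to a registry that simply has no decision or communication tools in it. A minimal sketch with hypothetical tool names:

```python
# The boundary is structural: only research/output tools are ever registered.
# There is no "send_email", "accept_applicant", or "route_to_stream" to call.
RESEARCH_TOOLS = {"web_research", "emit_section", "flag_for_review", "ask_human"}

def call_tool(name: str, registry=RESEARCH_TOOLS) -> str:
    """Dispatch a tool call. No prompt, however clever, can invoke a tool
    that was never registered."""
    if name not in registry:
        raise PermissionError(f"no such tool: {name}")
    return f"ran {name}"
```

This is the difference between a policy ("please don't email founders") and an architecture (emailing founders is not a thing the agent can do).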

AI is responsible for

  • All background research and fact-finding
  • Competitive landscape analysis
  • SDG alignment assessment
  • Stream classification with reasoning
  • Rubric scoring with justification
  • Risk identification and flagging
  • Generating interview questions
  • Self-assessing its own work quality
  • Honestly flagging what it can't resolve

Humans are responsible for

  • All accept/reject decisions
  • All outbound emails to founders
  • The pitch meeting and deliberation
  • Final stream assignment confirmation
  • Whether to act on flagged items
  • Reviewing and approving communication
Two boxes with grids of post-it notes: AI is responsible for on the left, Human is responsible for on the right

the wall

The Critical Decision That Must Remain Human

Accept/reject after the pitch meeting. This is where institutional judgment, founder rapport, and cohort composition mix in ways that can't be replicated by an AI, no matter how clever the prompting. The brief gets the panel 90% of the way there. The last 10% is theirs.

Emergent Behaviour: What Nobody Programmed

Because the agent has tools, knowledge, and the freedom to reason about what it finds, it catches things that no scripted system would. Two examples from the test run:

Undisclosed Insider Connection

The agent discovered that one founder was a former D3 employee by cross-referencing their portfolio site's schema.org metadata, ZoomInfo results, and LinkedIn activity, then flagged the undisclosed relationship. Nobody programmed a "check if the applicant works at D3" step. The agent found it because it had the right tools and context to reason about what it was seeing.

Project Dormancy Detection

The agent pieced together from both founders' profiles, GitHub commit history, LinkedIn project dates, DNS lookup failures, and App Store searches that the project was likely dormant despite claims of a "working prototype used daily." Five independent data points, cross-referenced into a single finding.

These are the kinds of findings that justify the agentic approach. A scripted pipeline would check predetermined sources in a predetermined order. The sandbox lets the agent follow leads, cross-reference across sources, and surface patterns that emerge from the data.

What Changed: V1 to V2

Super tall screenshot of the V1 scripted pipeline output with handwritten notes scribbled on it identifying problems

V1 post-mortem

Grid of post-it notes cataloguing all findings and lessons learned from the V1 scripted pipeline

V1 findings

V1: Scripted Pipeline

  • Gemini + Tavily, built with Codex
  • Linear steps with retry logic
  • Predetermined research order
  • Broke when reality didn't match the script
  • Wrong output worse than no output

V2: Sandbox Design

  • Claude Agent SDK, custom MCP server
  • Agent reasons about approach dynamically
  • Self-assessment after each research phase
  • Backtracking when new findings contradict earlier work
  • Human-in-the-loop at two touchpoints

Key V1 Lessons That Shaped V2

Wrong output is worse than no output

Fabrication creates more work than gaps. The system must be honest about what it doesn't know.

Quality matters more than task completion

The agent needs a target state, not just a task list. Finishing all steps poorly is worse than flagging three steps as unresolvable.

Self-review loops are essential

Assess sufficiency before proceeding. If confidence is low, retry. If still low, flag for human review.

Domain knowledge is the differentiator

The knowledge base is what separates a useful agent from a generic research bot. Without D3's rubric and stream definitions, the agent can't make meaningful assessments.

You can't scrape LinkedIn

V1 relied heavily on LinkedIn data. V2 asks applicants for more profile URLs upfront, then uses a multi-strategy fetch cascade to get data from whatever sources are available.

What Breaks First at Scale

Trust calibration. The brief is good enough that ops might rubber-stamp it instead of reviewing it. The self-assessment loops and human review flags resist this, and the system tells you when it's uncertain, but the real safeguard is organisational: keeping humans accountable for the decisions the system explicitly refuses to make.

Other scaling concerns include API cost management at 200+ briefs per month, rate-limiting from external sources like LinkedIn, and the need for a proper database as the file-based storage approach won't hold under concurrent load. But the trust calibration problem is the most interesting because it's not a technical problem. It's a human one. The system is designed to make human oversight easier, but it can't force people to actually exercise it.

Claude Agent SDK: Claude Sonnet 4.6
FastAPI + SSE: real-time streaming backend
Vanilla JS + HTMX: lightweight frontend
13 MCP Tools: custom in-process server
~$2.50 CAD: per brief in API costs
Docker: deployed on Render

Why This Finally Got Built

I had been circling this idea since 2022, when it became obvious that a lot of legacy workflows were about to look embarrassingly pre-AI. District 3's admissions research process was one of them. High-volume, repetitive, cognitively messy, and still dependent on humans to stitch together context from scattered sources. It was exactly the kind of system I wanted to rebuild, but not the kind of internal project you can casually decide to spin up on your own.

The Wealthsimple application prompt was what actually made me build this; it asked for a real system, not a speculative deck, and it explicitly cared about where AI should take on responsibility and where it should stop. That was the kick in the pants I needed to finally build the thing properly.

What interested me most was the pattern and how it can be applied to other use cases: high-volume review work with fragmented evidence, real operational pressure, and a decision boundary that still belongs to a human. The same pattern shows up in places like KYC and AML review. I would not hand final judgment to the model, but I would trust it to assemble the case, surface inconsistencies, rank risk, and give a human reviewer a far better starting point, with valuable time saved.