Engineering our operations agent
June 27, 2025
By Arjun Vegda & Sean Abraham
At WorkWhile, shift fulfillment isn't just a logistics problem; it's a trust problem. Our customers rely on us to deliver quality workers to keep their business running smoothly. To do so, our small but mighty marketplace operations team has been responsible for continuously monitoring thousands of shifts across multiple markets nationwide, making sure each one is filled on time with the right workers.
This blog outlines how we brought structured reasoning and AI to this workflow by building an internal operations agent that mirrors how the marketplace ops team thinks, but runs at machine scale. We'll walk through the problem, the architectural design, and the collaborative process of iteratively aligning AI with how real people do the job.
At WorkWhile, we've consistently delivered 90%+ fill rates for our customers, but that level of performance requires constant vigilance. When something slips, we need to know why and act fast.
Why analyzing shifts is hard
It's not just about metrics. It's about correctly interpreting what they mean.
- Why did this shift underperform? Why are fewer workers signing up? These aren't questions you can answer by looking at a spreadsheet; they require operational insight and judgment
Root causes are almost always entangled
- Maybe a shift is posted at an odd hour in a low-liquidity market. Or maybe it requires a specific certification, and the qualified worker pool is small. Or the timing and incentives didn't align. Or maybe it's all of the above
- You can't fix what you don’t understand. And you can't understand what you're not trained to interpret
Analysis is skilled, manual, and constant.
- The marketplace ops team reviews shift performance by inspecting each phase of a shift's lifecycle (past, ongoing, or upcoming) and checking sub-milestones along the way, like how a shift progresses through qualification tiers powered by our state-of-the-art ML models
- It's a full-time job to keep up with this across multiple regions and customers
On average, we spend dozens of hours per week, per person, monitoring fill performance. This presented the perfect opportunity to create an agent to replicate this investigative work with high accuracy, so humans can spend time reviewing rather than investigating.
Designing an agent that can think like ops
Start with the job, not the tool.
- Our first step was aligning on the role of the agent: it’s not there to make decisions but to mirror how a marketplace ops team member thinks about a shift. Investigation becomes analysis; analysis becomes insight.
- In the traditional process, a human is responsible for investigating performance issues. With the agent, that human shifts into a reviewer role, inspecting the AI's reasoning instead of manually gathering and assessing all the raw data themselves.
We modeled the agent after how ops think
We conducted interviews, shadowed the marketplace ops team, and mapped out workflows. We then broke down the typical analysis flow into discrete steps and mapped each one directly to the kinds of questions marketplace ops asks when investigating a shift.
Where is this shift in the lifecycle?
- tool: get_shift_lifecycle
- Determine if the shift is in the past, happening now, or in the future
What actually happened with this shift?
- tool: get_shift_data
- Check worker activity (scheduled, started, bailed, finished, etc.)
- Look at pay and bonuses and their impact on schedule rate
- Review historical log of worker activity on the shift
Did the time of day make this shift hard to fill?
- tool: analyze_time_phase
- Classify shifts as overnight, early morning, or awkward so the agent can reason about time. Humans know intuitively that 3 AM shifts are harder to fill than 11 AM ones, but that intuition needs to be encoded for an agent (see the sketch after this list)
When and how was this shift made visible to different pools of workers?
- tool: get_worker_pool_stats
- This is one of the most nuanced signals. The agent investigates potential factors such as the initial eligible pool being too small or waiting too long to open the shift up to more workers.
- It's how the agent reasons about exposure and worker interest vs. timing
How does this shift compare to others like it?
- tool: get_similar_past_shifts
- Humans build intuition over time. The agent replicates that by pulling in past shifts with similar roles, times, and markets, and checking for pattern divergence
Each step of the agent's workflow is grounded in existing data and tools, ensuring it could reason similarly to a real person, but faster, more consistently, and at scale.
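To make that concrete, here's a minimal sketch of what a tool like analyze_time_phase might look like. The phase labels come from the list above, but the hour boundaries, difficulty ratings, and return shape are illustrative assumptions rather than our production implementation.

```python
from datetime import datetime

def analyze_time_phase(shift_start: datetime) -> dict:
    """Classify a shift's start time so the agent can reason about fill difficulty.

    Illustrative sketch only: the real tool and its thresholds are internal.
    """
    hour = shift_start.hour
    if hour < 5:
        phase, difficulty = "overnight", "high"        # e.g. a 3 AM start
    elif hour < 8:
        phase, difficulty = "early_morning", "medium"
    elif 9 <= hour <= 17:
        phase, difficulty = "standard", "low"          # e.g. an 11 AM start
    else:
        phase, difficulty = "awkward", "medium"        # evenings and odd in-between times
    return {"phase": phase, "expected_fill_difficulty": difficulty}

# The 3 AM vs. 11 AM intuition, made explicit for the agent
print(analyze_time_phase(datetime(2025, 6, 27, 3)))   # {'phase': 'overnight', ...}
print(analyze_time_phase(datetime(2025, 6, 27, 11)))  # {'phase': 'standard', ...}
```

The other tools follow the same pattern: small, typed functions the agent can call and reason over.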
Architecting the agent system
Once we had the reasoning framework, the next question was: where should the agent live?
The answer was obvious once we mapped the agent's required data access to our existing ops tooling. Shift data and worker supply pool metrics already lived in our backend. So we built the agent to live there too.
This had three major advantages:
Safe + structured tooling
We “blessed” a small set of backend functions the agent could call. Each one wrapped a safe, observable, permissioned query over internal shift data. This meant the agent wasn't generating raw SQL; instead, it used existing, reusable library functions that made precise database calls. This kept data quality high and significantly accelerated our development velocity.
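As a rough sketch of that pattern, a blessed tool is just a typed wrapper around an existing internal query helper. The names below (including _fetch_shift_summary, which is stubbed so the example runs on its own) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ShiftSummary:
    shift_id: int
    scheduled: int
    started: int
    bailed: int
    finished: int

def _fetch_shift_summary(shift_id: int) -> dict:
    # Stand-in for the existing, permissioned internal query helper the real tool
    # would wrap; stubbed with fixed values so this sketch is self-contained.
    return {"scheduled": 12, "started": 10, "bailed": 1, "finished": 9}

def get_shift_data(shift_id: int) -> ShiftSummary:
    """'Blessed' tool: the agent calls this observable wrapper, never raw SQL."""
    row = _fetch_shift_summary(shift_id)
    return ShiftSummary(shift_id=shift_id, **row)
```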
Easy to evolve with great access to our data
Because tools are just functions, adding new capabilities (e.g. querying historical fill patterns, worker supply data, etc.) is as simple as exposing a new handler. There is no need to rework the LLM or write external glue code.
Streaming LLM outputs
We architected the system so the agent's reasoning can stream live through the UI as it generates. This keeps marketplace ops in the loop – they can watch the agent “think”, see when it's reviewing the data or digging into past shifts, and understand where conclusions are coming from.
The backend agent stack looks like this:
- A request comes from the WorkWhile intelligence UI layer
- It hits our internal controller, which invokes the agent
- The agent has system prompts and tools defined using Pydantic AI
- The LLM calls tools as needed
- The stream of reasoning is returned directly to the UI in real-time
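Here's a minimal sketch of that wiring using Pydantic AI. The model name, system prompt, and stubbed tool bodies are illustrative, and the exact API surface may vary across library versions, but the shape is the point: tools are plain decorated functions, and the run streams so the UI can render the agent's reasoning as it arrives.

```python
import asyncio

from pydantic_ai import Agent

# Illustrative wiring only; the real system prompt and tools live in our backend.
agent = Agent(
    "openai:gpt-4o",  # placeholder model name
    system_prompt=(
        "You are a marketplace operations analyst. Investigate shift fill "
        "performance step by step, calling tools to gather evidence before concluding."
    ),
)

@agent.tool_plain
def get_shift_lifecycle(shift_id: int) -> str:
    """Return whether the shift is past, ongoing, or upcoming (stubbed here)."""
    return "upcoming"

@agent.tool_plain
def get_worker_pool_stats(shift_id: int) -> dict:
    """Return eligible-pool size and exposure timing for the shift (stubbed here)."""
    return {"eligible_workers": 42, "hours_until_pool_broadened": 6}

async def analyze_shift(shift_id: int) -> None:
    # Stream the agent's reasoning token by token; in production this is
    # forwarded to the WorkWhile intelligence UI instead of printed.
    async with agent.run_stream(f"Why is shift {shift_id} at risk of underfilling?") as result:
        async for chunk in result.stream_text(delta=True):
            print(chunk, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(analyze_shift(123))
```

Adding a new capability, like a historical fill-pattern lookup, is just another decorated function on the agent; nothing about the LLM call itself has to change.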
Defining roles: human and agent
From a workflow perspective, we made one thing very explicit: the agent is an analyst, not a decision-maker (yet). It produces reasoned reports. Humans remain in charge of action.
This clean boundary helped adoption internally:
- Marketplace Ops team still owns the outcomes
- The agent helps them automate data collection and reason faster, with more consistent depth
- It's a collaboration, not a competition
The social side of prompt refinement
Building the agent turned out to be half engineering, half anthropology.
Our system prompt didn't emerge from a vacuum. It came from:
- Watching how marketplace ops works
- Seeing how they phrase questions
- Incorporating metrics they trust
- Replicating patterns they look for
Prompt design became a cross-functional loop between engineering, product, and marketplace ops:
- Ship a version of the prompt
- Watch how the agent reasoned
- Review it with PMs and marketplace ops
- Refine
It took us over 200 iterations to get to the final system prompt. And we're still improving.
The final system prompt is more than just code. It reflects our understanding of:
- Operational heuristics that actually matter
- The mental models our teams use
- The trust boundaries between human and AI
By grounding AI in human workflows, surfacing reasoning in real time, and keeping humans in control, we’ve built an agent that has become an extension of how our teams reason, not a replacement.
If bridging AI with real operational work at scale resonates, we'd love to work with you.