Engineering our operations agent
June 27, 2025
By Arjun Vegda & Sean Abraham
At WorkWhile, shift fulfillment isn't just a logistics problem; it's a trust problem. Our customers rely on us to deliver quality workers to keep their business running smoothly. To do so, our small but mighty marketplace operations team has been responsible for continuously monitoring thousands of shifts across multiple markets nationwide, making sure each one is filled on time with the right workers.
This blog outlines how we brought structured reasoning and AI to this workflow by building an internal operations agent that mirrors how the marketplace ops team thinks, but runs at machine scale. We'll walk through the problem, the architectural design, and the collaborative process of iteratively aligning AI with how real people do the job.
At WorkWhile, we've consistently delivered 90%+ fill rates for our customers, but that level of performance requires constant vigilance. When something slips, we need to know why and act fast.
Why analyzing shifts is hard
It's not just about metrics. It's about correctly interpreting what they mean.
- Why did this shift underperform? Why are fewer workers signing up? These aren't questions you can answer by looking at a spreadsheet; they require operational insight and judgment
Root causes are almost always entangled
- Maybe a shift is posted at an odd hour in a low-liquidity market. Or maybe it requires a specific certification, and the qualified worker pool is small. Or the timing and incentives didn't align. Or maybe it's all of the above
- You can't fix what you don’t understand. And you can't understand what you're not trained to interpret
Analysis is skilled, manual, and constant.
- The marketplace ops team reviews shift performance by inspecting each phase of a shift's lifecycle (past, ongoing, or upcoming) and checking sub-milestones along the way, like how a shift progresses through qualification tiers powered by our state-of-the-art ML models
- It's a full-time job to keep up with this across multiple regions and customers
On average, we spend dozens of hours per week, per person, monitoring fill performance. This presented the perfect opportunity to create an agent to replicate this investigative work with high accuracy, so humans can spend time reviewing rather than investigating.
Designing an agent that can think like ops
Start with the job, not the tool.
- Our first step was aligning on the role of the agent: it’s not there to make decisions but to mirror how a marketplace ops team member thinks about a shift. Investigation becomes analysis; analysis becomes insight.
- In the traditional process, a human is responsible for investigating performance issues. With the agent, that human shifts into a reviewer role, inspecting the AI's reasoning instead of manually gathering and assessing all the raw data themselves.
We modeled the agent after how ops think
We conducted interviews, shadowed the marketplace ops team, and mapped out workflows. We then broke down the typical analysis flow into discrete steps and mapped each one directly to the kinds of questions marketplace ops asks when investigating a shift.
Where is this shift in the lifecycle?
- tool: get_shift_lifecycle
- Determine if the shift is in the past, happening now, or in the future
What actually happened with this shift?
- tool: get_shift_data
- Check worker activity (scheduled, started, bailed, finished, etc.)
- Look at pay and bonuses and their impact on schedule rate
- Review historical log of worker activity on the shift
Did the time of day make this shift hard to fill?
- tool: analyze_time_phase
- Classify shifts as overnight, early morning, or awkward so the agent can reason about time. Humans know intuitively that 3 AM shifts are harder to fill than 11 AM ones, but that intuition needs to be encoded for an agent (see the sketch after this list)
When and how was this shift made visible to different pools of workers?
- tool: get_worker_pool_stats
- This is one of the most nuanced signals. The agent investigates potential factors such as the initial eligible pool being too small or waiting too long to open the shift up to more workers.
- It's how the agent reasons about exposure and worker interest vs. timing
How does this shift compare to others like it?
- tool: get_similar_past_shifts
- Humans build intuition over time. The agent replicates that by pulling in past shifts with similar roles, times, and markets, and checking for pattern divergence
Each step of the agent's workflow is grounded in existing data and tools, ensuring it could reason similarly to a real person, but faster, more consistently, and at scale.
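To make that concrete, here's a minimal sketch of what a tool like analyze_time_phase might look like. The phase labels come from the list above, but the hour boundaries, difficulty ratings, and return shape are illustrative assumptions rather than our production implementation.

```python
from datetime import datetime

def analyze_time_phase(shift_start: datetime) -> dict:
    """Classify a shift's start time so the agent can reason about fill difficulty.

    Illustrative sketch only: the real tool and its thresholds are internal.
    """
    hour = shift_start.hour
    if hour < 5:
        phase, difficulty = "overnight", "high"        # e.g. a 3 AM start
    elif hour < 8:
        phase, difficulty = "early_morning", "medium"
    elif 9 <= hour <= 17:
        phase, difficulty = "standard", "low"          # e.g. an 11 AM start
    else:
        phase, difficulty = "awkward", "medium"        # evenings and odd in-between times
    return {"phase": phase, "expected_fill_difficulty": difficulty}

# The 3 AM vs. 11 AM intuition, made explicit for the agent
print(analyze_time_phase(datetime(2025, 6, 27, 3)))   # {'phase': 'overnight', ...}
print(analyze_time_phase(datetime(2025, 6, 27, 11)))  # {'phase': 'standard', ...}
```

The other tools follow the same pattern: small, typed functions the agent can call and reason over.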
Architecting the agent system
Once we had the reasoning framework, the next question was: where should the agent live?
The answer was obvious once we mapped the agent's required data access to our existing ops tooling. Shift data and worker supply pool metrics already lived in our backend. So we built the agent to live there too.
This had three major advantages:
Safe + structured tooling
We “blessed” a small set of backend functions the agent could call. Each one wrapped a safe, observable, permissioned query over internal shift data. This meant the agent wasn't generating raw SQL; instead, it used existing, reusable library functions that made precise database calls. This kept data quality high and significantly accelerated our development velocity.
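As a rough sketch of that pattern, a blessed tool is just a typed wrapper around an existing internal query helper. The names below (including _fetch_shift_summary, which is stubbed so the example runs on its own) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ShiftSummary:
    shift_id: int
    scheduled: int
    started: int
    bailed: int
    finished: int

def _fetch_shift_summary(shift_id: int) -> dict:
    # Stand-in for the existing, permissioned internal query helper the real tool
    # would wrap; stubbed with fixed values so this sketch is self-contained.
    return {"scheduled": 12, "started": 10, "bailed": 1, "finished": 9}

def get_shift_data(shift_id: int) -> ShiftSummary:
    """'Blessed' tool: the agent calls this observable wrapper, never raw SQL."""
    row = _fetch_shift_summary(shift_id)
    return ShiftSummary(shift_id=shift_id, **row)
```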
Easy to evolve with great access to our data
Because tools are just functions, adding new capabilities (e.g. querying historical fill patterns, worker supply data, etc.) is as simple as exposing a new handler. There is no need to rework the LLM or write external glue code.
Streaming LLM outputs
We architected the system so the agent's reasoning can stream live through the UI as it generates. This keeps marketplace ops in the loop – they can watch the agent “think”, see when it's reviewing the data or digging into past shifts, and understand where conclusions are coming from.
The backend agent stack looks like this:
- A request comes from the WorkWhile intelligence UI layer
- It hits our internal controller, which invokes the agent
- The agent has system prompts and tools defined using Pydantic AI
- The LLM calls tools as needed
- The stream of reasoning is returned directly to the UI in real-time
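Here's a minimal sketch of that wiring using Pydantic AI. The model name, system prompt, and stubbed tool bodies are illustrative, and the exact API surface may vary across library versions, but the shape is the point: tools are plain decorated functions, and the run streams so the UI can render the agent's reasoning as it arrives.

```python
import asyncio

from pydantic_ai import Agent

# Illustrative wiring only; the real system prompt and tools live in our backend.
agent = Agent(
    "openai:gpt-4o",  # placeholder model name
    system_prompt=(
        "You are a marketplace operations analyst. Investigate shift fill "
        "performance step by step, calling tools to gather evidence before concluding."
    ),
)

@agent.tool_plain
def get_shift_lifecycle(shift_id: int) -> str:
    """Return whether the shift is past, ongoing, or upcoming (stubbed here)."""
    return "upcoming"

@agent.tool_plain
def get_worker_pool_stats(shift_id: int) -> dict:
    """Return eligible-pool size and exposure timing for the shift (stubbed here)."""
    return {"eligible_workers": 42, "hours_until_pool_broadened": 6}

async def analyze_shift(shift_id: int) -> None:
    # Stream the agent's reasoning token by token; in production this is
    # forwarded to the WorkWhile intelligence UI instead of printed.
    async with agent.run_stream(f"Why is shift {shift_id} at risk of underfilling?") as result:
        async for chunk in result.stream_text(delta=True):
            print(chunk, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(analyze_shift(123))
```

Adding a new capability, like a historical fill-pattern lookup, is just another decorated function on the agent; nothing about the LLM call itself has to change.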
Defining roles: human and agent
From a workflow perspective, we made one thing very explicit: the agent is an analyst, not a decision-maker (yet). It produces reasoned reports. Humans remain in charge of action.
This clean boundary helped adoption internally:
- Marketplace Ops team still owns the outcomes
- The agent helps them automate data collection and reason faster, with more consistent depth
- It's a collaboration, not a competition
The social side of prompt refinement
Building the agent turned out to be half engineering, half anthropology.
Our system prompt didn't emerge from a vacuum. It came from:
- Watching how marketplace ops works
- Seeing how they phrase questions
- Incorporating metrics they trust
- Replicating patterns they look for
Prompt design became a cross-functional loop between engineering, product, and marketplace ops:
- Ship a version of the prompt
- Watch how the agent reasoned
- Review it with PMs and marketplace ops
- Refine
It took us over 200 iterations to get to the final system prompt. And we're still improving.
The final system prompt is more than just code. It reflects our understanding of:
- Operational heuristics that actually matter
- The mental models our teams use
- The trust boundaries between human and AI
By grounding AI in human workflows, surfacing reasoning in real time, and keeping humans in control, we’ve built an agent that has become an extension of how our teams reason, not a replacement.
If bridging AI with real operational work at scale resonates, we'd love to work with you.