Engineering our operations agent

June 27, 2025

By Arjun Vegda & Sean Abraham

At WorkWhile, shift fulfillment isn't just a logistics problem; it's a trust problem. Our customers rely on us to deliver quality workers to keep their businesses running smoothly. To do so, our small but mighty marketplace operations team was responsible for continuously monitoring thousands of shifts across multiple markets nationwide to ensure they were filled on time, with the right workers.

This blog outlines how we brought structured reasoning and AI to this workflow by building an internal operations agent that mirrors how the marketplace ops team thinks, but runs at machine scale. We'll walk through the problem, the architectural design, and the collaborative process of iteratively aligning AI with how real people do the job.

At WorkWhile, we've consistently delivered 90%+ fill rates for our customers, but that level of performance requires constant vigilance. When something slips, we need to know why and act fast.

Why analyzing shifts is hard

It's not just about metrics. It's about correctly interpreting what they mean.

  • Why did this shift underperform? Why are fewer workers signing up? These aren't questions you can answer by looking at a spreadsheet; they require operational insight and judgment

Root causes are almost always entangled.

  • Maybe a shift is posted at an odd hour in a low-liquidity market. Or maybe it requires a specific certification, and the qualified worker pool is small. Or the timing and incentives didn't align. Or maybe it's all of the above.
  • You can't fix what you don't understand. And you can't understand what you're not trained to interpret.

Analysis is skilled, manual, and constant.

  • The marketplace ops team reviews shift performance by inspecting each phase of a shift's lifecycle (past, ongoing, or upcoming) and tracking sub-milestones, such as progression through qualification tiers, using our state-of-the-art ML models
  • It's a full-time job to keep up with this across multiple regions and customers

On average, we spend dozens of hours per week, per person monitoring fill performance. This presented the perfect opportunity to create an agent to replicate this investigative work with high accuracy, so humans can spend time reviewing rather than investigating.



Shift analysis process



Designing an agent that can think like ops

Start with the job, not the tool.

  • Our first step was aligning on the role of the agent: it’s not there to make decisions but to mirror how a marketplace ops team member thinks about a shift. Investigation becomes analysis; analysis becomes insight.
  • In the traditional process, a human is responsible for investigating performance issues. With the agent, that human shifts into a reviewer role, inspecting the AI's reasoning instead of manually gathering and assessing all the raw data themselves.

We modeled the agent after how ops think

  • We conducted interviews, shadowed the marketplace ops team, and mapped out workflows. We then broke down the typical analysis flow into discrete steps and mapped each one directly to the kinds of questions marketplace ops asks when investigating a shift

  • Where is this shift in the lifecycle?

    • tool: get_shift_lifecycle
    • Determine if the shift is in the past, happening now, or in the future
  • What actually happened with this shift?

    • tool: get_shift_data
    • Check worker activity (scheduled, started, bailed, finished, etc.)
    • Look at pay and bonuses and their impact on schedule rate
    • Review historical log of worker activity on the shift
  • Did the time of day make this shift hard to fill?

    • tool: analyze_time_phase
    • Classify shifts as overnight, early morning, or awkward so the agent can reason about time. Humans know intuitively that 3 AM shifts are harder to fill than 11 AM ones, but that intuition needs to be encoded for an agent (see the sketch after this list)
  • When and how was this shift made visible to different pools of workers?

    • tool: get_worker_pool_stats
    • This is one of the most nuanced signals. The agent investigates potential factors such as the initial eligible pool being too small or waiting too long to open the shift up to more workers.
    • It's how the agent reasons about exposure and worker interest vs. timing
  • How does this shift compare to others like it?

    • tool: get_similar_past_shifts
    • Humans build intuition over time. The agent replicates that by pulling in past shifts with similar roles, times, and markets, and checking for pattern divergence

Each step of the agent's workflow is grounded in existing data and tools, ensuring it can reason like a real person, but faster, more consistently, and at scale.
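
To make the time-of-day point concrete, here is a minimal sketch of the kind of classification a tool like analyze_time_phase could perform. The buckets and cutoffs below are illustrative assumptions, not our production logic:

```python
from datetime import time

def classify_time_phase(start: time) -> str:
    """Label a shift's start time so the agent can reason about fill difficulty.

    The buckets and boundaries here are hypothetical; the real tool also
    weighs factors like market liquidity and historical fill rates.
    """
    if start >= time(22, 0) or start < time(4, 0):
        return "overnight"      # e.g. 3 AM starts: hardest to fill
    if start < time(7, 0):
        return "early_morning"  # fewer workers willing or able to commute
    if start < time(20, 0):
        return "standard"       # e.g. 11 AM starts: easiest to fill
    return "awkward"            # evening starts that straddle a typical day
```

Encoding the heuristic this way makes it legible: the agent can cite "overnight" as a fill-risk factor instead of rediscovering it from raw timestamps.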



Ops agent flow



Architecting the agent system

Once we had the reasoning framework, the next question was: where should the agent live?

The answer was obvious once we mapped the agent's required data access to our existing ops tooling. Shift data and worker supply pool metrics already lived in our backend. So we built the agent to live there too.

This had three major advantages:

  1. Safe + structured tooling
    We “blessed” a small set of backend functions the agent could call. Each one wrapped a safe, observable, permissioned query over internal shift data. This meant the agent wasn't generating raw SQL; rather, it used existing, reusable library functions that made precise database calls. This ensured high data quality and significantly accelerated our development velocity (see the sketch after this list)

  2. Easy to evolve with great access to our data
    Because tools are just functions, adding new capabilities (e.g., querying historical fill patterns, worker supply data, etc.) is as simple as exposing a new handler. There is no need to rework the LLM or write external glue code

  3. Streaming LLM outputs
    We architected the system so the agent's reasoning can stream live through the UI as it generates. This keeps marketplace ops in the loop – they can watch the agent “think”, see when it's reviewing the data or digging into past shifts, and understand where conclusions are coming from
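
To illustrate the first point, here is a hedged sketch of what one “blessed” tool could look like. Every name below (ShiftActivity, require_ops_permission, fetch_shift_activity) is a hypothetical stand-in for our internal libraries, not actual WorkWhile code:

```python
from dataclasses import dataclass

@dataclass
class ShiftActivity:
    scheduled: int
    started: int
    bailed: int
    finished: int

def require_ops_permission(caller: str) -> None:
    # Stand-in for the internal permission check.
    if caller != "ops-agent":
        raise PermissionError(f"{caller} may not read shift data")

def fetch_shift_activity(shift_id: int) -> ShiftActivity:
    # Stand-in for the existing, reusable library function that makes a
    # precise database call -- the agent never writes raw SQL.
    return ShiftActivity(scheduled=12, started=10, bailed=1, finished=9)

def get_shift_data(shift_id: int, caller: str = "ops-agent") -> ShiftActivity:
    """A "blessed" tool: permissioned, observable, built on existing queries."""
    require_ops_permission(caller)                          # permissioned
    print(f"tool_call=get_shift_data shift_id={shift_id}")  # observable (stub log)
    return fetch_shift_activity(shift_id)
```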

The backend agent stack looks like this:

  1. A request comes from the WorkWhile intelligence UI layer
  2. It hits our internal controller, which invokes the agent
  3. The agent has system prompts and tools defined using Pydantic AI
  4. The LLM calls tools as needed
  5. The stream of reasoning is returned directly to the UI in real time
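
A minimal sketch of that stack, assuming Pydantic AI's public Agent, tool, and streaming APIs; the model choice, prompt, and tool body are illustrative, not our production code:

```python
from pydantic_ai import Agent

agent = Agent(
    "openai:gpt-4o",  # model choice is an assumption for this sketch
    system_prompt=(
        "You are a marketplace operations analyst. Investigate shift "
        "fill performance step by step, citing the data you retrieve."
    ),
)

@agent.tool_plain
def get_shift_lifecycle(shift_id: int) -> str:
    """Determine whether the shift is in the past, happening now, or upcoming."""
    return "past"  # placeholder for the internal library call

async def analyze_shift(shift_id: int):
    # The controller invokes the agent; its reasoning streams straight to the UI.
    async with agent.run_stream(f"Analyze shift {shift_id}") as result:
        async for chunk in result.stream_text(delta=True):
            yield chunk  # rendered live so ops can watch the agent "think"
```

Because the reasoning arrives as deltas, the UI can render it token by token, which is what lets marketplace ops watch the agent work rather than wait on a finished report.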

Defining roles: human and agent

From a workflow perspective, we made one thing very explicit: the agent is an analyst, not a decision-maker (yet). It produces reasoned reports. Humans remain in charge of action.

This clean boundary helped adoption internally:

  • The marketplace ops team still owns the outcomes
  • The agent helps them automate data collection and reason faster, with more consistent depth
  • It's a collaboration, not a competition


Ops agent



The social side of prompt refinement

Building the agent turned out to be half engineering, half anthropology.

Our system prompt didn't emerge from a vacuum. It came from:

  • Watching how marketplace ops works
  • Seeing how they phrase questions
  • Incorporating metrics they trust
  • Replicating patterns they look for

Prompt design became a cross-functional loop between engineering, product, and marketplace ops:

  1. Ship a version of the prompt
  2. Watch how the agent reasoned
  3. Review it with PMs and marketplace ops
  4. Refine

It took us 200+ iterations to get to the final system prompt. And we're still improving.

The final system prompt is more than just code. It reflects our understanding of:

  • Operational heuristics that actually matter
  • The mental models our teams use
  • The trust boundaries between human and AI

By grounding AI in human workflows, surfacing reasoning in real time, and keeping humans in control, we’ve built an agent that has become an extension of how our teams reason, not a replacement.

If bridging AI with real operational work at scale resonates, we'd love to work with you.