AgentOps: Antifragile IT Ops with AI Agents
GenAI Agents will make IT systems more resilient – and free up eng bandwidth
The tooling built over the past decade in the name of DevOps and AIOps was supposed to make IT operations easier to manage, but reality is far from it. Even today, if you're the on-call engineer, your phone could erupt at 3am with a pager alert and jolt you awake, forcing you to navigate a maze of failing services, cryptic logs, and mounting pressure from management. If you're lucky, you might have a well-documented playbook to guide you through the crisis. But more often than not, you're in for a long night of debugging, stress, and lost sleep. Sometimes solving the problem might even take up your (and your team’s!) entire week.
This scenario is all too familiar to any engineer who oversees production services, and it's costing companies both human and financial resources. I’m convinced that it is time to take a fresh look at how enterprises approach IT operations management: anchor on the newer agentic capabilities of foundation models to reduce the workload on human engineers, and focus on making IT systems truly resilient. In this post, I’ll propose AgentOps[1] as this new approach to IT Ops, and present a high-level view of how this might play out based on my current understanding of the space.
From Reactive to Agentic: History of IT Ops
Imagine a world where engineers are supported by an AI agent that has already diagnosed the problem, gathered relevant data, and proposed solutions before they even open their laptop. That's the world AgentOps is creating.
The journey from siloed ops teams to AgentOps has been one of breaking down barriers, increasing collaboration, and leveraging technology to handle ever-growing complexity. Building on the foundations laid by DevOps and AIOps, AgentOps represents a leap in IT operations management. At its heart are AI agents – autonomous, intelligent systems capable of not just analyzing data but also taking action and learning from outcomes.
Here's how AgentOps transforms each phase of your company's IT operations:
A key insight here, based on my own experience of building production systems for many years, is that every outage, major or minor, in the history of a product organization is a step towards greater maturity of both the production system and the organization. The team acquires new know-how from each mistake, and in world-class product orgs such mistakes typically stem from missed details in intricately complex systems, details that are usually beyond the capacity of even very smart people to figure out a priori. AgentOps helps institutionalize these learnings and makes sure no hard-won knowledge is lost, so that the organization keeps getting better and more resilient with every mistake. In other words, AgentOps makes engineering orgs antifragile.
Why now
The know-how around GenAI agents is crystallizing in the industry and it is becoming clearer what kinds of tasks will be addressable by agentic workflows:
broad, shallow agents (e.g. GPT store, Gemini Gems, Glean Apps) that cater to a wide variety of consumer/enterprise needs and will mostly be built self-serve by end-users or companies,
deep, narrowly-focused agents that solve a specific, difficult problem for enterprises, mostly aligned towards either a function or a vertical, e.g. AgentOps targets the SRE function.
The price of foundation models for a given level of performance has dropped significantly and will continue to fall, making complex agentic loops that operate over large amounts of context economical.
As a growing share of production code becomes AI-generated rather than human-written, there is naturally a higher risk of complex outages that engineers would be under-equipped to handle without AI assistance.
Companies will be incentivized to divert a decent fraction of the engineering headcount savings from AI-driven coding efficiencies towards better tooling for detecting, resolving and preventing issues, which should increase spend in this category.
The AgentOps Advantage: Agent-Driven Incident Resolution
At its core, AgentOps relies on advances in large foundation models, their emergent reasoning, planning and long context capabilities, and the agentic workflows that will be built around these models, to re-imagine the IT operations space from the ground up. In my previous post, RAGs to Agents, I described the PARMeSAN framework for thinking about agentic capabilities. AgentOps will succeed by hitting most of those capabilities:
1. Retrieval + Long Context: With their ability to process vast amounts of data, Agents can hook into all your production systems to pull in relevant pieces of context towards solving any issue. They don't just see the current issue; they understand the history, correlations across high cardinality data, and the potential ripple effects.
2. Planning and Reasoning: Gone are the days of blind troubleshooting. Agents will employ sophisticated reasoning methods that combine LLMs with search algorithms like Monte Carlo tree search (MCTS) to develop strategic plans for incident resolution, considering multiple scenarios and their potential outcomes.
3. Automated Actions: While human oversight remains crucial, Agents can automate many routine tasks, from log analysis to initial diagnostic steps, significantly reducing time-to-resolution.
4. Memory: Unlike human teams that may struggle with knowledge transfer, Agents maintain a perfect memory of past incidents, solutions, and best practices, ensuring that hard-won knowledge is never lost.
5. Async and Ambient: Even during normal operations, it is critical for the Agent to keep running checks behind the scenes and reasoning over any signals that look anomalous, so that issues are caught in time.
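To make a couple of these capabilities concrete, here is a minimal Python sketch of the Memory and Automated Actions ideas: the agent recalls similar past incidents and proposes a fix, but leaves execution to a human. Everything in it is an assumption for illustration. The `Incident`, `IncidentMemory` and `propose_action` names are hypothetical, and the word-overlap ranking is a toy stand-in for whatever retrieval a real system would use (most likely embeddings or a search index).

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """A past incident and the fix that resolved it (hypothetical schema)."""
    summary: str
    resolution: str


@dataclass
class IncidentMemory:
    """Toy incident store; a real system would likely use a search index."""
    incidents: list[Incident] = field(default_factory=list)

    def remember(self, incident: Incident) -> None:
        self.incidents.append(incident)

    def recall(self, alert: str, top_k: int = 1) -> list[Incident]:
        """Rank past incidents by word overlap with the alert text."""
        alert_words = set(alert.lower().split())

        def overlap(incident: Incident) -> int:
            return len(alert_words & set(incident.summary.lower().split()))

        ranked = sorted(self.incidents, key=overlap, reverse=True)
        # Drop candidates that share no words with the alert at all.
        return [i for i in ranked[:top_k] if overlap(i) > 0]


def propose_action(alert: str, memory: IncidentMemory) -> dict:
    """Propose a remediation from memory; this never executes anything."""
    matches = memory.recall(alert)
    return {
        "alert": alert,
        "proposal": matches[0].resolution if matches else None,
        # Human oversight: every proposal is gated on explicit approval.
        "needs_human_approval": True,
    }


# Usage: seed the memory, then ask for a proposal on a new alert.
memory = IncidentMemory()
memory.remember(Incident("checkout service latency spike after deploy",
                         "roll back last deploy"))
memory.remember(Incident("disk full on logging host",
                         "rotate and compress logs"))
result = propose_action("latency spike on checkout service", memory)
```

The design choice worth noting is that the proposal and the approval flag travel together, so the decision to act stays with the on-call engineer while the lookup work is already done.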
Data, Data, Data
AgentOps' effectiveness stems from its ability to integrate and analyze data from a wide array of sources. From your codebase and company documents to system logs and metrics, AgentOps has a holistic view of your IT ecosystem. By analyzing historical and current discussions in email and Slack, AgentOps captures the human context often missing in purely technical analyses. Understanding user metrics and business KPIs allows AgentOps to prioritize issues based on their real-world impact. Finally, AgentOps stays updated on the latest vulnerabilities and best practices in the broader tech community, bringing external expertise to your internal challenges.
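As a rough illustration of prioritizing by real-world impact, the sketch below ranks open issues by a weighted combination of affected users and revenue at risk. The `Issue` fields, the `impact_score` formula and the default weights are all hypothetical choices of mine; a real deployment would pick its own KPIs and weighting.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str
    affected_users: int              # hypothetical user metric
    revenue_at_risk_per_hour: float  # hypothetical business KPI, in dollars


def impact_score(issue: Issue,
                 user_weight: float = 1.0,
                 revenue_weight: float = 0.01) -> float:
    """Combine user and business signals into one score (assumed weights)."""
    return (user_weight * issue.affected_users
            + revenue_weight * issue.revenue_at_risk_per_hour)


def prioritize(issues: list[Issue]) -> list[Issue]:
    """Return issues ordered from highest to lowest estimated impact."""
    return sorted(issues, key=impact_score, reverse=True)


# Usage: three open issues with different user and revenue footprints.
open_issues = [
    Issue("slow admin dashboard", affected_users=20, revenue_at_risk_per_hour=0.0),
    Issue("checkout errors", affected_users=500, revenue_at_risk_per_hour=40000.0),
    Issue("search timeouts", affected_users=800, revenue_at_risk_per_hour=5000.0),
]
ordered = prioritize(open_issues)
```

With these assumed weights, the checkout issue ranks first even though search affects more users, because the revenue term dominates.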
The Future of IT Ops is Agentic
The future of IT operations is here, and it's powered by GenAI Agents. Agents aren't just a tool; they're every engineer’s partner in creating a more resilient, efficient, and innovative IT ecosystem.
[1] I’m calling this AgentOps to differentiate it from AIOps, but I admit that I’m not particularly a fan of the name. There’s also a startup with this name that does something entirely different, but at this stage in the GenAI hype cycle, it’s hard to find a <word> where “<word>.ai” isn’t already a startup! If you have naming suggestions, please share.