Nebulock is on a mission to democratize threat hunting. Instead of relying only on deterministic rules or reacting to alerts as they come in, the team builds AI agents that actively hunt inside an organization’s environment and surface pernicious threats with clear, actionable outcomes.
These agents can run automatically — triggered by fresh threat intelligence or new signals in the environment — or they can operate as co-pilots, where analysts interact with them in natural language to investigate anything they consider relevant or suspicious.
We recently interviewed Ron Cahlon, Founding Engineer at Nebulock, as part of our “Rise of the Agent Engineer” series.
Transcript
How are you using AI for threat hunting?
Ron Cahlon: At Nebulock, we’re on a mission to democratize what we call threat hunting. Instead of building deterministic rules or responding to alerts as they come in, we build agents that can hunt in your environment and surface pernicious threats with clear, actionable outcomes. Those agents can run automatically based on pieces of threat intel or new signals in your environment, or they can serve as co-pilots—where you interact with them in natural language to search through your environment for any threats you deem relevant or interesting.
Why are evals so important and what is your approach?
Ron Cahlon: A large pillar of what we’re doing at Nebulock is showcasing the reasoning of our agents to customers, because historically a challenge in cybersecurity has been that many tools are black boxes. We want to give customers access to the internal reasoning of our threat-hunting agents. This is where I can plug Arize: to surface that reasoning confidently, we also need to understand how our agents are executing tasks. The Arize platform has given us confidence in our agents’ internal reasoning, so we can showcase it to customers.
The outputs of what we’re doing at Nebulock drive decisions for security operations teams, so we need a high degree of confidence that our agents are doing what we expect them to do. To ensure that, we use a combination of human annotations and LLM-as-a-judge to evaluate both the outputs and the reasoning steps of our agents, using the Arize platform along the way to manage those evaluations and alert on them.
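As a rough illustration of that combination, here is a minimal, hypothetical LLM-as-a-judge check that grades whether an agent's verdict is actually supported by its recorded reasoning steps. The prompt, model choice, and helper names are assumptions for illustration, not Nebulock's implementation or Arize's API.

```python
# Hypothetical LLM-as-a-judge evaluator; none of these names come from Nebulock.
# It scores whether an agent's verdict is supported by its recorded reasoning steps.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a threat-hunting agent.
Question: {question}
Agent reasoning steps: {reasoning}
Agent verdict: {verdict}

Answer with a JSON object: {{"label": "supported" | "unsupported", "explanation": "..."}}
"label" is "supported" only if the verdict follows from the reasoning steps."""

def judge(question: str, reasoning: list[str], verdict: str) -> dict:
    """Return the judge's label and explanation for one agent run."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                reasoning="\n".join(reasoning),
                verdict=verdict,
            ),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: grade a single hunt result before a human annotator reviews it.
result = judge(
    question="Is host WS-042 showing signs of credential dumping?",
    reasoning=["Found lsass.exe memory access by an unsigned binary",
               "Binary hash matches a known Mimikatz variant"],
    verdict="Likely credential dumping; escalate to the SOC.",
)
print(result["label"], "-", result["explanation"])
```

In practice these judge labels would sit alongside the human annotations, so disagreements between the two can be reviewed and fed back into the rubric.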
One trend we see is non-engineer subject-matter experts getting more involved in evals and prompt optimization – are you seeing that, and how do you manage it?
Ron Cahlon: We have world-class threat hunters on our team—from various national agencies and enterprise cybersecurity companies—and our goal is to embed their decades of expertise into our agents. During agent development and improvements, we often pair these threat hunters with engineers to iterate together on prompts and run experiments, so we can embed their wisdom throughout the development process.
What’s the most non-obvious lesson you’ve learned?
Ron Cahlon: We have a multi-agent system, so whenever we’re spinning up a new agent, it can feel like starting from scratch. The biggest lesson from those early iterations is to invest in a strong experimentation and development flow that lets you iterate quickly and with confidence. As you make changes, you want to clearly see how those changes impact improvements—or degradation—in the most important segments of the customer experience. Investing in that tight iteration cycle is key to shipping fast and reliably.
If starting from scratch today, what would you build first—and what would you avoid—to reach production faster?
Ron Cahlon: I’d build the experimentation and development flow first—something that makes it easy to iterate quickly, measure impact, and stay confident as you change prompts, tools, or agent behavior. I’d avoid shipping without that feedback loop, because in a multi-agent system it’s otherwise hard to tell whether you’re improving the experience or introducing regressions in the parts customers care about most.
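To make that feedback loop concrete, here is a minimal, hypothetical sketch of the kind of harness Ron describes: scoring prompt variants against a golden dataset and breaking the results down by segment so regressions in the areas customers care about show up immediately. The dataset, segments, and stubbed agent call are all assumptions for illustration.

```python
# Hypothetical experiment harness: compare two prompt variants against a small
# golden dataset and report accuracy per segment. All names and data here are
# illustrative assumptions, not Nebulock's actual datasets or agents.
from collections import defaultdict

GOLDEN_SET = [
    {"segment": "lateral_movement", "input": "psexec from an HR workstation", "expected": "escalate"},
    {"segment": "lateral_movement", "input": "admin RDP during a change window", "expected": "benign"},
    {"segment": "persistence", "input": "new run key pointing to a temp directory", "expected": "escalate"},
]

def run_agent(prompt_variant: str, case_input: str) -> str:
    """Placeholder for the real agent call; swap in the multi-agent pipeline here."""
    return "escalate"  # stub so the example runs end to end

def evaluate(prompt_variant: str) -> dict[str, float]:
    """Per-segment accuracy of one prompt variant over the golden dataset."""
    correct, total = defaultdict(int), defaultdict(int)
    for case in GOLDEN_SET:
        verdict = run_agent(prompt_variant, case["input"])
        total[case["segment"]] += 1
        correct[case["segment"]] += int(verdict == case["expected"])
    return {segment: correct[segment] / total[segment] for segment in total}

# Run both variants side by side to spot regressions in the segments that matter most.
for variant in ("baseline_prompt", "candidate_prompt"):
    print(variant, evaluate(variant))
```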
How do you work with Arize AX / why did you select Arize AX? Any early findings/successes?
Ron Cahlon: We use Arize AX across our observability and evaluation workflows—tracking experiments, observing traces, and running evals—to get visibility into what’s going on and to evaluate performance in production. The robustness—being able to use the platform across many parts of the lifecycle—is why we chose it. We didn’t want an isolated tool that only focuses on evals or only on observability; we wanted an end-to-end system that supports development and production monitoring.
Early wins were around experiment tracking and building golden datasets to support development. Now we’re seeing a lot of value in organizing production traces and monitoring them as the platform and product grow.
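For readers who want a picture of what that production trace organization can build on, here is a minimal sketch using plain OpenTelemetry: one parent span per hunt with a child span per agent step. The span names, attributes, and console exporter are illustrative assumptions, not Nebulock's or Arize's actual instrumentation; a production setup would export to an observability backend rather than the console.

```python
# Hypothetical tracing sketch using plain OpenTelemetry; span names and
# attributes are illustrative, not Nebulock's or Arize's actual schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# In production an OTLP exporter would point at the observability backend;
# the console exporter keeps this example self-contained.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("threat-hunting-agent")

def run_hunt(hypothesis: str) -> str:
    """One agent run, recorded as a parent span with a child span per step."""
    with tracer.start_as_current_span("hunt") as hunt_span:
        hunt_span.set_attribute("hunt.hypothesis", hypothesis)
        with tracer.start_as_current_span("query_telemetry"):
            hits = ["lsass.exe access from an unsigned binary"]  # placeholder result
        with tracer.start_as_current_span("reason_over_hits") as step_span:
            step_span.set_attribute("hunt.evidence", hits[0])
            verdict = "Likely credential dumping; escalate."
            step_span.set_attribute("hunt.verdict", verdict)
        return verdict

print(run_hunt("Credential dumping on WS-042"))
```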