Austin, Texas-based Keller Williams Realty, LLC is the world’s largest real estate franchise by agent count. It has more than 1,000 market center offices and 161,000 affiliated agents. The franchise is No. 1 in units and sales volume in the U.S. Since 1983, the company has cultivated an agent-centric, technology-driven, and education-based culture that rewards affiliated agents.
In this interview, we catch up with Keller Williams Realty’s Koby Close (Senior Analytics Engineer) and Venkat Chinni (Software Architect) on the company’s AI and agent engineering journey.
Gen-AI Use Cases At Keller Williams
Close: There’s a lot of enthusiasm and excitement around GenAI at KW right now, and our current cycle is focused on establishing good patterns, best practices, and expectations across teams so we can build on that momentum and deliver quickly in the future.
As a tech team, our main customers are our KW agents, and many of the future use cases are about improving efficiency around our CRM tool and empowering agents to work more efficiently.
Our initial use case is an internal efficiency tool for the tech department. I’m excited about it because it touches a lot of different personas. Pretty much anyone in the tech department will have some tool in there that helps them. We’ve got Jira tools that are great for PMs and Scrum Masters, good data tools for developers and analysts, and even capabilities for executives to answer some data-related questions.
For us, it’s been a great opportunity to establish ideas around scaling, security, and evals on the Arize platform, and to take on some technical challenges—working mostly on migrating all of our tools to MCP servers so they’re reusable. A big technical challenge we can talk about is our text-to-SQL product that we’re continuing to work on.
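(For illustration only: a tool exposed through an MCP server can be called by any MCP-aware client or agent, which is what makes it reusable across teams. Below is a minimal sketch using the open-source Python MCP SDK; the server name, tool, and data are hypothetical, not KW’s actual implementation.)

```python
# Illustrative sketch only: exposing an internal tool through an MCP server so
# any MCP-aware client or agent can reuse it. Names and data are hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("kw-internal-tools")  # hypothetical server name

@mcp.tool()
def lookup_listing_status(listing_id: str) -> str:
    """Return the current status for a listing ID (stand-in for a real data source)."""
    fake_db = {"TX-1001": "Active", "TX-1002": "Pending"}
    return fake_db.get(listing_id, "Unknown listing")

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```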
How the Arize Partnership Started
Close: We actually started our partnership with Arize on the MLOps platform. Quick background: our data science team had a computer vision project and needed an in-house solution compatible with Vertex AI, and Arize offered that as a solution for us.
Why Are Evals So Important?
Chinni: Evals are critical because AI changed the quality model we relied on for decades. In traditional software, you can map every user interaction to a predictable API call. The system always behaves the same way, and QA can validate those deterministic paths.
With AI, the user’s raw language flows directly into the model. It has to interpret the intent and meaning in real time, and that’s nondeterministic in nature. Traditional QA breaks down. You can’t just click through test scripts and say it works.
That’s where evals come into the picture. They’re the only way to know if your system is actually working for real people. But engineers alone can’t write effective evals because we don’t talk the way users do. For example, an engineer might say, “What’s the listing price for 123 Main Street?” A real agent might say, “What’s the house on Main going for?”
The approach is more product-led. We bring product teams directly into the evaluation process because they know the slang, shorthand, and actual vocabulary of users. On top of that, we layer in risk-management evals like hallucination checks, toxicity filtering, and user-frustration scenarios.
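(As a simplified illustration, not KW’s actual eval suite: a product-led eval set might pair an engineer-style phrasing with the way a real agent would ask the same question, and grade both with an LLM-as-judge check. The judge and answer functions below are placeholders for whatever the real system and eval platform provide.)

```python
# Illustrative sketch of a product-led eval set: each case pairs an engineer-style
# phrasing with the way a real agent might ask the same question. The judge
# function is a placeholder for whatever eval platform or model grades responses.
from dataclasses import dataclass

@dataclass
class EvalCase:
    canonical: str       # engineer-style phrasing
    user_style: str      # phrasing contributed by the product team
    expected_answer: str

CASES = [
    EvalCase(
        canonical="What is the listing price for 123 Main Street?",
        user_style="What's the house on Main going for?",
        expected_answer="$425,000",
    ),
]

def run_evals(answer_fn, judge_fn):
    """Run both phrasings of every case through the system and grade the answers."""
    results = []
    for case in CASES:
        for question in (case.canonical, case.user_style):
            answer = answer_fn(question)
            # An LLM-as-judge check grades correctness and hallucination rather
            # than exact string match, since correct answers can be worded differently.
            verdict = judge_fn(question=question, answer=answer,
                               reference=case.expected_answer)
            results.append((question, answer, verdict))
    return results
```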
Evals aren’t a one-time exercise you implement and let run; they’re a continuous loop where new user queries feed back into the system and into the evals themselves.
It’s like building a new organizational muscle—product-led AI quality assurance is going to be a thing. Without that, you can’t succeed in production.
What Are Some Non-Obvious Lessons From Building AI Systems?
Close: We had some hurdles to overcome with text-to-SQL, and that was a big wake-up call for me. I work on our enterprise data lake and have a good understanding of how that data is laid out and its nuances. Seeing our chatbot struggle with those scenarios was interesting and not what I expected.
We started by turning the bot loose on the entire data lake. It struggled to identify the right tables to query, and even if you coached it to the right set of tables, writing multi-step joins or adding complex business logic was outside its reach. A lot of that had to do with our documentation within BigQuery. We have great documentation elsewhere, but not on the tables themselves in a way that was accessible for our agentic solutions.
We weren’t going to be able to manually go back and resolve all those issues, so we tried other approaches. We built a “known-good SQL” library—anything the bot wrote that we liked or that we used frequently—to give direction or shortcut some of the complex query writing. We added table descriptions on key tables or fields it struggled to identify. We also limited scope: “Here are the 20 most popular tables at the company—maybe this will work.”
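(To picture the known-good SQL library, here is a minimal sketch, not the production implementation: a retrieval step that matches an incoming question against stored, vetted queries before the model writes anything from scratch. The embedding function is a placeholder for a real embedding model.)

```python
# Illustrative sketch: check an incoming question against a library of vetted
# ("known-good") SQL before asking the model to write anything from scratch.
import numpy as np

KNOWN_GOOD = [
    {"question": "How many listings closed last month?",
     "sql": "SELECT COUNT(*) FROM listings WHERE closed_date >= DATE_TRUNC(CURRENT_DATE(), MONTH)"},
    {"question": "Average days on market by region",
     "sql": "SELECT region, AVG(days_on_market) FROM listings GROUP BY region"},
]

def embed(text: str) -> np.ndarray:
    """Placeholder: a deterministic fake embedding so the sketch runs on its own."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(16)

def best_known_query(user_question: str, threshold: float = 0.8):
    """Return the closest known-good query, or None if nothing is similar enough."""
    q = embed(user_question)
    scored = []
    for item in KNOWN_GOOD:
        v = embed(item["question"])
        similarity = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((similarity, item))
    similarity, item = max(scored, key=lambda pair: pair[0])
    return item if similarity >= threshold else None
```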
Performance still didn’t meet the bar for turning this loose on someone without our technical understanding of the data. That was surprising.
Through research, we landed on a promising solution: we created a semantic layer for the bot by writing views and exposing only those views to the solution. The advantage is that the joins and business logic are taken care of in the view definition. We rely on the expertise of our BI team to assemble data in a way that aligns with company expectations.
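(To make the semantic-layer idea concrete: a view can encode the joins and business rules once, and the bot is only granted access to views like it. Below is a sketch using the BigQuery Python client; the project, dataset, and column names are made up.)

```python
# Illustrative sketch: publish a view that bakes in the joins and business rules,
# then expose only views like this to the text-to-SQL agent. All names are made up.
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials

view = bigquery.Table("my-project.semantic_layer.listing_summary")
view.view_query = """
    SELECT
      l.listing_id,
      l.list_price,
      a.agent_name,
      r.region_name
    FROM `my-project.raw.listings` AS l
    JOIN `my-project.raw.agents`   AS a ON a.agent_id  = l.agent_id
    JOIN `my-project.raw.regions`  AS r ON r.region_id = l.region_id
    WHERE l.status != 'deleted'  -- business rule handled once, in the view definition
"""
client.create_table(view)
```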
We’ve limited scope to data already available through other BI tools like Tableau, on the assumption that this covers most user requests, since people are already using Tableau for these things; the chatbot just adds efficiency. You don’t have to open an entire platform and add filters to get a single number. We’re making progress, but even though our data is fairly well organized, I was surprised at the hurdles and at the level of detail the AI requires.
If Starting Fresh Today, What Would You Build First To Reach Production Sooner?
Chinni: I read an MIT stat that 95% of AI initiatives fail to generate profit. The reason is clear: it’s easy to prototype, but if you don’t build it with the right foundation, it crashes.
If I were starting fresh today, I’d build the “boring” stuff first: evaluation frameworks from day one—literally from the first prompt, not after the first deployment. Then I’d define non-functional requirements like latency budgets, fallback strategies, and error handling. That helps mitigate or avoid scenarios like, “What happens if your LLM is down or you hit rate limits?”
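(A rough sketch of what those non-functional requirements can look like in code; the retry count, timeout budget, and fallback message are placeholder choices, not a recommended configuration.)

```python
# Illustrative sketch: give every LLM call a latency budget, a bounded retry, and
# a fallback path, so an outage or rate limit degrades gracefully instead of failing.
import time

class LLMUnavailable(Exception):
    pass

def call_llm(prompt: str, timeout_s: float) -> str:
    """Placeholder for the real model client; here it always simulates an outage."""
    raise LLMUnavailable("primary model timed out")

def answer_with_fallback(prompt: str, budget_s: float = 5.0) -> str:
    start = time.monotonic()
    for attempt in range(2):  # small, bounded retry
        remaining = budget_s - (time.monotonic() - start)
        if remaining <= 0:
            break
        try:
            return call_llm(prompt, timeout_s=remaining)
        except LLMUnavailable:
            time.sleep(min(0.5 * (attempt + 1), max(remaining, 0)))  # backoff within the budget
    # Fallback: a canned response, a cached answer, or a cheaper backup model.
    return "The assistant is temporarily unavailable. Please try again shortly."

print(answer_with_fallback("What's the listing price for 123 Main Street?"))
```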
Next is observability and cost tracking. Every LLM call costs money, and without good tracking you either get a shocking bill or you’re forced to shut down something that’s actually working because your ROI isn’t clear.
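(A rough sketch of per-call cost tracking; the per-token prices are placeholders, not real rates.)

```python
# Illustrative sketch: record token usage and estimated cost per feature on every
# LLM call, so spend and ROI can be attributed before the invoice arrives.
from collections import defaultdict

PRICE_PER_1K_TOKENS = {"input": 0.0005, "output": 0.0015}  # placeholder rates, not real pricing

spend_by_feature = defaultdict(float)

def record_llm_call(feature: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one call and accumulate it under the owning feature."""
    cost = (input_tokens / 1000) * PRICE_PER_1K_TOKENS["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K_TOKENS["output"]
    spend_by_feature[feature] += cost
    return cost

record_llm_call("text-to-sql", input_tokens=1800, output_tokens=350)
record_llm_call("jira-assistant", input_tokens=900, output_tokens=120)
print(dict(spend_by_feature))  # e.g. {'text-to-sql': 0.001425, 'jira-assistant': 0.00063}
```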
For enterprises, governance and compliance belong in the foundation too—audit logs, PII handling, and explainability.
Once those guardrails are in place—or at least well thought-through—I’d pick a very narrow use case and get it completely right, not just functionally but operationally. Then I’d expand with different use cases. That’s how you move into the top 5% of AI initiatives that succeed.
What to avoid: moving fast without a strategy. In traditional software, you can patch a bug overnight. In AI, one bad response can destroy user trust. I’d also avoid chasing novelty use cases. Instead, align projects with business workflows that actually matter. That’s how you build something that lasts.
When an Agent Becomes a “System of Systems”
Chinni: An agent becomes a system of systems once it stops being a single prompt-and-response and starts orchestrating multiple components to deliver an outcome.
For example, answering “What’s the listing price?” won’t be a single LLM call. The agent might retrieve MLS data, compare market comps, call APIs, apply business rules, and then present the answer. At that point, it’s no longer just an agent—it’s the coordination of a network of tools, data sources, and business logic.
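(A toy sketch of that shape: one question fans out into retrieval, comps, business rules, and a final synthesis step. Every function here is a stand-in for a real subsystem, not an actual implementation.)

```python
# Illustrative sketch: the orchestration behind one seemingly simple question.
# Each helper stands in for a real subsystem (MLS feed, comps API, rules, LLM).
def fetch_mls_record(address: str) -> dict:
    return {"address": address, "list_price": 425_000}       # data retrieval (MLS)

def fetch_market_comps(listing: dict) -> list[dict]:
    return [{"price": 410_000}, {"price": 440_000}]          # external comps API

def apply_pricing_rules(listing: dict, comps: list[dict]) -> int:
    return listing["list_price"]                             # deterministic business logic

def summarize_for_agent(listing: dict, price: int, comps: list[dict]) -> str:
    comp_avg = sum(c["price"] for c in comps) / len(comps)   # final LLM call in practice
    return f"{listing['address']} is listed at ${price:,} (comps average ${comp_avg:,.0f})."

def answer_listing_price(address: str) -> str:
    listing = fetch_mls_record(address)
    comps = fetch_market_comps(listing)
    price = apply_pricing_rules(listing, comps)
    return summarize_for_agent(listing, price, comps)

print(answer_listing_price("123 Main Street"))
```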
That shift brings new responsibilities: reliability engineering, observability across multiple calls, governance around data handling, and an eval layer that covers everything from the initial retrieval to the final response.
For us, an agent becomes a system of systems the moment it orchestrates multiple moving parts, and at that stage we treat it with the same rigor as any other enterprise system.
The Agent Engineer Role at KW®
Chinni: At KWRI, the agent engineer is a hybrid role that bridges AI capabilities with business value. They need strong software engineering fundamentals, and they also have to understand how models behave, how to design prompts, and when to use strategies like retrieval or orchestration.
They’re not just building features; they own the reliability of the system. That includes prompt design, RAG pipelines, observability, and architectural decisions such as when to call APIs versus models.
What sets them apart from a typical ML engineer is the focus. ML engineers train models. Agent engineers orchestrate them, evaluate them, and get them into production in a safe, sustainable way. They also partner closely with product and compliance teams, acting as a translator between what’s technically possible and what creates business value.
In short, an agent engineer is the architect of trustworthy, production-ready AI agents.
What Results Have You Seen Working With Arize AX?
Close: Our text-to-SQL solution runs a multi-step series of tool calls: searching for the right tables, searching a known-good query database, and then writing and executing the query.
Using the Arize platform during development, we saw that the agent was struggling to call those tools in the right order without making wasteful calls. On the surface, it looked fine—it returned a query that matched what we wanted and delivered the data.
Only by digging into the spans were we able to see the inefficiency that would rack up a big bill through excessive calls. That’s one example where being able to dig into details while we were developing really helped us be more efficient.
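(For illustration, the span-level view described here comes from tracing every step of the agent. Below is a minimal sketch using OpenTelemetry; exporter configuration is omitted, and the tool and attribute names are made up rather than KW’s actual schema.)

```python
# Illustrative sketch: wrap each step of the text-to-SQL agent in its own span so
# tool-call order, counts, and latency show up in a tracing/observability platform.
from opentelemetry import trace

tracer = trace.get_tracer("text-to-sql-agent")

def run_text_to_sql(question: str) -> str:
    with tracer.start_as_current_span("agent.text_to_sql") as root:
        root.set_attribute("input.question", question)

        with tracer.start_as_current_span("tool.search_tables"):
            tables = ["semantic_layer.listing_summary"]       # placeholder result

        with tracer.start_as_current_span("tool.search_known_good_sql"):
            known_good = None                                  # placeholder: no match found

        with tracer.start_as_current_span("tool.write_and_execute_sql") as span:
            sql = "SELECT AVG(list_price) FROM semantic_layer.listing_summary"
            span.set_attribute("sql.statement", sql)
            rows = [[425_000]]                                 # placeholder query result

        return f"Average list price: ${rows[0][0]:,}"
```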