How ngrok helps Sam develop and test LLM-powered RAG applications

February 6, 2025 | 3 min read
Team ngrok
The following is a guest blog post from Adil Aijaz, CEO of HeySam, the company behind Sam.

LLMs aren’t built with ngrok.

But the AI systems built on top of them are.

That’s because AI systems, whether RAG apps or AI agents, are inherently distributed. Like any distributed system, they need connectivity infrastructure so the engineers building them can prototype new ideas, test code, and reliably manage traffic.

That’s where ngrok comes in.

In this post, I’ll explain how we use ngrok to build Sam, our AI Sales Engineer and Call Intelligence platform.

Meet Sam

Before we get into the details, here’s a quick introduction to Sam. 

Sales engineers (SEs) help reps sell technical products by providing deep product expertise during calls. Sam does the same in real time, saving SEs time and ensuring customers get answers without delay.

Sam’s architecture

Behind the scenes, multiple services work together to make this seamless experience possible:

  • A bot joins the Zoom call to capture meeting audio and video.
  • It streams the audio to a real-time transcription service, which returns transcript chunks to a webhook registered by the bot.
  • The bot pushes these transcript chunks to a service webhook responsible for detecting questions.
  • This service batches the incoming transcript and asks an LLM to detect questions (see the sketch just after this list).
  • Detected questions are pushed into a pub/sub queue that the frontend subscribes to.
  • The frontend displays the detected questions to the seller in real time.
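
To make the question-detection step concrete, here’s a minimal sketch of what such a webhook might look like. It assumes a FastAPI service and an OpenAI-style chat completion call; the endpoint path, batch size, model, and prompt are illustrative stand-ins rather than Sam’s actual implementation:

from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI()

transcript_buffer: list[str] = []
BATCH_SIZE = 8  # illustrative: number of chunks to accumulate before asking the LLM

@app.post("/transcript")
async def receive_chunk(chunk: dict):
    """Webhook the bot calls with each transcript chunk."""
    transcript_buffer.append(chunk["text"])
    if len(transcript_buffer) < BATCH_SIZE:
        return {"status": "buffered"}

    batch = " ".join(transcript_buffer)
    transcript_buffer.clear()

    # Ask the LLM to pull customer questions out of the batched transcript.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "List any customer questions in this transcript, one per line. Reply NONE if there are none."},
            {"role": "user", "content": batch},
        ],
    )
    lines = resp.choices[0].message.content.splitlines()
    questions = [q.strip() for q in lines if q.strip() and q.strip() != "NONE"]

    # In the real system these would be published to the pub/sub queue the frontend subscribes to.
    return {"questions": questions}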

How we develop Sam locally with ngrok

This system has many moving parts, and running everything at once on a developer’s machine is impractical. ngrok is what makes local development feel like working in production.

For example, a developer working on the real-time transcription service can register an ngrok webhook with a staging bot server. This allows them to spin up a Zoom call, have a bot join, and send the audio signal to their transcription service running locally for development and testing.

Similarly, an LLM engineer working on question detection can register an ngrok webhook with the staging bot, which connects to the staging transcription service. This lets them receive staging transcriptions in their local environment so they can focus purely on iterating on the prompt to improve question detection.
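
In either case, the developer just exposes their local service with ngrok and hands the resulting URL to the staging bot. Here’s a rough sketch of that workflow using the ngrok Python SDK; the staging bot hostname and its /register-webhook endpoint are hypothetical placeholders for however your staging environment accepts webhook URLs:

import ngrok
import requests

# Open a tunnel to the service running locally on port 8000
# (reads NGROK_AUTHTOKEN from the environment).
listener = ngrok.forward(8000, authtoken_from_env=True)

# Tell the staging bot to deliver transcript chunks to this developer's tunnel.
# The registration endpoint and payload are illustrative, not a real Sam API.
requests.post(
    "https://staging-bot.example.com/register-webhook",
    json={"url": f"{listener.url()}/transcript"},
    timeout=10,
)
print(f"Staging bot will now stream to {listener.url()}")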

Safely testing LLM prompt changes with ngrok

Our engineers constantly iterate on prompts, but deploying a new prompt can be risky. If a new prompt performs worse, it creates a terrible experience for customers. To prevent that, we need a way to test a new prompt with production traffic—without impacting production itself.

Here’s how we do it:

First, LLM engineers at HeySam can register an ngrok endpoint with Sam’s production question-answering service and set a percentage of production traffic to forward to it using a feature flagging service. This pseudo-code will give you a sense of our approach:

"""
Iterate through each shadow endpoint and if the flag service tells us to send traffic to the shadow 
endpoint, send it. Catch all exceptions as we don't want to disrupt prod Q&A

question - the quesiton being answered by prod.
"""
try: 
  for endpoint in shadow_endpoints:
    if feature_flag_service.evaluate('send_shadow_traffic', endpoint.percentage) == 'on':
      asyncio.run(send_question(endpoint.url, question))
catch Exception:
  log.error("...")


Production forwards the desired percentage of real questions to shadow endpoints, allowing developers to test with live traffic on their local machine—without affecting production. The developer’s new prompt generates an answer and saves it locally alongside the original question ID.
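
For completeness, here’s a rough sketch of what that local shadow endpoint could look like. The payload shape, the prompt under test, and the in-memory store are assumptions for illustration, not Sam’s actual service:

from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI()

# question_id -> answer; a real setup might persist these to SQLite or a file.
shadow_answers: dict[str, str] = {}

NEW_PROMPT = "You are Sam, an AI sales engineer. Answer the customer's question concisely."  # the prompt being tested

@app.post("/shadow-question")
async def answer_shadow_question(payload: dict):
    """Receives a forwarded production question and answers it with the new prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": NEW_PROMPT},
            {"role": "user", "content": payload["question"]},
        ],
    )
    # Save the answer alongside the original question ID for later comparison.
    shadow_answers[payload["question_id"]] = resp.choices[0].message.content
    return {"status": "ok"}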

Finally, we compare production and local answers using a comparison service that evaluates the quality of responses.
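
That comparison can be as simple as an LLM-as-judge call. A minimal sketch, assuming an OpenAI-style client and a one-word verdict; the judge model and rubric here are illustrative, not our actual comparison service:

from openai import OpenAI

client = OpenAI()

def judge_answers(question: str, prod_answer: str, shadow_answer: str) -> str:
    """Ask an LLM which answer is better; returns 'prod', 'shadow', or 'tie'."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You compare two answers to a sales-engineering question. Reply with exactly one word: prod, shadow, or tie."},
            {"role": "user", "content": f"Question: {question}\n\nprod answer: {prod_answer}\n\nshadow answer: {shadow_answer}"},
        ],
    )
    return resp.choices[0].message.content.strip().lower()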

Conclusion

Building AI systems isn’t just about fine-tuning models; it’s about managing the complexity of distributed infrastructure. At HeySam, ngrok helps us bridge the gap between local development and production, making it seamless to prototype, test, and iterate on real-world traffic.
