This example was automatically generated from a Jupyter notebook in the RxInferExamples.jl repository.
We welcome and encourage contributions! You can help by:
- Improving this example
- Creating new examples
- Reporting issues or bugs
- Suggesting enhancements
Visit our GitHub repository to get started. Together we can make RxInfer.jl even better! 💪
Bayesian Trust Learning for LLM Routing: Teaching Routers to Learn from Their Mistakes
Or: How We Taught Our Router to Stop Worrying and Learn to Love Production Feedback
The Question That Started It All
Picture this: It's 3 AM. Your support system just routed a critical database corruption ticket to Claude Haiku (the $0.25-per-million-tokens model) because it looked "simple enough." Six hours and three escalations later, your biggest client is furious, and you're wondering why your "intelligent" router keeps making the same mistakes.
Meanwhile, across town, your competitor is sending every single ticket to GPT-4 "just to be safe," burning through $100,000 monthly for questions like "how do I reset my password?"
There has to be a better way. And there is—but it involves teaching your router something most systems never learn: humility.
The Routing Revolution (and Its Dirty Little Secret)
The LLM routing world has come a long way! OpenRouter elegantly handles 400+ models behind one API (processing over $100M in inference annually), while RouteLLM demonstrates impressive ~85% cost reductions on benchmarks. These are genuinely great tools that have solved real problems. But here's the thing: they don't really learn whether they were right.
Imagine having a waiter who keeps recommending the "chef's special ghost pepper curry" to people who can barely handle mild salsa - and never learns from all those red-faced, teary-eyed customers running for water.
Your Tickets Are Special Snowflakes (Really!)
Let me tell you a secret about those benchmark numbers everyone quotes: they were tested on public data, which is about as similar to your production tickets as a philosophy debate is to debugging Kubernetes.
Your tickets have:
- That weird error code (rule not found) your senior engineer created in 2019
- Customer complaints that somehow always spike during Mercury retrograde
- Technical terms that would make GPT-4 cry ("MethodError: no method matching make_node!")
- A mysterious correlation between ticket complexity and whether it's submitted before lunch
Static routers look at this chaos and confidently apply rules learned from "how to write a haiku" queries. No wonder they struggle.
Enter the Bayesian Router: The Router That Says "I Don't Know (Yet)"
Here's our proposition: what if your router could learn from its mistakes?
Not in the "we'll retrain the model quarterly" way, but in the "oh, I messed that up, let me remember that for next time" way. You know, like humans do (ideally).
using RxInfer
using Distributions
# The three stages of router grief:
# 1. Denial: "This ticket looks simple!" (routes to Haiku)
# 2. Anger: "Why is the customer escalating?!" (still routes to Haiku)
# 3. Acceptance: "Maybe I should learn from this..." (our Bayesian approach)
The Architecture: Three Routers Walk into a Support Queue...
We're going to create three different routing "personalities" and let them duke it out for your trust. Think of it as "The Voice" but for routing algorithms:
@model function routing_strategy(y, ticket_context)
# Meet our contestants:
# 1. The Optimist - "Everything is fine! Use the cheap model!"
θ_simple ~ simple_router(ticket_context = ticket_context)
# 2. The Pessimist - "It's all terrible! GPT-4 for everything!"
θ_complex ~ complex_router(ticket_context = ticket_context)
# 3. The Realist - "Let's be reasonable about this..."
θ_medium ~ medium_router(ticket_context = ticket_context)
# We start by trusting them equally (how naive!)
routing_strategy ~ Categorical(ones(3) ./ 3)
# But then reality hits...
θ ~ Mixture(switch = routing_strategy, inputs = [θ_simple, θ_medium, θ_complex])
# And we learn who's actually worth trusting
for i in eachindex(y)
y[i] ~ Bernoulli(θ) # 1 = "big model needed!", 0 = "small model worked"
end
end
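Before we meet the contestants in detail, here's the learning mechanism in isolation. This is a minimal sketch using plain Distributions.jl (no RxInfer): the same Beta-Bernoulli conjugate update that drives the model above, where each observed outcome nudges a Beta belief about whether this kind of ticket needs a big model.
using Distributions
# Start agnostic: "maybe complex, maybe simple"
prior = Beta(3.0, 3.0)
# Observed outcomes: 1 = big model needed, 0 = small model worked
observations = [0, 0, 1, 0, 0]
# Conjugate update: successes add to α, failures add to β
posterior = Beta(prior.α + sum(observations),
                 prior.β + length(observations) - sum(observations))
println(mean(posterior))  # updated belief that the next ticket needs a big model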
The Secret Sauce: LLMs All the Way Down
Now, you might be thinking: "Wait, you're using LLMs to decide which LLM to use? Isn't that like asking the fox to guard the henhouse?" Yes! But here's the twist: we're asking multiple foxes with different biases, then learning which fox is actually good at guarding (spoiler: it's rarely the one you'd expect).
"""
LLMPrior: Where LLMs Judge Other LLMs
It's like asking your friends which restaurant to go to,
except your friends are language models and the restaurant
is also a language model. Welcome to 2025!
"""
struct LLMPrior end
@node LLMPrior Stochastic [
(b, aliases = [belief]), # What the LLM believes
(m, aliases = [model]), # Which LLM we're asking
(c, aliases = [context]), # The ticket in question
(t, aliases = [task]) # "Should we panic and use GPT-4?"
]
Each LLM has its own personality when it comes to routing decisions. After extensive psychological profiling (read: we made educated guesses), here's what we found:
@rule LLMPrior(:b, Marginalisation) (q_m::PointMass{<:String}, q_c::PointMass{<:String}, q_t::PointMass{<:String}) = begin
model_name = q_m.point
# GPT models: The anxious overachievers
# "This could be complex! Better use GPT-4! What if it's not complex?
# Still use GPT-4! WHAT IF WE'RE WRONG?!"
if model_name in ["gpt-5", "gpt-4.1"]
return Beta(0.20, 0.05) # Almost always says "use complex model"
# Claude models: The confident minimalists
# "Pfft, this is easy. Haiku can handle it. Trust me, I'm Claude."
elseif model_name in ["claude-sonnet", "claude-opus"]
return Beta(3.0, 9.0) # Usually says "use simple model"
# Claude Haiku: The wild card
# "Maybe complex? Maybe simple? Life is uncertain, embrace the chaos!"
elseif model_name in ["claude-haiku"]
return Beta(3.0, 3.0) # 50/50 with high variance
# GPT-4o-mini: The pessimistic realist
# "It's probably fine with a simple model... but I've been hurt before."
elseif model_name in ["gpt-4o-mini"]
return Beta(1.0, 5.0) # Leans toward simple but cautious
else
# Unknown model: fall back to an uninformative prior
return Beta(1.0, 1.0)
end
end
We obviously cheat here; we just don't want to burn tokens on CI every time we run the tests. In production, you'd actually call an LLM (we suggest PromptingTools.jl if you stick with Julia):
import PromptingTools as PT
using Distributions
# Define what we want from the LLM
struct BetaParams
alpha::Float64 # α parameter (how much we believe "complex model needed")
beta::Float64 # β parameter (how much we believe "simple model sufficient")
end
@rule LLMPrior(:b, Marginalisation) (q_m::PointMass{<:String}, q_c::PointMass{<:String}, q_t::PointMass{<:String}) = begin
context = q_c.point
model = q_m.point
# Ask the LLM for its honest opinion (in Beta distribution form)
response = PT.aiextract(
"""You're a routing expert. Given this ticket:
$context
Return Beta distribution parameters for P(needs complex model).
Higher alpha = more complex, Higher beta = more simple.""";
return_type = BetaParams,
model = model,
api_kwargs = (; temperature = 0.0) # We want consistency, not creativity
)
# Sanitize because LLMs sometimes return nonsense
α = response.content.alpha > 0 ? response.content.alpha : 1.0
β = response.content.beta > 0 ? response.content.beta : 1.0
return Beta(α, β)
end
Building the Routing Dream Team
Now let's assemble our routers. Each one consults different LLMs and blends their opinions:
@model function complex_router(θ, ticket_context)
# The premium committee: Only the finest LLMs
θ_opus ~ LLMPrior(m = "claude-opus", c = ticket_context, t = "assess_complexity")
θ_gpt ~ LLMPrior(m = "gpt-5", c = ticket_context, t = "assess_complexity")
# We trust GPT-5 more here (0.8 vs 0.2), even if Opus sounds fancier
switch ~ Categorical([0.2, 0.8])
θ ~ Mixture(switch = switch, inputs = [θ_opus, θ_gpt])
end
@model function medium_router(θ, ticket_context)
# The balanced committee: Not too hot, not too cold
θ_claude ~ LLMPrior(m = "claude-sonnet", c = ticket_context, t = "assess_complexity")
θ_gpt ~ LLMPrior(m = "gpt-4.1", c = ticket_context, t = "assess_complexity")
# Sonnet gets more weight because it's more poetic about its decisions
switch ~ Categorical([0.7, 0.3])
θ ~ Mixture(switch = switch, inputs = [θ_claude, θ_gpt])
end
@model function simple_router(θ, ticket_context)
# The budget committee: "Have you considered... not spending money?"
θ_claude_haiku ~ LLMPrior(m = "claude-haiku", c = ticket_context, t = "assess_complexity")
θ_gpt_mini ~ LLMPrior(m = "gpt-4o-mini", c = ticket_context, t = "assess_complexity")
# Slight preference for Haiku because it's more zen about everything
switch ~ Categorical([0.6, 0.4])
θ ~ Mixture(switch = switch, inputs = [θ_claude_haiku, θ_gpt_mini])
end
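Before any data arrives, what do these committees actually believe? Here's a quick back-of-the-envelope using the hard-coded Beta personalities from the CI rule above (illustrative numbers, not live LLM calls): the mean of a mixture is just the weight-averaged mean of its components.
# Prior mean of P(needs complex model) per committee:
# mixture mean = Σ weightᵢ * mean(componentᵢ)
committee_priors = Dict(
    "simple"  => 0.6 * mean(Beta(3.0, 3.0)) + 0.4 * mean(Beta(1.0, 5.0)),   # ≈ 0.37
    "medium"  => 0.7 * mean(Beta(3.0, 9.0)) + 0.3 * mean(Beta(0.20, 0.05)), # ≈ 0.42
    "complex" => 0.2 * mean(Beta(3.0, 9.0)) + 0.8 * mean(Beta(0.20, 0.05)), # ≈ 0.69
)
So even before any feedback, the complex committee expects roughly 7 in 10 tickets to need the expensive model, while the other two hover around "mostly simple."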
The Moment of Truth: Learning from Reality
Let's see what happens when we feed our system some real outcomes. Imagine a customer with a money transfer issue:
ticket = "I have been trying to transfer money to my other bank account for the last 10 days but it keeps failing. Can you help me?"
# The harsh reality of what happened when we routed this:
# 0 = Ticket was resolved by the simple model
# 1 = Ticket needed the complex model (the simple one wasn't enough)
outcomes = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, # 11 simple model worked
1.0, # 1 complex model worked
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, # 8 simple model worked
1.0, 1.0] # 2 complex model worked
# Let the Bayesian magic happen
result_joint = infer(
model = routing_strategy(ticket_context=ticket),
data = (y = outcomes, ),
returnvars = KeepLast(),
addons = AddonLogScale(),
postprocess = UnpackMarginalPostprocess(),
)
# The verdict is in!
println("Trust scores after learning from reality:")
println("Simple Router: ", mean(result_joint.posteriors[:routing_strategy].p[1]))
println("Medium Router: ", mean(result_joint.posteriors[:routing_strategy].p[2]))
println("Complex Router: ", mean(result_joint.posteriors[:routing_strategy].p[3]))
Trust scores after learning from reality:
Simple Router: 0.3420062143831011
Medium Router: 0.4792506103356868
Complex Router: 0.17874317528121217
The Results Are In: What Did We Learn?
Understanding What We Measured
First, let's be crystal clear about what our data means:
0 = Ticket was successfully resolved with a SIMPLE model (for example, Haiku worked!)
1 = Ticket required a COMPLEX model (for example, GPT-5 was needed!)
Our data: 19 zeros, 3 ones ≈ 86% of similar tickets were solved by cheap models!
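As a quick sanity check, you can recompute that share straight from the outcomes vector defined earlier:
simple_success_rate = count(==(0.0), outcomes) / length(outcomes)
println(round(simple_success_rate * 100, digits = 1), "% solved by the simple model")  # ≈ 86.4%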
The Trust Report Card
After processing our banking tickets, here's how much we trust each router:
result_joint.posteriors[:routing_strategy]
Distributions.Categorical{Float64, Vector{Float64}}(support=Base.OneTo(3),
p=[0.3420062143831011, 0.4792506103356868, 0.17874317528121217])
Let's translate that from "statistical gibberish" to "executive presentation":
using Plots
using Printf
# Backend (GR is default; feel free to switch to plotlyjs(), pyplot(), etc.)
gr()
# Extract trust scores - order follows the model: [simple, medium, complex]
trust_scores = result_joint.posteriors[:routing_strategy].p
# Prepare data
labels = ["Simple Router\n(The Optimist)",
"Medium Router\n(The Realist)",
"Complex Router\n(The Pessimist)"]
x = 1:3
y = trust_scores .* 100
colors = [:darkgreen, :lightblue, :lightcoral]
# Bar plot
bar(
x, y;
bar_width = 0.6,
fillcolor = colors,
linecolor = :black, # outline like strokecolor
linewidth = 2,
xticks = (x, labels),
ylim = (0, 60),
ylabel = "Trust Level (%)",
title = "Router Trust Scores: Who Saw It Coming?",
label = "",  # suppress the spurious "y1" legend entry for the bars
legend = :topright,
size = (800, 500)
)
# Reference line at 33.3% with legend entry
hline!([33.3]; color = :gray, linestyle = :dash, linewidth = 2, label = "Initial Trust (Equal)")
# Value labels above bars
for (i, yi) in enumerate(y)
annotate!(i, yi + 2, text(@sprintf("%.1f%%", yi), 12, :center, :bottom))
end
plot!()
The Verdict Makes Perfect Sense Now:
- Complex Router (17.9% trust): "I told you to use GPT-4... and I was wrong 86% of the time!" 💸
- Started at 33%, crashed to 17.9%. The pessimist who always escalates got schooled by reality.
- Medium Router (47.9% trust): "Sometimes you need complexity, mostly you don't" ⚖️
- Up from 33%. The balanced approach proved wise.
- Simple Router (34.2% trust): "Huh, about where I started."
- Barely above the initial 33%; the data neither vindicated nor condemned the optimist.
Diving Deeper: What Each Router Learned
Complex Router's Reality Check:
println(result_joint.posteriors[:θ_complex])
BayesBase.MixtureDistribution{Distributions.Beta{Float64}, Float64}(Distributions.Beta{Float64}[Distributions.Beta{Float64}(α=6.0, β=28.0), Distributions.Beta{Float64}(α=3.2, β=19.05)], [0.7379918929276147, 0.2620081070723853])
After seeing the data, our trust in the complex router was shattered. Within its committee, the weight also swung from GPT-5 (prior 0.8) to Opus (posterior ≈ 0.74).
println(result_joint.posteriors[:θ_simple])
BayesBase.MixtureDistribution{Distributions.Beta{Float64}, Float64}(Distributions.Beta{Float64}[Distributions.Beta{Float64}(α=6.0, β=22.0), Distributions.Beta{Float64}(α=4.0, β=24.0)], [0.26239067055393556, 0.7376093294460644])
The simple committee shifted its weight toward GPT-4o-mini (from a prior of 0.4 to a posterior of ≈ 0.74).
println(result_joint.posteriors[:θ_medium])
BayesBase.MixtureDistribution{Distributions.Beta{Float64}, Float64}(Distributions.Beta{Float64}[Distributions.Beta{Float64}(α=6.0, β=28.0), Distributions.Beta{Float64}(α=3.2, β=19.05)], [0.9633551632505475, 0.03664483674945258])
The medium committee doubled down on Sonnet (weight ≈ 0.96, up from a prior of 0.7), and Sonnet in fact turned out to be right most of the time.
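If you'd rather not squint at those printouts, a small loop pulls the committee weights out of each posterior mixture (this relies on the weights field of BayesBase.MixtureDistribution, the same field we use for sampling further below):
for (name, sym) in [("simple", :θ_simple), ("medium", :θ_medium), ("complex", :θ_complex)]
    w = result_joint.posteriors[sym].weights
    println(rpad(name, 8), " committee weights: ", round.(w, digits = 3))
end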
The "Aha!" Moments
Discovery #1: The 86/14 Rule
simple_share = result_joint.posteriors[:routing_strategy].p[1]
medium_share = result_joint.posteriors[:routing_strategy].p[2]
complex_share = result_joint.posteriors[:routing_strategy].p[3];
# Normalize defensively
s = simple_share + medium_share + complex_share
simple_share, medium_share, complex_share = simple_share/s, medium_share/s, complex_share/s
# --- Model costs (edit as needed) ---
simple_cost = 0.03 # e.g., Haiku per request (placeholder)
medium_cost = 0.10 # whatever you pay for models within medium router
complex_cost = 3.00 # e.g., GPT-5 per request (placeholder)
# --- Cost per 100 tickets ---
blind_cost_per100 = 100 * complex_cost
perfect_per100 = 100 * (simple_share * simple_cost +
medium_share * medium_cost +
complex_share * complex_cost)
# Escalate policy: try Simple → Medium → Complex
escalate_per100 = 100 * (simple_cost +
(1 - simple_share) * medium_cost +
complex_share * complex_cost)
savings_perfect_pct = 100 * (1 - perfect_per100 / blind_cost_per100)
savings_escalate_pct = 100 * (1 - escalate_per100 / blind_cost_per100)
println("🎯 Reality-informed routing mix:")
println("├─ Simple: $(round(simple_share * 100, digits=1))%")
println("├─ Medium: $(round(medium_share * 100, digits=1))%")
println("└─ Complex: $(round(complex_share * 100, digits=1))%")
println("\n💰 Cost Impact (per 100 tickets):")
println("├─ Blind Complex (send all to Complex): $(round(blind_cost_per100, digits=2))")
println("├─ Smart routing (perfect): $(round(perfect_per100, digits=2)) → savings $(round(savings_perfect_pct, digits=1))%")
println("└─ Smart routing (escalate S→M→C): $(round(escalate_per100, digits=2)) → savings $(round(savings_escalate_pct, digits=1))%")
🎯 Reality-informed routing mix:
├─ Simple: 34.2%
├─ Medium: 47.9%
└─ Complex: 17.9%
💰 Cost Impact (per 100 tickets):
├─ Blind Complex (send all to Complex): 300.0
├─ Smart routing (perfect): 59.44 → savings 80.2%
└─ Smart routing (escalate S→M→C): 63.2 → savings 78.9%
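One more back-of-the-envelope question this enables: how cheap would the complex model have to get before "blind complex" beats the escalation policy? Setting the two per-100-ticket costs equal and solving for the complex cost (a sketch reusing the variables above):
# blind = escalate  ⇔  c = simple_cost + (1 - simple_share) * medium_cost + complex_share * c
breakeven = (simple_cost + (1 - simple_share) * medium_cost) / (1 - complex_share)
println("Blind-complex only wins if the complex model costs under ",
        round(breakeven, digits = 3), " per request")  # ≈ 0.117 with these placeholder costs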
# Bayesian routing: sample from the learned posteriors to make decisions
# (no continuous learning here, yet). Here's how that could look:
# Helper to sample from MixtureDistribution (not natively supported)
sample_mixture(m::MixtureDistribution) = rand(m.components[rand(Categorical(m.weights))])
function route(posteriors, ticket_context)
# here your logic to cluster tickets into a category
# Sample which router to use
router_idx = rand(posteriors[:routing_strategy])
# Get complexity from selected router
# Order must match the Mixture in routing_strategy: [simple, medium, complex]
router_posteriors = [posteriors[:θ_simple], posteriors[:θ_medium], posteriors[:θ_complex]]
complexity = sample_mixture(router_posteriors[router_idx])
# Decision based on sampled complexity
model = complexity > 0.5 ? "complex" : "simple"
return (model=model, complexity=complexity, router=router_idx)
end
# Use it
ticket = "I have been trying to transfer money to my other bank account for the last 10 days but it keeps failing. Can you help me?"
decision = route(result_joint.posteriors, ticket)
println("Route to $(decision.model)")
Route to simple
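Because route samples from the posteriors instead of always taking the argmax, repeated calls naturally explore: the trusted routers win most draws, but the underdogs still get the occasional chance to prove themselves. A quick tally illustrates this:
# Tally 1,000 sampled decisions to see the exploration behavior
counts = Dict("simple" => 0, "complex" => 0)
for _ in 1:1000
    d = route(result_joint.posteriors, ticket)
    counts[d.model] += 1
end
println(counts)  # mostly "simple" for this ticket, with occasional "complex" exploration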
These results brought to you by Bayes' Theorem: Teaching expensive AI models humility since 1763.
P.S. - The Complex Router is now in therapy, learning to let go of its need to overcomplicate everything. The Medium Router has been promoted to Chief Optimization Officer.
This example was executed in a clean, isolated environment. Below are the exact package versions used:
For reproducibility:
- Use the same package versions when running locally
- Report any issues with package compatibility
Status `~/work/RxInferExamples.jl/RxInferExamples.jl/docs/src/categories/experimental_examples/bayesian_trust_learning/Project.toml`
[31c24e10] Distributions v0.25.122
[91a5bcdd] Plots v1.41.1
[86711068] RxInfer v4.6.0