By default, Charcoal uses a general-purpose model to power the search agent, and this works well out of the box. For teams that want to go further, we can train a small, specialized model on your corpus using reinforcement learning. The result is roughly 10x faster and cheaper than the default, and more accurate. The underlying search harness and retrieval loop stay the same; only the model driving them changes. RL training is a managed process. Contact us to get started.

Why

Agentic search breaks down into three problems:
  1. Query planning. Given an input query, figure out what to search for and in what order.
  2. Iterative reasoning. Read the documents that come back, decide what to query next, and know when to stop.
  3. Context management. Keep the context window useful throughout steps 1 and 2 without it filling up with noise.
Doing all three well, fast, and cheaply requires heuristics that are deeply specific to your dataset and domain. There are too many of them to encode in a prompt. But they can be learned and encoded directly into model weights via RL.
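The three problems above fit together in a single loop: plan queries, read what comes back, and prune the context before the next round. The sketch below illustrates that loop with toy keyword matching standing in for the model and the retrieval API; every function name here is illustrative, not Charcoal's actual interface.

```python
# Self-contained sketch of the agentic search loop: query planning,
# iterative retrieval, and context management. Keyword overlap stands in
# for model judgment; all names are illustrative, not Charcoal's API.

def search(index, query, top_k=3):
    """Toy retrieval: rank documents by words shared with the query."""
    scored = [(len(set(query.split()) & set(doc.split())), doc) for doc in index]
    return [doc for score, doc in sorted(scored, reverse=True)[:top_k] if score > 0]

def agentic_search(question, index, max_steps=3, context_budget=4):
    context = []
    queries = [question]                      # 1. initial query plan
    for _ in range(max_steps):
        for q in queries:                     # 2. iterative retrieval
            for doc in search(index, q):
                if doc not in context:
                    context.append(doc)
        # 3. context management: keep only the most question-relevant docs
        context.sort(key=lambda d: len(set(question.split()) & set(d.split())),
                     reverse=True)
        context = context[:context_budget]
        # follow-up queries seeded from the best document read so far
        queries = [question + " " + doc.split()[0] for doc in context[:1]]
        if len(context) >= context_budget:    # crude stopping rule
            break
    return context
```

A real agent replaces the keyword heuristics with model calls at each decision point; those per-decision heuristics are exactly what RL training moves into the model's weights.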

See how the same search plays out with a general-purpose model vs. a specialized one.

[Interactive demo: side-by-side search traces from GPT-5 and the specialized Charcoal model, with running token, cost, and latency counters. Illustrative example, not a benchmark.]

How it works

Task generation

We generate training scenarios from your data and existing search traces. Each scenario is a search task grounded in your actual corpus.
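One simple way to ground a scenario in a corpus is to sample a document, mask a salient fact, and make recovering it the search task. The sketch below is a hypothetical illustration of that idea only; Charcoal's actual generation pipeline is not shown here.

```python
# Hypothetical sketch of grounding a training scenario in a corpus:
# sample a document, mask a salient token, and make finding it the task.
# Illustrative only, not Charcoal's generation pipeline.

import random

def make_scenario(corpus, seed=0):
    rng = random.Random(seed)
    doc_id, text = rng.choice(sorted(corpus.items()))
    words = text.split()
    answer = max(words, key=len)              # crude proxy for a "salient" fact
    task = text.replace(answer, "____", 1)    # the gap the agent must fill
    return {"task": f"Find the word that completes: {task}",
            "answer": answer,
            "grounding_doc": doc_id}
```

Because the answer comes from a real document, every scenario is automatically checkable, which is what makes it usable as an RL training signal.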

Training

We train using CISPO. For each scenario, the model generates several search trajectories in parallel. A reward function scores each one, and the model updates toward higher-scoring approaches.

Reward functions

A judge model scores each trajectory relative to its group across multiple dimensions:
  • Relevance: are the findings specific and useful?
  • Search strategy: does the agent construct effective query plans and build on previous searches with diverse, complementary queries?
  • Efficiency: how directly does the agent reach its final results, and does it avoid redundant search paths?
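The per-dimension scores above have to be combined into a single group-relative reward per trajectory. A minimal sketch of one way to do that; the dimension weights here are made up for illustration and are not Charcoal's actual values.

```python
# Hypothetical sketch of turning per-dimension judge scores into a
# group-relative reward. The weights are illustrative assumptions.

def trajectory_rewards(judge_scores, weights=None):
    """judge_scores: dicts with 'relevance', 'strategy', 'efficiency' in [0, 1]."""
    weights = weights or {"relevance": 0.5, "strategy": 0.3, "efficiency": 0.2}
    # Weighted sum per trajectory.
    raw = [sum(weights[k] * s[k] for k in weights) for s in judge_scores]
    # Score each trajectory relative to its group, as described above.
    mean = sum(raw) / len(raw)
    return [r - mean for r in raw]
```

Centering on the group mean means the reward measures "better than the other attempts at this same task," which keeps the signal meaningful even when tasks vary widely in difficulty.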

Base models

We train on open-weight models such as Qwen3-14B, Qwen3-30B, and gpt-oss-20B. The choice of base model depends on your latency and accuracy requirements. All models are served on dedicated GPU infrastructure.

Continuous improvement

Once deployed, the model continues to improve:
  • Eval generation from production queries: we continuously generate new evaluation tasks from real search traffic hitting your namespace, ensuring the model is tested against the queries your users actually run.
  • Ongoing checkpoint creation: new model checkpoints are trained and evaluated on a regular cadence. When a checkpoint outperforms the current deployment on held-out evals, it’s promoted to serve your namespace.
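The promotion rule above reduces to a simple comparison on held-out evals. A sketch under stated assumptions: the promotion margin and the model-as-callable interface are illustrative, not Charcoal's actual deployment logic.

```python
# Sketch of the checkpoint promotion rule: a candidate replaces the
# deployed model only if it wins on held-out evals by some margin.
# The margin and interface are illustrative assumptions.

def maybe_promote(current, candidate, evals, margin=0.01):
    """Each model is a callable mapping an eval task to a score in [0, 1]."""
    cur_score = sum(current(t) for t in evals) / len(evals)
    cand_score = sum(candidate(t) for t in evals) / len(evals)
    return candidate if cand_score > cur_score + margin else current
```

Requiring a margin (rather than any improvement at all) guards against promoting checkpoints whose apparent gains are within eval noise.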

Getting started

RL training is a managed engagement. We work with you to:
  1. Snapshot your corpus and generate training scenarios
  2. Run training and evaluate checkpoints against held-out tasks
  3. Deploy the trained model to serve your namespace
Contact us to discuss training a model for your corpus.