Most legal AI products are built on the same foundation models from a small handful of providers, accessed through commercial APIs (interfaces that let the product send your question to the AI provider’s servers and receive a response). Understanding what legal tech companies actually add, and where real differentiation lies, can help practitioners make more informed evaluations.
Same Engines, Different Interfaces
The vast majority of legal AI products run on foundation models from OpenAI, Anthropic, or Google. When practitioners interact with a legal AI assistant, they are typically querying the same underlying models that power consumer tools like ChatGPT, Claude, or Gemini, but wrapped in domain-specific interfaces.
In other words: the legal tech company builds a custom user interface, legal-specific prompts, and workflows around the model, but the underlying “brain” doing the analysis is often the same one available to consumers directly. It’s like multiple car brands using the same engine manufacturer.
These foundation models are genuinely capable. But this raises a natural question: how large is the gap between what a legal tech product delivers and what a practitioner could accomplish by querying a frontier model directly?
For simple tasks like summarizing a contract, drafting a standard letter, or answering a discrete legal question, the difference is often modest. Where legal tech vendors add value is primarily in workflow integration, security infrastructure, and retrieval systems that connect models to large, curated document collections. In most cases, the underlying reasoning capability is similar to what the model provider offers.
Retrieval Gets You Partway
Meaningful differentiation begins with how systems retrieve and structure information before it reaches the model. When a lawyer uploads a set of contracts or case files and asks a question, the system needs a way to find the relevant passages and feed them to the AI. This is where retrieval comes in.
When people talk about incorporating external context into LLM outputs, they are usually referring to retrieval-augmented generation (RAG). Here’s what that looks like in practice: when you upload a contract and ask about indemnification obligations, the system doesn’t feed the entire document to the AI. Instead, it breaks your documents into chunks and converts each into a vector embedding (a mathematical representation that captures the semantic meaning of the text, so similar concepts end up with similar numerical signatures). Your question gets the same treatment. The system then compares these representations to find the most relevant passages, and only those passages get sent to the model along with your question.
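The chunk-embed-compare pipeline can be sketched in a few lines. This is a minimal illustration, not any vendor’s implementation: the `embed` function here is a hypothetical stand-in that uses simple word counts, whereas real systems use learned dense embeddings from a model.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Hypothetical stand-in for a real embedding model: a sparse
    # bag-of-words vector. Learned embeddings would additionally capture
    # semantic similarity (e.g. "indemnify" near "hold harmless").
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks: list[str], question: str, k: int = 2) -> list[str]:
    # Rank chunks by similarity to the question; only the top-k passages
    # are sent to the model alongside the question.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]

chunks = [
    "The Supplier shall indemnify the Buyer against third-party claims.",
    "Payment is due within thirty days of invoice.",
    "This Agreement is governed by the laws of Delaware.",
]
top = retrieve(chunks, "Must the supplier indemnify the buyer?", k=1)
```

The key design point is that the model never sees the full document set, only the passages that score highest against the question.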
RAG enables source citation and reduces hallucination by grounding responses in actual documents. Leading platforms have invested heavily in custom embeddings trained specifically on legal text, so the system recognizes that “hold harmless” and “indemnification” are related concepts, even though a general-purpose model might miss the connection.
But standard RAG implementations share a core limitation: retrieval typically occurs only once per query. The system cannot recognize that initial results raise new questions, follow citation chains, or identify gaps that warrant further search.
The Shift Toward Agentic Systems
A newer architectural approach, often called agentic retrieval, addresses this limitation by introducing an orchestration layer that plans, executes, evaluates, and re-plans retrieval steps iteratively. Rather than retrieving passages and generating an answer in a single pass, agentic systems assess whether the retrieved context is sufficient, formulate follow-up queries when gaps remain, and continue searching until the question is adequately resolved.
This more closely mirrors how a human investigator works: read, reason, notice what’s missing, then search again. For complex investigative tasks, the accuracy gains appear to be significant. The improvement comes from architecture, not from using a “better” model.
The Context Problem
A separate challenge is how much information a model can actually use effectively. Modern LLMs advertise impressive context windows: Gemini 3 Pro supports roughly 1 million tokens (about 750,000 English words), GPT-5.2 offers 400,000 tokens, and Claude Opus 4.5 provides 200,000. Yet research consistently shows that performance degrades as context length increases, a phenomenon often referred to as “context rot.” Even on basic retrieval tasks, performance declines in non-uniform ways as inputs grow longer, with models particularly struggling to recall information buried in the middle of long contexts.
The implication is counterintuitive: dumping more documents into a large context window can produce worse results than carefully selecting what the model sees.
Context engineering treats the model’s input as a design problem. Well-designed systems use hierarchical summarization, write intermediate findings to external memory, and/or rely on sub-agent architectures in which specialized components analyze subsets of documents and return structured outputs.
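Hierarchical summarization, one of the techniques above, can be sketched as a map-reduce over document chunks: summarize small groups, then summarize the summaries, so the final context stays compact regardless of how many documents went in. The `toy_summarize` function is a hypothetical placeholder for what would be an LLM call in practice.

```python
def hierarchical_summary(chunks, summarize, group_size=3):
    # Repeatedly collapse groups of chunks into summaries, then
    # summarize the summaries, until a single summary remains.
    level = list(chunks)
    while len(level) > 1:
        level = [
            summarize(" ".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
    return level[0]

# Hypothetical summarizer: keep only the first sentence of each group.
# A real system would call an LLM here.
toy_summarize = lambda text: text.split(". ")[0].rstrip(".") + "."

sections = [
    f"Section {i} covers topic {i}. Further detail follows." for i in range(9)
]
summary = hierarchical_summary(sections, toy_summarize)
```

Because each level shrinks the input before the next pass, the model only ever sees short, information-dense context, which is exactly the countermeasure to the long-context degradation described above.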
Implications for Practice
For simple tasks, the gap between legal tech products and frontier models is much smaller than the hype often suggests. For complex work, however, architectural choices around retrieval and context management make a meaningful difference. Understanding whether a system relies on basic RAG, agentic retrieval, or more sophisticated context engineering can be a useful signal when evaluating tools.
Supervision remains essential regardless of architecture. And practitioners who understand what’s actually under the hood may be better positioned to deploy these tools responsibly.