# Search & RAG — concepts

Search and retrieval are the substrate underneath almost every Vectros-powered
application. Once you have written documents and records into a context, search is how
you find them again — by keyword, by meaning, or by both — and retrieval-augmented
generation (RAG) is how you put a model in front of that content so it answers questions
grounded in *your* data rather than the model's general training.

This page explains the mental model: what content search is and why it works the way it
does, what the three search modes mean for a developer choosing between them, and how the
three streaming inference surfaces — grounded answers over your whole corpus, single-
document Q&A, and stateless chat — fit together. For runnable guides see
[how-to.md](how-to.md); for the exhaustive field-and-limit list see
[reference.md](reference.md).

## One search surface over everything you store

Vectros indexes two kinds of content into one searchable surface: **documents** (text you
ingest directly or upload as a file, which Vectros extracts and chunks) and **records**
(typed, schema'd entities). A single search call queries both at once. Each result carries
a `sourceType` discriminator so you can tell a document hit from a record hit and branch
your rendering accordingly, but you do not run two searches and merge them yourself —
unified, cross-content search is the default, and you narrow to one kind only when you want
to.

What flows into the index is governed by your schema. For records, only fields you declare
searchable participate in keyword relevance; for documents, the title and the extracted
body text are indexed. This matters in two directions. First, a query that only matches a
non-searchable field returns nothing — that is the designed behavior, not a bug. Second,
and importantly for regulated workloads, **fields you mark sensitive never enter the search
index at all.** Sensitive data is excluded at index time, so it cannot be surfaced,
ranked, or leaked through a search query under any scope. (This index-time exclusion is one
of three independent protections for sensitive fields; the other two — redaction at write
time and masking on read — are covered in the security and compliance docs.)

Every searchable row also carries its ownership and tenancy by construction. A search is
always bounded to the caller's context first, and a scoped key narrowed to a particular
user, organization, or client is enforced against the content itself — not bolted on as a
filter the caller could widen. You cannot inject a tenancy or ownership filter through the
search request to see content outside your scope.

## The three search modes

Vectros runs two complementary retrieval strategies and lets you choose how much of each
you want per query. The `mode` parameter selects one of three behaviors.

**`TEXT` — keyword relevance.** Classic keyword matching: a result scores higher when it
contains the query's terms, downweighted for terms that appear everywhere, and length-
normalized so a short record does not lose to a long document just for being short. Text
mode is fast and cheap, and it is the right choice for exact-phrase lookups, known
identifiers, boolean term logic, and any case where the user is typing words they expect to
appear verbatim. It returns highlighted snippets so a UI can show *why* each result
matched.

**`SEMANTIC` — meaning-based similarity.** Instead of matching words, semantic mode matches
*meaning*. The query and the indexed content are compared as embeddings, so a search for
"exposure therapy for panic" can surface a passage about treating panic disorder with
graded exposure even when the exact words differ. Semantic mode is the right choice for
natural-language questions and conceptual recall, where the user cares about the idea, not
the vocabulary. It also returns the surrounding passage around each match (see *Grounding
context* below), which is what makes it useful for feeding a model.

**`HYBRID` — fused ranking (the default, and the recommended choice).** Hybrid runs both
strategies and fuses their rankings into one result list. The two cover each other's
weaknesses: keyword relevance is precise but brittle — it misses content phrased
differently from the query — while semantic similarity understands meaning but over-
retrieves loosely related material. Fusing them on *rank position* (not on raw scores,
which are not comparable between the two strategies) yields a result set that holds up
across exact lookups, natural-language questions, and the messy mix of the two that real
users actually type. A hit that both strategies rank highly rises to the top. For most
applications, hybrid is simply the best default; reach for `TEXT` or `SEMANTIC` only when
you have a specific reason to want one strategy alone.

A practical consequence worth internalizing: a document indexed for one strategy only will
not appear in a search that requires the other. Indexing mode is a property of the content,
chosen when you ingest it or define its schema; search mode is a property of the query.
They have to line up for a result to surface.

## Grounding context: chunks and their surrounding passage

When Vectros indexes a document for semantic search, it splits the text into small chunks
so that a match is *precise* — a query lands on the specific passage that is relevant, not
on the whole document it was buried inside. But a small chunk is often too little context
for a model to reason over confidently. So each result can carry two pieces of text: the
specific chunk that matched (`chunkText`) and a larger surrounding passage that contains it
(`contextText`). The wider passage is what you feed a model when you want it to ground an
answer without losing the thread — it is the difference between handing the model a
sentence and handing it the paragraph that sentence lives in.

This is why search and RAG are the same machinery viewed from two angles. Search returns
ranked hits with their grounding context; RAG runs that exact search, takes the grounding
context, and streams a model's answer built on top of it.

## Resilience: graceful degradation

A hybrid search runs its two strategies independently. If one becomes temporarily
unavailable, the request does not fail — it returns results from the surviving strategy and
flags the response as `degraded`, naming which leg was missing. Your results may be less
complete than a full hybrid search, but you still get an answer. If completeness matters
more than availability for a given call, you can opt into fail-closed behavior so that a
degraded search is rejected outright rather than returning a partial set. Single-mode
`TEXT` and `SEMANTIC` searches never degrade — their single strategy *is* the request.

## Freshly indexed content: when a new document becomes searchable

Indexing reports completion as a single `INDEXED` status, but the two search strategies do
not become queryable at the same instant. Understanding this avoids a surprising "I just
indexed it and search can't find it" moment right after a write.

When a document reaches `INDEXED`, indexing is **complete** — both the keyword index and the
semantic (vector) index have been written. The keyword strategy is **immediately
consistent**: a document is matchable by `TEXT` the moment it is `INDEXED`. The semantic
strategy is **eventually consistent for queries**: the vectors are durably written at
`INDEXED`, but they become *queryable* a short time later — typically a second or two, and
longer (sometimes ten seconds or more) when the system is under heavy indexing load. So for
a few seconds after `INDEXED`, a brand-new document is findable by its words but not yet by
its meaning.

Vectros softens this seam rather than letting a fresh document vanish. Because hybrid and
RAG searches include the keyword strategy, a freshly indexed document that **matches on
text** surfaces in `HYBRID` and `RAG` results at keyword speed — you do not have to wait for
the vector index to catch up. Once the vectors become queryable, the document's semantic
score fills in and its ranking settles to its steady-state position. The one case with no
fast path is a query that can *only* match a brand-new document semantically (no shared
keywords): a pure-`SEMANTIC` search, or a hybrid query whose only link to the document is
meaning rather than words, may not return it for those first few seconds.

The practical guidance: if your workflow writes a document and then immediately searches for
it, prefer a query that shares words with the document, or briefly retry — the visibility
window is short, and text-matched documents are never invisible to hybrid or RAG in the
meantime.

## Grounded answers: the three inference surfaces

Vectros exposes three ways to put a model in front of content. All three stream their
responses back as Server-Sent Events (SSE) over one connection, all three run the model
inside the Vectros perimeter, and all three share the same sequence of pre-flight checks so
that failures look the same regardless of which one you called.

**Grounded answers over your corpus (RAG).** You pass a question; Vectros runs a search
across your indexed content, emits the matched results to you as a citation event *before
any text is generated*, then streams the model's answer grounded on those results. This is
the integrated retrieve-then-generate path: you get both the citations (so your UI can show
the user what the answer was built on) and the answer, in one call. Retrieval reuses the
same search engine described above, with the same modes, filters, and scope enforcement.

**Single-document Q&A.** You pass a question and one document id; Vectros loads that one
document's full extracted text and streams an answer scoped to it. There is no retrieval
step — the whole document is the context. This is the right surface when the user is
already looking at a specific document and wants to ask about *it*, and it is bounded by a
hard input-size cap (roughly 25 pages of text). For questions that span many documents, use
RAG instead, whose retrieval step picks the relevant passages across the whole corpus.

**Stateless chat.** A plain single-turn completion: you pass a message array, Vectros
streams the model's reply. Chat does not retrieve anything and does not store any
conversation state — it is stateless by design. If you want multi-turn behavior, your
application keeps the history and re-sends the full message array on each turn. Chat is the
right surface when you are managing your own context (perhaps assembled from your own
searches) and just want a model to complete it.

### In-perimeter inference

Whichever surface you call, the model runs against AWS-hosted models from inside the
Vectros AWS account. A prompt — a chat message, a RAG query with its retrieved passages, or
a single document's text — does not cross out to a third-party model vendor's API. For
teams building on regulated data, this is the architectural payoff: content in a prompt and
content in a retrieved passage stay inside the same perimeter the rest of the platform runs
in. This in-perimeter guarantee applies to the partner data plane — the data you store and
retrieve through Vectros. (The precise terms of compliance coverage are addressed in the
security and compliance documentation, not here.)

By default, inference is served from a US region — the fail-closed default for every
tenant. A tenant that is entitled to it (via a signed global-processing waiver) can opt an
individual request into a lower-cost global region by sending `allowGlobalRegion: true`;
without that entitlement the flag is rejected rather than honored. Region choice affects
only where the model runs and the price, never which content is retrieved. See the
[reference](./reference.md#region-serving-allowglobalregion) for the field and its error.

### The model catalog is plan-gated

Each inference call resolves a model. The set of models a given key can reach is governed
by the plan tier — a lighter, fast model is available on every tier, and more capable
models are available on higher tiers. You can list the catalog the calling key can reach,
and if you request a model your plan does not include, the call is rejected with a clear
"upgrade" signal rather than silently substituting one. Omit the model and Vectros falls
back to a sensible default that every tier can reach — so a brand-new developer can make a
working call without configuring anything first.

### Why pre-flight checks, and in what order

Before any model is invoked, each inference request runs a fixed sequence of checks:
permission (does this key's scope permit inference at all), monthly plan allowance, request-
rate ceiling, and inference balance. The order is deliberate — the cheapest checks run
first, so a request that would fail on balance never pays for a search round-trip, and a
request that would fail on permission never touches your usage counters. The token cost of a
call is metered and recorded only after the stream finishes, and the terminal event of every
stream reports exactly what the call cost.

## Where to go next

- [how-to.md](how-to.md) — runnable guides: hybrid search with filters, a grounded RAG
  answer consuming the stream and citations, single-document Q&A, and a stateless chat call.
- [reference.md](reference.md) — every parameter, field, mode, limit, error code, and the
  honest "what this does not do" notes for search and inference.
- [../data-model/explanation.md](../data-model/explanation.md) — schemas, records, and the
  searchable / sensitive field declarations that govern what search sees.
- [../operations-trust/compliance.md](../operations-trust/compliance.md) — the three
  independent sensitive-field protections, tenant and context isolation, and the
  in-perimeter inference posture in full.