# Search & RAG — reference

Exhaustive reference for content search and the three streaming inference surfaces:
every parameter, field, mode, limit, error code, the stream event vocabulary, and an honest
"Notes & limits" section stating what each feature does not do.

For request/response wire detail at the endpoint level, see the generated **API reference**
(the OpenAPI / Scalar spec). This page documents the SDK-level surface and the behavior
that the spec alone does not capture.

> **Version note.** The API spec is at `0.29.9`. Nothing on the search side of this page is
> 0.26-only, so it works on any current client. The optional inference flag
> `allowGlobalRegion` (region serving, below) is a 0.27 addition. Where the SDK method name
> differs from a raw field name, the SDK name is given.

---

## Content search — `client.search.content(req)`

Unified search across documents and records. Requires the `search:r` scope.

### Request parameters

| Parameter | Type | Default | Notes |
|---|---|---|---|
| `query` | string | — (required) | Natural-language or keyword query. |
| `mode` | `'TEXT' \| 'SEMANTIC' \| 'HYBRID'` | `HYBRID` | Keyword relevance / meaning-based similarity / fused ranking. |
| `limit` | integer | 20 | Max results. Valid range **1–100**; out-of-range is rejected with `400`. |
| `offset` | integer | 0 | Results to skip — paging (see *Paging* below). |
| `contentTypes` | `('documents' \| 'records')[]` | both | Narrow to one content type. Omitted, empty, or both values ⇒ unified. |
| `typeName` | string | — | Restrict record hits to one schema type. Implicitly narrows to records (skips documents) unless `contentTypes` includes documents. |
| `folderId` | string (uuid) | — | Restrict to content in this exact folder. |
| `rootFolderId` | string (uuid) | — | Restrict to this folder and all descendants. Use instead of `folderId`. |
| `userId` | string (uuid) | — | Ownership filter — content owned by this user. |
| `orgId` | string (uuid) | — | Ownership filter — content of this organization. |
| `clientId` | string (uuid) | — | Ownership filter — content of this client. |
| `filters` | object | — | Field-level metadata filters; AND-combined across keys. See *Filter grammar*. |
| `createdAfter` | string (ISO-8601) | — | Content created at/after this UTC timestamp. |
| `createdBefore` | string (ISO-8601) | — | Content created at/before this UTC timestamp. |
| `uniqueDocuments` | boolean | false | When true, at most one hit per source document. |
| `minSimilarity` | number 0.0–1.0 | — | Minimum semantic similarity; hits below are excluded (semantic / hybrid). |
| `minTextRelevance` | number 0.0–1.0 | — | Relative keyword-relevance floor, as a fraction of the top hit's score (e.g. 0.5 keeps hits at least half as relevant as the best). Applies to `TEXT` / `HYBRID`. Omit or ≤0 keeps all. |
| `textMode` | `'OR' \| 'AND' \| 'PHRASE' \| 'COMPLEX'` | `OR` | Keyword sub-mode (see *Keyword sub-modes*). Applies to `TEXT` / `HYBRID`. |
| `slop` | integer ≥0 | 0 | Phrase-match slop: intervening positions tolerated between terms when `textMode='PHRASE'`. Ignored otherwise. |
| `requireComplete` | boolean | false | Fail-closed override: return `503` instead of degraded partial results when a leg is unavailable. |

### Keyword sub-modes (`textMode`)

For `TEXT` and `HYBRID` searches, `textMode` controls how query terms combine in the
keyword leg:

- **`OR`** (default) — match any term; broadest recall.
- **`AND`** — require all terms; higher precision.
- **`PHRASE`** — require terms as a contiguous sequence (tunable with `slop`).
- **`COMPLEX`** — full keyword query syntax (boolean operators, field-scoped clauses, range
  filters). Use only when you need expression-level control; the query is parsed as a
  structured expression rather than a bag of terms.

### Filter grammar (`filters`)

Each top-level key is a metadata field declared filterable on the schema (or a built-in
document field). Top-level keys are AND-combined. Each value is one of:

- **Scalar** (string / number / boolean) — equality, e.g. `{ "status": "open" }`.
- **Array of scalars** — OR-set (match any), e.g. `{ "tag": ["red", "blue"] }`.
- **Operator map** — closed set of operators:
  - Scalar operand: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`. Operators in one map are
    AND-combined, so `{ "price": { "$gte": 100, "$lte": 500 } }` is a closed range.
  - Array operand: `$in`, `$nin`. Cannot be combined with other operators in the same map.

Numbers and booleans match typed (the field must have been ingested under a typed schema).
Dates may be ISO-8601 strings or epoch millis. **Filter keys are validated**
(`^[A-Za-z_][A-Za-z0-9_-]*$`); unknown operators, non-scalar operands, malformed keys, and
any attempt to filter on a reserved tenancy/ownership key are rejected with `400`. You cannot
widen your access through the filter map; ownership scope is enforced separately.
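
The grammar above can be mirrored in a client-side pre-check so that malformed filters are caught before a request is sent. The sketch below is a hedged illustration only: `validateFilters` and its message strings are hypothetical names, not SDK exports, and the server remains the authority (for example, this sketch does not know which keys are reserved or filterable).

```ts
// Hypothetical client-side pre-check for the documented filter grammar.
// Not part of the SDK; the server still performs the real validation.
type FilterScalar = string | number | boolean;

const FILTER_KEY_PATTERN = /^[A-Za-z_][A-Za-z0-9_-]*$/;
const SCALAR_OPERATORS = ["$eq", "$ne", "$gt", "$gte", "$lt", "$lte"];
const ARRAY_OPERATORS = ["$in", "$nin"];

function isFilterScalar(v: unknown): v is FilterScalar {
  return typeof v === "string" || typeof v === "number" || typeof v === "boolean";
}

// Returns a list of problems; an empty list means the map looks well-formed.
function validateFilters(filters: { [key: string]: unknown }): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(filters)) {
    const value = filters[key];
    if (!FILTER_KEY_PATTERN.test(key)) {
      problems.push(`malformed key: ${key}`);
      continue;
    }
    if (isFilterScalar(value)) continue; // scalar: equality match
    if (Array.isArray(value)) {
      // array of scalars: OR-set
      if (!value.every(isFilterScalar)) problems.push(`${key}: array must hold scalars`);
      continue;
    }
    if (value === null || typeof value !== "object") {
      problems.push(`${key}: unsupported value`);
      continue;
    }
    // operator map
    const operators = Object.keys(value);
    for (const op of operators) {
      const operand = (value as { [op: string]: unknown })[op];
      if (SCALAR_OPERATORS.indexOf(op) !== -1) {
        if (!isFilterScalar(operand)) problems.push(`${key}.${op}: operand must be a scalar`);
      } else if (ARRAY_OPERATORS.indexOf(op) !== -1) {
        if (!Array.isArray(operand) || !operand.every(isFilterScalar)) {
          problems.push(`${key}.${op}: operand must be an array of scalars`);
        }
        if (operators.length > 1) {
          problems.push(`${key}.${op}: cannot be combined with other operators`);
        }
      } else {
        problems.push(`${key}: unknown operator ${op}`);
      }
    }
  }
  return problems;
}
```

Returning a list of problems rather than throwing lets a caller report every issue in one pass; a `400` from the server should still be handled, since this check cannot see the schema.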

### Response shape

`search.content` returns a flat object (it is **not** wrapped in the `{ data, nextCursor }`
cursor envelope used by list/lookup endpoints — see *Paging*):

| Field | Type | Notes |
|---|---|---|
| `results` | `SearchResult[]` | Matched chunks, ranked (highest `score` first). May be empty. |
| `totalResults` | integer | Total matches found. `0` for a miss. |
| `searchTimeMs` | integer | Server-side execution time. Reported even on an empty result. |
| `degraded` | boolean | True when one leg was unavailable and results came from the survivor only. |
| `degradedLegs` | `string[]` | Which legs were unavailable: `"text"` (keyword) and/or `"vector"` (semantic). Empty when not degraded. |

Each `SearchResult`:

| Field | Type | Notes |
|---|---|---|
| `documentId` | string (uuid) | Source entity id. Use with `getDocument` / `getRecord`. (This is the source id, not an internal index id.) |
| `sourceType` | `'PartnerDocument' \| 'GenericRecord'` | Document vs. record discriminator — the two literal strings the API returns; branch on it when rendering mixed results. |
| `score` | number | Fused relevance score; primary sort key, higher is more relevant. |
| `textScore` | number | Keyword (relevance) sub-score. Non-zero in `HYBRID` when the keyword leg contributed. **Always 0 in `TEXT`-only mode** — that path ranks by result-array order, not by an exposed per-hit score. |
| `semanticScore` | number | Semantic similarity sub-score. Non-zero in `SEMANTIC` / `HYBRID`. |
| `chunkText` | string | The specific chunk that matched. Feed this (or `contextText`) to a model. |
| `contextText` | string | The wider surrounding passage containing the chunk — better grounding context. |
| `snippet` | string | Highlighted excerpt with query terms emphasized, for display. May be null for a semantic-only hit (use `chunkText`). |
| `metadata` | object | Metadata supplied at ingest (title, folderId, custom fields). |
| `createdAt` | string (ISO-8601) | Source content creation time. |
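
As a rough sketch of how a consumer might render mixed results from the fields above: the `HitView` type is a trimmed local mirror of `SearchResult`, and `describeHit` is a hypothetical display helper, neither taken from the SDK.

```ts
// Local, trimmed-down mirror of the documented SearchResult fields (assumption:
// only the fields needed for display are declared here).
interface HitView {
  documentId: string;
  sourceType: "PartnerDocument" | "GenericRecord";
  score: number;
  chunkText: string;
  snippet?: string | null;
}

// Hypothetical helper: one display line per hit.
function describeHit(hit: HitView): string {
  // Branch on the discriminator so documents and records can be labeled differently.
  const kind = hit.sourceType === "PartnerDocument" ? "document" : "record";
  // `snippet` may be null for a semantic-only hit, so fall back to `chunkText`.
  const excerpt = hit.snippet ?? hit.chunkText;
  return `[${kind}] ${hit.documentId} (${hit.score.toFixed(2)}): ${excerpt}`;
}
```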

### Paging

`search.content` has **no `nextCursor`** — it is not enveloped. Page by combining `limit`
with `offset`:

```ts
const page1 = await client.search.content({ query, mode, limit: 20, offset: 0 });
const page2 = await client.search.content({ query, mode, limit: 20, offset: 20 });
```

`limit` may not exceed 100 (larger values are rejected with `400`); consecutive pages are disjoint. This differs from list/lookup
endpoints, which return `{ data, nextCursor }` and are drained by feeding `nextCursor` back
as `startFrom`. For pulling recent content deterministically, prefer a `createdAfter`
window over deep `offset` paging.
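
The offset arithmetic can be isolated in a small step function. This is a minimal sketch under stated assumptions: `PagedSearchRequest` and `nextPageRequest` are illustrative names, not SDK exports, and the stop rule (a page shorter than `limit` is treated as the last page) is an assumption rather than documented behavior.

```ts
// Local request shape for illustration; the real request accepts more fields.
interface PagedSearchRequest {
  query: string;
  limit: number;
  offset: number;
}

// Given the request just sent and the number of results it returned, produce
// the request for the following page, or null when paging should stop.
function nextPageRequest(
  req: PagedSearchRequest,
  returnedCount: number,
): PagedSearchRequest | null {
  // Assumption: a short (or empty) page means there is nothing further to fetch.
  if (returnedCount < req.limit) return null;
  return { ...req, offset: req.offset + req.limit };
}
```

A caller would send the request, pass `results.length` to `nextPageRequest`, and loop until it returns `null`. When the total is an exact multiple of `limit`, this costs one extra request that comes back empty.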

### Notes & limits — search

- **Index/search mode must line up.** Content indexed for one strategy only will not appear
  in a search requiring the other. Indexing mode is a property of the content (set at
  ingest / on the schema); search mode is a property of the query.
- **Only searchable fields participate in keyword relevance.** A query matching only a
  non-searchable field returns nothing — by design.
- **Sensitive fields never enter the index.** They cannot be searched, ranked, or surfaced
  under any scope (index-time exclusion).
- **`textScore` is 0 in `TEXT`-only mode** — that path returns highlighted snippets but does
  not expose a per-hit keyword score; rank order is the signal.
- **`minTextRelevance` applies only to `TEXT` / `HYBRID`**; `minSimilarity` applies only to
  `SEMANTIC` / `HYBRID`.
- **A miss is a `200`, not a `404`.** Empty `results`, `totalResults: 0`.
- **Degradation is silent unless you check.** Inspect `degraded` / `degradedLegs`, or set
  `requireComplete: true` to turn a degraded leg into a `503`.
- **`INDEXED` ≠ instantly queryable by every mode.** `INDEXED` means indexing is *complete*.
  The keyword index is immediately consistent — a document is matchable by `TEXT` the moment
  it is `INDEXED`. The semantic (vector) index is **eventually consistent for queries**: the
  vectors are durable at `INDEXED` but become queryable a short time later (≈1–2 s, longer
  under heavy indexing load). A freshly indexed document that **matches on text** still
  surfaces in `HYBRID` search and RAG retrieval at keyword speed (its semantic score fills in once the
  vectors catch up); a query that can match a brand-new document *only* semantically
  (pure-`SEMANTIC`, or a hybrid query sharing no words with it) may not return it for those
  first few seconds. If you search immediately after a write, share words with the document
  or briefly retry. See [explanation.md](explanation.md) § "Freshly indexed content".
- **Cross-content search returns mixed types.** Always branch on `sourceType` when rendering
  unified results.

---

## Inference surfaces — `client.inference.*`

Three streaming surfaces, all returning an async iterable of SSE events, all requiring the
`inference:r` scope, all sharing one pre-flight check sequence. Inference runs against
AWS-hosted models inside the Vectros perimeter (in-perimeter for the partner data plane).

### Streaming model (shared)

Each surface returns an async iterable; iterate it to consume events in arrival order. Every
event carries an `event` field naming its type, so a consumer can dispatch on one key. On the
wire it is standard Server-Sent Events (`event: <type>` / `data: <json>` framed by a blank
line) — any compliant SSE reader works; the SDK presents it as an async iterator.

Shared event vocabulary:

| Event | Fields | When |
|---|---|---|
| `content_delta` | `delta` (string) | One chunk of generated text. Append each to build the answer. |
| `done` | `inputTokens`, `outputTokens`, `model`, `platformCreditsCharged`, `inferenceBalanceCentsCharged`, optionally `cacheReadTokens` / `cacheCreateTokens` | Terminal event with token counts, resolved model id, and per-call cost. Exactly one. |
| `error` | `message`, `code` | A mid-stream model failure. |

Surface-specific events are listed under each surface below.
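
One way to consume the shared vocabulary is a pure reducer that folds each event into an accumulated answer. The sketch below is illustrative: the `StreamEventLike` union is a local approximation (the type of `code` is an assumption), and `applyStreamEvent` / `EMPTY_ANSWER` are hypothetical names, not SDK exports.

```ts
// Local approximation of the event shapes; surface-specific events are listed
// by name only so the reducer can ignore them safely.
type StreamEventLike =
  | { event: "content_delta"; delta: string }
  | { event: "done"; inputTokens: number; outputTokens: number; model: string }
  | { event: "error"; message: string; code: string | number } // `code` type is assumed
  | { event: "search_results" | "truncation_warning" | "document_context" };

interface AnswerState {
  text: string;                 // concatenated content_delta chunks
  finished: boolean;            // true once `done` has arrived
  failure: string | null;       // set if an `error` event arrived mid-stream
  outputTokens: number | null;  // from the `done` event
}

const EMPTY_ANSWER: AnswerState = { text: "", finished: false, failure: null, outputTokens: null };

// Dispatch on the single `event` key and return the next state.
function applyStreamEvent(state: AnswerState, ev: StreamEventLike): AnswerState {
  switch (ev.event) {
    case "content_delta":
      return { ...state, text: state.text + ev.delta };
    case "done":
      return { ...state, finished: true, outputTokens: ev.outputTokens };
    case "error":
      return { ...state, failure: `${ev.code}: ${ev.message}` };
    default:
      return state; // surface-specific events are handled elsewhere
  }
}
```

Inside a `for await (const ev of stream)` loop, a caller would assign `state = applyStreamEvent(state, ev)` on each iteration. Keeping the reducer synchronous makes it testable without a live stream.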

### Pre-flight checks (shared, fixed order)

Run before any model invocation; cheaper checks first:

1. **Action scope → `403`.** A scoped token must carry `inference:r` (or a wildcard). Root
   `sk_*` keys carry wildcard scope and pass by construction. The `403` does not enumerate
   the missing action.
2. **Monthly credit limit → `402`.** Once the period's cumulative credits exceed the plan's
   ceiling, inference rejects until the period rolls or the plan is upgraded.
3. **Burst rate limit → `429`.** Per-tenant request-rate ceiling, scaling with plan tier.
4. **Inference billing gate → `402`.** In **balance** mode (default on lower tiers), a
   per-partner pre-funded balance must be positive, else `402 Insufficient inference balance`.
   In **usage** mode (Enterprise-shaped, post-billed), accumulated usage is checked against a
   contractual cap; reaching the cap returns `402`.

The token cost of a call is metered and recorded on stream finalization (after the stream
closes, or on a broken pipe with partial output). A transient accounting failure does not
fail the in-flight response — the balance on the next call may briefly lag.
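
The fixed order matters when more than one condition fails at once: the caller sees only the first rejection. The following is an illustrative model of that ordering, not the server's implementation; `PreflightFacts` and `firstPreflightRejection` are hypothetical names.

```ts
// Hypothetical inputs: the outcome of each documented check.
interface PreflightFacts {
  hasInferenceScope: boolean;       // token carries inference:r or a wildcard
  monthlyCreditsExceeded: boolean;  // period credits are over the plan ceiling
  burstRateExceeded: boolean;       // per-tenant request-rate ceiling hit
  billingGateOpen: boolean;         // positive balance, or usage below the cap
}

// Returns the status of the first failing check in the documented order,
// or null when the request would proceed to model invocation.
function firstPreflightRejection(facts: PreflightFacts): 403 | 402 | 429 | null {
  if (!facts.hasInferenceScope) return 403;
  if (facts.monthlyCreditsExceeded) return 402;
  if (facts.burstRateExceeded) return 429;
  if (!facts.billingGateOpen) return 402;
  return null;
}
```

For example, a tenant that is both rate-limited and out of balance sees `429` first; the `402` appears only once the burst window clears.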

---

### Grounded corpus answers — `client.inference.ragInference(req)`

Retrieve-then-generate over your indexed content.

**Request parameters:**

| Parameter | Type | Default | Notes |
|---|---|---|---|
| `query` | string | — (required) | The question to ground and answer. |
| `model` | string | tier default | Model alias. See *Model catalog*. |
| `maxTokens` | integer | 1024 | Output cap. **Capped at 4096** (tighter than chat — retrieved context shares the input budget). |
| `temperature` | number | 0.3 | Sampling temperature. |
| `instructions` | string | — | Optional extra instructions for the answer. |
| `search` | object | — | Retrieval params (below). |

**`search` sub-object** mirrors content search: `mode` (default `HYBRID`), `limit`
(**default 10, capped at 50** — this is the RAG topK), `userId`, `orgId`, `clientId`,
`folderId`, `rootFolderId`, `typeName`, `filters`, `contentTypes`, `createdAfter`,
`createdBefore`, `requireComplete`.

**Event sequence:** `search_results` → optional `truncation_warning` → `content_delta`+ →
`done`.

| Event | Fields | Notes |
|---|---|---|
| `search_results` | `results[]`, `totalResults`, `searchTimeMs`, `degraded`, `degradedLegs` | Always emitted, even when `results` is empty. Each entry: `documentId`, `score`, `textScore`, `semanticScore`, `chunkText`, `contextText`, `snippet`, `metadata`, `sourceType`, `typeName`, `createdAt`. These are your citations. |
| `truncation_warning` | `resultsRequested`, `resultsUsed`, `reason` | Emitted before the answer if retrieved passages had to be dropped to fit the context budget. |

**Behavior:**

- With `search.requireComplete: true`, a degraded retrieval leg causes the call to reject
  **before the stream opens** (`503`) instead of grounding on partial results.
- An empty retrieval still emits `search_results` (empty) and still streams an answer
  (typically stating that nothing relevant was found).
- A scoped token's data scope is enforced on the retrieval step.
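
Since the `search_results` entries serve as citations, a consumer will often collapse them into one numbered line per source. A minimal sketch, assuming `metadata.title` was supplied at ingest (it may not have been); `CitedPassage` and `buildCitations` are hypothetical names.

```ts
// Trimmed local mirror of a `search_results` entry.
interface CitedPassage {
  documentId: string;
  metadata: { [key: string]: unknown };
}

// One numbered citation per distinct source, in ranked order.
function buildCitations(results: CitedPassage[]): string[] {
  const seen: { [id: string]: boolean } = {};
  const lines: string[] = [];
  for (const passage of results) {
    // Several chunks can come from the same source; cite it once.
    if (seen[passage.documentId]) continue;
    seen[passage.documentId] = true;
    const title = passage.metadata["title"];
    // Fall back to the id when no string title was supplied at ingest.
    const label = typeof title === "string" ? title : passage.documentId;
    lines.push(`[${lines.length + 1}] ${label}`);
  }
  return lines;
}
```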

---

### Single-document Q&A — `client.inference.documentAsk(req)`

Ask one question against one document's full text. No retrieval step.

**Request parameters:**

| Parameter | Type | Default | Notes |
|---|---|---|---|
| `id` | string (uuid) | — (required) | The document to ask against (in the request body). |
| `prompt` | string | — (required) | The question. |
| `model` | string | tier default | Model alias. |
| `maxTokens` | integer | 2048 | Output cap. **Capped at 8192.** |

**Event sequence:** `document_context` → `content_delta`+ → `done`.

| Event | Fields | Notes |
|---|---|---|
| `document_context` | `documentId`, `title`, `textBytes`, `model` | The document loaded, its size, and the resolved model. Fires before any generated text. |

**Errors:**

- **`409` (before the stream opens)** — the document is not askable yet: it is still
  processing (not yet fully indexed), it failed ingest, or it was ingested without its text
  retained (`storeText` was not set, so there is no full text to load). Freshly-ingested
  documents commonly return `409` until indexing completes.
- **`413` (before the stream opens, no credits charged)** — the document's estimated input
  size exceeds the cap (**32,000 input tokens, ~25 pages**). Payload: `message`,
  `estimatedTokens`, `limitTokens`. Branch on this and re-route to RAG.
- **`404`** — the document does not exist, belongs to another tenant, or is out of your
  token's scope. All three return the identical `404`; the endpoint never reveals existence
  outside your scope.
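
The three pre-stream failures call for different follow-ups, which can be captured in a small routing function. This is a hedged sketch: how the SDK surfaces the HTTP status on a rejected call is not covered here, so the function simply takes the numeric status; `planAfterAskFailure` and its labels are hypothetical.

```ts
type AskFollowUp = "check-document-state" | "reroute-to-rag" | "treat-as-missing" | "rethrow";

// Map a document-ask rejection status to a follow-up action.
function planAfterAskFailure(httpStatus: number): AskFollowUp {
  switch (httpStatus) {
    case 409:
      // Still processing, failed ingest, or text not retained: only the first
      // of these resolves by waiting, so inspect the document before retrying.
      return "check-document-state";
    case 413:
      // Over the input-token cap; no credits were charged.
      return "reroute-to-rag";
    case 404:
      // Missing, cross-tenant, or out of scope: indistinguishable by design.
      return "treat-as-missing";
    default:
      return "rethrow";
  }
}
```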

---

### Stateless chat — `client.inference.chatInference(req)`

One completion per request. No retrieval, no stored state.

**Request parameters:**

| Parameter | Type | Default | Notes |
|---|---|---|---|
| `messages` | `{ role, content }[]` | — (required) | Conversation. A `system` role message becomes the system prompt; `user` / `assistant` messages pass through. |
| `model` | string | tier default | Model alias. |
| `maxTokens` | integer | 2048 | Output cap. **Capped at 8192.** |
| `temperature` | number | 0.7 | Sampling temperature. |
| `topP` | number | — | Nucleus-sampling parameter (optional). |

**Event sequence:** `content_delta`+ → `done`.

Chat stores nothing. For multi-turn, append the assistant's reply to your `messages` array
and re-send the whole array next turn.
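
A minimal sketch of that caller-side bookkeeping: `ChatMessage` mirrors the documented `{ role, content }` shape, and `nextTurn` is a hypothetical helper, not an SDK function.

```ts
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Build the `messages` array for the next request: prior history, the
// assistant's last reply, then the new user message. The input is not mutated.
function nextTurn(
  history: ChatMessage[],
  assistantReply: string,
  userMessage: string,
): ChatMessage[] {
  return [
    ...history,
    { role: "assistant", content: assistantReply },
    { role: "user", content: userMessage },
  ];
}
```

The returned array grows every turn, so long conversations need trimming or summarizing by the caller to stay within the model's context window.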

---

### Model catalog — `client.inference.listInferenceModels()`

Lists the models the calling key's plan tier can reach.

**Response:**

| Field | Type | Notes |
|---|---|---|
| `models` | `Model[]` | Available models for this key. |
| `defaultModel` | string | The alias used when a call omits `model`. Reachable on the free plan. |

Each `Model`:

| Field | Type | Notes |
|---|---|---|
| `id` | string | Alias, e.g. `claude-haiku-4-5`, `claude-sonnet-4-5`, `claude-sonnet-4-6`, `claude-opus-4-7`. Matches the model vendor's marketing names. |
| `name` | string | Display name. |
| `provider` | string | Model provider. |
| `contextWindow` | integer | Context window size in tokens. |
| `inputCreditsPer1kTokens` | number | Input token credit rate. |
| `outputCreditsPer1kTokens` | number | Output token credit rate. |
| `availableOn` | `string[]` | Plan tiers that may call this model (e.g. `free`, `starter`, `pro`, `scale`, `enterprise`). |

Requesting a model your plan does not include returns a `402` pointing to upgrade. A lighter
model is available on every tier; more capable models require higher tiers.
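
To avoid the `402`, a caller can check a preferred alias against the catalog before sending a request. A hedged sketch: the types below mirror only the fields used, and `pickModel` is a hypothetical helper. It relies on the statement above that the catalog lists only models the calling key can reach.

```ts
// Trimmed local mirrors of the catalog response.
interface CatalogModel {
  id: string;
}
interface ModelCatalog {
  models: CatalogModel[];
  defaultModel: string;
}

// Use the preferred alias if the catalog lists it; otherwise fall back to the
// catalog's default model.
function pickModel(catalog: ModelCatalog, preferred: string): string {
  const reachable = catalog.models.some((m) => m.id === preferred);
  return reachable ? preferred : catalog.defaultModel;
}
```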

---

## Region serving (`allowGlobalRegion`)

All three inference surfaces (chat, RAG, document-ask) accept an optional boolean
`allowGlobalRegion` in the request body.

| Field | Type | Default | Meaning |
|---|---|---|---|
| `allowGlobalRegion` | boolean | tenant residency default | Opt this request into the lower-cost global (non-US) region path. |

- The tenant's **residency default is US serving**, applied when the flag is omitted. US
  serving is the fail-closed default and carries a region premium.
- Setting `allowGlobalRegion: true` lets an **entitled** tenant serve the request from the
  global region at a lower rate. Entitlement is gated on a signed global-processing waiver.
- If `allowGlobalRegion: true` is sent by a tenant that is **not** entitled, the request is
  rejected with `403` — it is not silently downgraded or upgraded. Region choice never
  changes which content is retrieved, only where the model runs and the price.

---

## Error codes (inference)

| Code | Surface | Meaning |
|---|---|---|
| `400` | all | Malformed request (e.g. missing `query` / `messages` / `prompt`, bad filter key). |
| `402` | all | Monthly credit limit exceeded, insufficient inference balance, usage cap reached, or a requested model the plan does not include. |
| `403` | all | Token scope does not permit inference (`inference:r` missing); **or** `allowGlobalRegion: true` was sent but this tenant is not entitled to global-region serving (no signed global-processing waiver). |
| `404` | document-ask | Document not found / cross-tenant / out-of-scope (uniform — existence never revealed). |
| `409` | document-ask | Document not askable yet — still processing (not yet indexed), failed ingest, or text not retained (ingested without `storeText`). Returned before the stream opens. |
| `413` | document-ask | Document exceeds the input-token cap (before the stream opens; no credits charged). |
| `429` | all | Burst rate limit exceeded. |
| `503` | RAG, search | A retrieval/search leg was unavailable and `requireComplete: true` (search) or `search.requireComplete: true` (RAG) was set. |

---

## Notes & limits — inference

- **Hard output caps per surface:** chat 8192, RAG 4096, document-ask 8192 output tokens.
  Document-ask additionally caps **input** at 32,000 tokens (~25 pages) with a `413` before
  the stream opens. A `maxTokens` above a surface's cap is clamped down to the cap.
- **RAG topK capped at 50.** `search.limit` defaults to 10, max 50 — retrieved context
  shares the model's input budget, so unbounded topK would push the prompt past the context
  window.
- **Chat is stateless; there is no managed conversation state.** No server-side thread
  store, no assistants registry. Multi-turn is the caller's responsibility — re-send the
  `messages` array each turn and budget the history against the model's context window.
- **Document-ask is single-document.** No multi-document Q&A endpoint. For multi-document
  grounding, use RAG (retrieval picks relevant passages across the corpus) or stitch
  multiple `documentAsk` calls at the application layer.
- **Cost is recorded on finalization.** The `done` event carries `platformCreditsCharged`
  and `inferenceBalanceCentsCharged`; an accounting hiccup will not fail an in-flight
  response, so the next call's balance may briefly lag.
- **Cache-token fields are forward-declared.** `done` may carry `cacheReadTokens` /
  `cacheCreateTokens`; the current billing formula does not yet apply a cache discount.
  Consumers already reading these will see the reduction when it lands, with no code change.
- **In-perimeter scope.** The in-perimeter (no third-party model-vendor egress) guarantee is
  for the partner data plane — the content you store and retrieve through Vectros. Specific
  compliance coverage terms are addressed in the security and compliance documentation, not
  asserted here.
- **The model catalog is the source of truth.** Handlers gate on the live catalog at request
  time, so a model going generally available or being retired takes effect immediately —
  what `listInferenceModels` returns is what the deployed handlers accept.
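
The per-surface output caps listed above can be mirrored client-side so that a caller knows the effective `maxTokens` before sending. An illustrative sketch only: `OUTPUT_TOKEN_CAPS` hard-codes the numbers from this page (they may change), and `effectiveMaxTokens` is a hypothetical helper.

```ts
// Output-token caps as stated on this page; treat as a snapshot, not a contract.
const OUTPUT_TOKEN_CAPS = { chat: 8192, rag: 4096, documentAsk: 8192 } as const;

type InferenceSurface = keyof typeof OUTPUT_TOKEN_CAPS;

// A request above the surface cap is reduced to the cap; anything at or
// below the cap passes through unchanged.
function effectiveMaxTokens(surface: InferenceSurface, requested: number): number {
  return Math.min(requested, OUTPUT_TOKEN_CAPS[surface]);
}
```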

## Where to go next

- [how-to.md](how-to.md) — runnable guides for each call on this page.
- [explanation.md](explanation.md) — the concepts behind the modes, grounding context, and
  the three inference surfaces.
- [../data-model/reference.md](../data-model/reference.md) — schema field declarations
  (searchable, filterable, sensitive) that govern what search indexes.
- [../operations-trust/compliance.md](../operations-trust/compliance.md) — sensitive-field
  protections, isolation guarantees, and the in-perimeter inference posture.
