# Search & RAG — how-to guides

Goal-oriented, runnable guides for searching your content and getting grounded answers
from a model. Every snippet here uses real SDK calls that pass against a live Vectros
environment. All data is **synthetic** — fictional names and values only.

> **SDK version.** These guides use the Node SDK. The API spec is at `0.29.9`; anything
> that needs a newer client is marked inline. Nothing on this page requires 0.26, so the
> core calls work on any current client — including the 0.23 staging build the React
> toolkit and reference apps pin, and the 0.26 build the CLI and MCP server bundle. The
> optional global-region inference flag (`allowGlobalRegion`) is a 0.27 addition.

## Client setup

All guides assume a constructed client. The constructor takes a token and an environment
(the API base URL):

```ts
import { VectrosClient } from '@vectros-ai/sdk';

const client = new VectrosClient({
  token: process.env.VECTROS_API_KEY!,          // sk_*, ssk_*, or st_*
  environment: process.env.VECTROS_API_BASE_URL!, // e.g. https://api.vectros.ai
});
```

Sub-clients are grouped by area: `client.search.*`, `client.inference.*`,
`client.documents.*`, `client.records.*`, `client.schemas.*`, `client.folders.*`.

A note on **indexing latency.** Writes are indexed asynchronously — a document or record is
not searchable the instant the write call returns. In application code you typically do not
wait; you write, and the content becomes searchable shortly after. In tests and demos, poll
until the content surfaces (search by a unique marker phrase, or scope retrieval with
`createdAfter` to your run's start time) rather than sleeping a fixed interval.
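
One way to follow the polling advice in a test is a small helper like the sketch below. It is not part of the SDK: `waitForIndexed`, `MarkerSearch`, and the timeout and interval defaults are hypothetical names and values chosen for illustration. The helper takes the search call as a parameter so it can be exercised with a fake.

```ts
// Structural stand-in for the part of `client.search.content` this helper needs.
type MarkerSearch = (req: { query: string; mode: 'TEXT'; limit: number }) =>
  Promise<{ results?: Array<{ documentId: string }> }>;

interface PollOptions {
  timeoutMs?: number;  // illustrative default below; tune for your environment
  intervalMs?: number; // pause between polls
}

// Polls until a search for `marker` returns at least one hit, then resolves
// with that hit's documentId. Rejects if the timeout elapses first.
async function waitForIndexed(
  search: MarkerSearch,
  marker: string,
  { timeoutMs = 30_000, intervalMs = 500 }: PollOptions = {},
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const res = await search({ query: marker, mode: 'TEXT', limit: 1 });
    const hit = res.results?.[0];
    if (hit) return hit.documentId;
    if (Date.now() >= deadline) {
      throw new Error(`"${marker}" was not searchable within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test you might call `await waitForIndexed((req) => client.search.content(req), marker)` right after the write, where `marker` is a phrase unique to that run.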

---

## Search your content with hybrid mode and filters

**Goal:** find content across documents and records, then narrow with filters and read the
results.

**Prerequisites:** a key with `search:r`; some indexed content.

### Steps

A unified hybrid search — the default mode, querying both documents and records:

```ts
const results = await client.search.content({
  query: 'lifestyle changes for stage 1 hypertension',
  mode: 'HYBRID',     // 'TEXT' | 'SEMANTIC' | 'HYBRID' (default HYBRID)
  limit: 10,
});

for (const hit of results.results ?? []) {
  console.log(hit.sourceType, hit.documentId, hit.score);
  console.log('matched chunk:', hit.chunkText);
  console.log('grounding context:', hit.contextText);
}
console.log(`${results.totalResults} total, ${results.searchTimeMs} ms`);
```

Each hit carries these fields:

- `sourceType`: a discriminator. Branch on it to tell a document hit from a record hit;
  the literal values the API returns are the strings `"PartnerDocument"` and
  `"GenericRecord"`.
- `documentId`: the source id (use it with `getDocument` / `getRecord`).
- `score`: the fused score.
- `chunkText`: the matched chunk.
- `contextText`: the wider context.
- `snippet`: a highlighted snippet.
- `metadata`: the metadata supplied at ingest time.
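
Branching on `sourceType` can be wrapped in a small helper. The sketch below is illustrative only: `HitLike` and `splitBySource` are hypothetical names, and the interface lists just the fields the helper touches rather than the SDK's full hit type.

```ts
// Minimal structural view of a search hit; the real SDK type has more fields.
interface HitLike {
  sourceType: string;
  documentId: string;
  score: number;
}

// Splits hits into document hits and record hits using the literal
// discriminator values named above. Unknown values land in `other`
// instead of being silently dropped.
function splitBySource<T extends HitLike>(
  hits: T[],
): { documents: T[]; records: T[]; other: T[] } {
  const out = { documents: [] as T[], records: [] as T[], other: [] as T[] };
  for (const hit of hits) {
    switch (hit.sourceType) {
      case 'PartnerDocument':
        out.documents.push(hit);
        break;
      case 'GenericRecord':
        out.records.push(hit);
        break;
      default:
        out.other.push(hit);
    }
  }
  return out;
}
```

Keeping an `other` bucket is a defensive choice: if the API ever returns a source type this code does not know, the hit is still visible to the caller.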

**Narrow to one content type** with `contentTypes`:

```ts
// Documents only
const docsOnly = await client.search.content({
  query: 'ACE inhibitor dosing',
  mode: 'HYBRID',
  contentTypes: ['documents'],
});

// Records only
const recordsOnly = await client.search.content({
  query: 'ACE inhibitor dosing',
  mode: 'HYBRID',
  contentTypes: ['records'],
});
```

**Narrow records to one schema type** with `typeName` (this implicitly restricts to records):

```ts
const patientRecords = await client.search.content({
  query: 'follow-up scheduled',
  mode: 'TEXT',
  typeName: 'patient_visit',   // only records of this schema type
  limit: 50,
});
```

**Scope to a folder** (exact folder), or to a folder and all its descendants:

```ts
const inFolder = await client.search.content({
  query: 'intake notes',
  mode: 'HYBRID',
  folderId: '<folder-uuid>',        // this exact folder only
  // rootFolderId: '<folder-uuid>', // this folder AND all descendants
});
```

**Filter by ownership** (user / org / client) and **by metadata fields**:

```ts
const scoped = await client.search.content({
  query: 'medication review',
  mode: 'HYBRID',
  clientId: '<client-uuid>',                 // ownership scope
  filters: {
    status: 'open',                          // equality
    tag: ['anxiety', 'depression'],          // OR-set: match any
    visitCount: { $gte: 2, $lte: 10 },       // closed range
  },
});
```

Filter values may be a scalar (equality), an array (match any), or an operator map
(`$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`). Metadata filter keys must be
fields declared filterable on the schema; you cannot use the `filters` map to inject a
tenancy or ownership key to widen your access — those are enforced separately.
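
To make the three filter-value shapes concrete, here is a local model of how such a filter map could be evaluated against one item's metadata. It is a sketch for reasoning and unit tests, not the service's implementation: the type names and `matchesFilters` are hypothetical, metadata values are assumed to be scalars, and combining multiple keys with AND is an assumption.

```ts
type Scalar = string | number | boolean;

interface OperatorMap {
  $eq?: Scalar; $ne?: Scalar;
  $gt?: number; $gte?: number; $lt?: number; $lte?: number;
  $in?: Scalar[]; $nin?: Scalar[];
}

type FilterValue = Scalar | Scalar[] | OperatorMap;

function matchesValue(actual: Scalar | undefined, expected: FilterValue): boolean {
  // Array: match any of the listed values.
  if (Array.isArray(expected)) {
    return actual !== undefined && expected.includes(actual);
  }
  // Scalar: plain equality.
  if (typeof expected !== 'object') return actual === expected;

  // Operator map: every operator present must hold.
  const { $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin } = expected;
  if ($eq !== undefined && actual !== $eq) return false;
  if ($ne !== undefined && actual === $ne) return false;
  if ($in !== undefined && (actual === undefined || !$in.includes(actual))) return false;
  if ($nin !== undefined && actual !== undefined && $nin.includes(actual)) return false;

  const hasRange = [$gt, $gte, $lt, $lte].some((bound) => bound !== undefined);
  if (!hasRange) return true;
  if (typeof actual !== 'number') return false; // range operators need a number
  return (
    ($gt === undefined || actual > $gt) &&
    ($gte === undefined || actual >= $gte) &&
    ($lt === undefined || actual < $lt) &&
    ($lte === undefined || actual <= $lte)
  );
}

// Assumption: all keys in the filter map must match (AND across keys).
function matchesFilters(
  metadata: Record<string, Scalar | undefined>,
  filters: Record<string, FilterValue>,
): boolean {
  return Object.entries(filters).every(([key, expected]) =>
    matchesValue(metadata[key], expected),
  );
}
```

Under this model, the `filters` object from the example above matches metadata such as `{ status: 'open', tag: 'anxiety', visitCount: 3 }` and rejects the same metadata with `visitCount: 11`.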

**Expected result:** a `results` array plus `totalResults`, `searchTimeMs`, and the
`degraded` / `degradedLegs` fields. An empty result set for a query that matches nothing is
a normal `200` with `results: []` and `totalResults: 0` — not an error.
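
A caller usually wants to tell three outcomes apart: results found, a clean empty result, and a degraded search. The sketch below assumes `degraded` is a boolean and `degradedLegs` is an array of leg names; both types are assumptions here, so check the reference for the exact shapes. `classifySearch` is a hypothetical helper name.

```ts
// Structural view of the response fields this helper reads.
interface SearchResponseLike {
  results?: unknown[];
  totalResults?: number;
  degraded?: boolean;      // assumed: true when at least one search leg was unavailable
  degradedLegs?: string[]; // assumed: names of the legs that were unavailable
}

type SearchOutcome = 'ok' | 'empty' | 'degraded';

// Degraded takes priority: an empty or short list from a degraded search
// does not prove that nothing matches.
function classifySearch(res: SearchResponseLike): SearchOutcome {
  if (res.degraded) return 'degraded';
  return (res.results?.length ?? 0) === 0 ? 'empty' : 'ok';
}
```

A UI might show "no matches" for `'empty'` but a retry prompt for `'degraded'`.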

### Paging search results

`search.content` is **not** wrapped in the cursor envelope that list and lookup endpoints
use — it has no `nextCursor`. Page with `limit` and `offset` instead:

```ts
const page1 = await client.search.content({ query, mode: 'TEXT', limit: 20, offset: 0 });
const page2 = await client.search.content({ query, mode: 'TEXT', limit: 20, offset: 20 });
```

`limit` is capped at 100 and `offset` at 200 (an `offset` above 200 is rejected with
`400`). Consecutive pages are disjoint. For pulling a known set of recent content
deterministically — or for reaching past the 200-row offset ceiling — prefer a
`createdAfter` window over deep `offset` paging.
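
The paging rules above can be folded into a loop that stops on a short page or at the offset ceiling. This is a minimal sketch: `collectPages` and `PagedSearch` are hypothetical names, and the search call is passed in as a function so the loop can be tested without a live environment.

```ts
// Structural stand-in for a search call that accepts limit and offset.
type PagedSearch<T> = (page: { limit: number; offset: number }) =>
  Promise<{ results?: T[] }>;

const SEARCH_MAX_LIMIT = 100;  // documented cap on `limit`
const SEARCH_MAX_OFFSET = 200; // documented cap on `offset`

// Fetches consecutive pages until a page comes back short or the next
// offset would exceed the ceiling.
async function collectPages<T>(search: PagedSearch<T>, limit = 20): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1 || limit > SEARCH_MAX_LIMIT) {
    throw new RangeError(`limit must be an integer from 1 to ${SEARCH_MAX_LIMIT}`);
  }
  const all: T[] = [];
  for (let offset = 0; offset <= SEARCH_MAX_OFFSET; offset += limit) {
    const page = (await search({ limit, offset })).results ?? [];
    all.push(...page);
    if (page.length < limit) break; // short page: nothing further to fetch
  }
  return all;
}
```

A call might look like `collectPages((page) => client.search.content({ query, mode: 'TEXT', ...page }))`. With `limit: 100` the loop can reach at most 300 rows (offsets 0, 100, and 200), which is why a `createdAfter` window is the better tool for larger sets.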

### Collapse multiple chunks of the same document

A long document can match on several of its chunks and produce several hits with the same
`documentId`. To get at most one hit per source document, set `uniqueDocuments: true`:

```ts
const collapsed = await client.search.content({
  query: 'hypertension monitoring',
  mode: 'TEXT',
  uniqueDocuments: true,   // at most one hit per source document
  limit: 50,
});
```
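
If you already hold uncollapsed hits (for example, because another view shows every matching chunk), a client-side collapse gives a similar result. The sketch below keeps the highest-scoring hit per `documentId`; which chunk the service keeps under `uniqueDocuments` is not stated here, so treat that choice as an assumption. `collapseByDocument` is a hypothetical helper name.

```ts
// Minimal structural view of a hit for collapsing purposes.
interface ChunkHit {
  documentId: string;
  score: number;
}

// Keeps one hit per documentId (the best-scoring one) and returns the
// survivors ordered by descending score.
function collapseByDocument<T extends ChunkHit>(hits: T[]): T[] {
  const best = new Map<string, T>();
  for (const hit of hits) {
    const current = best.get(hit.documentId);
    if (!current || hit.score > current.score) best.set(hit.documentId, hit);
  }
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}
```

Prefer the server-side flag when you can: it collapses before `limit` is applied, so a page is not used up by several chunks of one document.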

---

## Get a grounded answer over your whole corpus (RAG)

**Goal:** ask a natural-language question and stream back a model answer grounded on your
indexed content, with citations.

**Prerequisites:** a key with `inference:r` (and `search:r` for the retrieval step); some
indexed content; a positive inference balance on balance-mode plans.

### Steps

`ragInference` returns an async iterable of SSE events. The first informative event is
`search_results` (the citations), followed by `content_delta` events (the streamed answer),
ended by a single `done` event:

```ts
const stream = await client.inference.ragInference({
  query: 'What treatment is recommended for stage 1 hypertension?',
  model: 'claude-sonnet-4-5',     // optional; omit for the tier default
  search: {
    mode: 'HYBRID',
    limit: 5,                     // retrieval topK (default 10, max 50)
    // narrow retrieval the same way you narrow search:
    // clientId, userId, orgId, folderId, typeName, createdAfter, createdBefore
  },
  maxTokens: 512,                 // capped at 4096 for RAG
});

let answer = '';
let citations: Array<{ documentId: string; chunkText: string }> = [];

for await (const event of stream) {
  switch (event.event) {
    case 'search_results':
      citations = event.results;          // show these so the user can verify grounding
      break;
    case 'truncation_warning':
      // some retrieved passages were dropped to fit the context budget
      console.warn('grounding truncated:', event.reason);
      break;
    case 'content_delta':
      answer += event.delta;              // append each chunk
      break;
    case 'done':
      console.log('tokens:', event.inputTokens, '→', event.outputTokens);
      console.log('charged:', event.inferenceBalanceCentsCharged, 'cents');
      break;
    case 'error':
      throw new Error(event.message);
  }
}
```

Each entry in `search_results.results` carries `documentId`, `score`, `chunkText`,
`contextText`, `snippet`, `metadata`, `sourceType`, `typeName`, and `createdAt` — the same
shape as a search hit, so you can render the citations exactly as you would render search
results. UIs typically show the citations above the answer so the user can verify the
grounding while the model is still generating.

If retrieval finds nothing, the `search_results` event still fires (with an empty
`results` array) and the model still answers — typically stating that it found no relevant
content. The contract is that `search_results` is always emitted, even when empty.
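
Because `search_results` always arrives, a consumer can record whether the answer was grounded on anything at all. The sketch below collects a stream into one result object with a `grounded` flag. The event union is a simplified assumption that covers only the fields used here, and `collectRag`, `RagEvent`, and `RagResult` are hypothetical names rather than SDK exports.

```ts
interface Citation {
  documentId: string;
  chunkText: string;
}

// Simplified event union; the real events carry more fields.
type RagEvent =
  | { event: 'search_results'; results: Citation[] }
  | { event: 'truncation_warning'; reason: string }
  | { event: 'content_delta'; delta: string }
  | { event: 'done'; inputTokens: number; outputTokens: number }
  | { event: 'error'; message: string };

interface RagResult {
  answer: string;
  citations: Citation[];
  grounded: boolean;  // false when retrieval returned no passages
  truncated: boolean; // true when a truncation_warning was seen
}

// Drains the stream and assembles the answer, citations, and flags.
async function collectRag(stream: AsyncIterable<RagEvent>): Promise<RagResult> {
  const out: RagResult = { answer: '', citations: [], grounded: false, truncated: false };
  for await (const ev of stream) {
    switch (ev.event) {
      case 'search_results':
        out.citations = ev.results;
        out.grounded = ev.results.length > 0;
        break;
      case 'truncation_warning':
        out.truncated = true;
        break;
      case 'content_delta':
        out.answer += ev.delta;
        break;
      case 'error':
        throw new Error(ev.message);
      case 'done':
        break;
    }
  }
  return out;
}
```

When `grounded` is `false`, a UI can label the reply as "no sources found" instead of presenting it as a cited answer.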

**Refuse a partial answer.** If you would rather fail than ground on a degraded retrieval,
set `search.requireComplete: true`. The call then returns an error **before the stream
opens** when a search leg is unavailable, instead of grounding on partial results.

**Expected result:** the assembled `answer` string, a `citations` array you can display,
and a per-call cost on the `done` event.

### Scoping RAG retrieval with a scoped token

A scoped token's data scope is enforced on the retrieval step, so RAG over a scoped token
only grounds on content that token is allowed to see. Mint a token scoped to a user, then
run RAG through it:

```ts
const minted = await client.auth.mintToken({
  userId: '<user-uuid>',
  scope: {
    allowedActions: ['inference:r', 'search:r'],
    dataScope: { userId: ['<user-uuid>'] },
  },
});

const scoped = new VectrosClient({
  token: minted.token,
  environment: process.env.VECTROS_API_BASE_URL!,
});

const stream = await scoped.inference.ragInference({
  query: 'What treatments are discussed?',
  search: { mode: 'HYBRID', limit: 10, userId: '<user-uuid>' },
  maxTokens: 256,
});
```

Because the data scope lists a single `userId` with no `null` sentinel, every call must
include the matching `userId`. To also reach tenant-level (owner-less) content under the
same token, include `null` in the scope list: `dataScope: { userId: ['<user-uuid>', null] }`.

---

## Ask a single document

**Goal:** ask one question about one specific document and stream the answer.

**Prerequisites:** a key with `inference:r`; the document's id; the document indexed with
its text stored.

### Steps

`documentAsk` takes the document `id` and a `prompt` in the request body. It streams a
`document_context` event first (the document it loaded), then the answer:

```ts
const stream = await client.inference.documentAsk({
  id: '<document-uuid>',
  prompt: 'Which medications does this document describe, and how do they work?',
  model: 'claude-haiku-4-5',   // optional; omit for the tier default
  maxTokens: 256,              // output cap 8192
});

let answer = '';
for await (const event of stream) {
  switch (event.event) {
    case 'document_context':
      console.log('asking against:', event.documentId, event.title, `${event.textBytes} bytes`);
      break;
    case 'content_delta':
      answer += event.delta;
      break;
    case 'done':
      console.log('charged:', event.inferenceBalanceCentsCharged, 'cents');
      break;
  }
}
```

**Handle the oversize case.** A document larger than the input cap (~25 pages, 32,000
estimated input tokens) is rejected with a `413` **before the stream opens** — and *before
any credits are charged*. The error payload carries `estimatedTokens` and `limitTokens` so
you can branch and re-route to RAG:

```ts
try {
  const stream = await client.inference.documentAsk({ id, prompt, maxTokens: 256 });
  // ... consume stream
} catch (err: any) {
  if (err.statusCode === 413) {
    // too large for single-document ask — use corpus RAG instead
    return askViaRag(prompt);
  }
  throw err;
}
```

**A not-yet-ready document returns `409`.** A document that is not yet fully indexed, that
failed ingest, or that was ingested without its text stored returns a `409` before the
stream opens. Asking a document the instant after you ingest it commonly hits this — wait for
indexing to complete, and ingest with the text stored if you intend to ask against it.
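
For the "asked too soon" case, a bounded retry is one option. The sketch below retries only on `409` and gives up after a fixed number of attempts, because a `409` caused by a failed ingest or by missing stored text will never clear on its own. `retryWhileNotReady` and its defaults are hypothetical, and reading the status from `err.statusCode` is an assumption about the error shape.

```ts
interface RetryOptions {
  retries?: number; // extra attempts after the first; illustrative default
  delayMs?: number; // pause between attempts; illustrative default
}

// Runs `attempt`, retrying only while it fails with status 409.
// Any other error, or a 409 after the final retry, is rethrown unchanged.
async function retryWhileNotReady<T>(
  attempt: () => Promise<T>,
  { retries = 5, delayMs = 1000 }: RetryOptions = {},
): Promise<T> {
  for (let tried = 0; ; tried += 1) {
    try {
      return await attempt();
    } catch (err) {
      const status = (err as { statusCode?: number } | null)?.statusCode;
      if (status !== 409 || tried >= retries) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Usage might look like `await retryWhileNotReady(() => client.inference.documentAsk({ id, prompt, maxTokens: 256 }))`.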

**Out-of-scope and missing documents both return `404`.** A document that does not exist,
that belongs to another tenant, or that your token's scope cannot see produces the same
`404` in every case, so the endpoint never reveals whether a document exists outside your scope.
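
The three pre-stream failures described above can be mapped to next steps in one place. This is a sketch under the assumption that the thrown error exposes the HTTP status as `statusCode`; `classifyAskError` and the `AskFallback` labels are hypothetical.

```ts
type AskFallback = 'use-rag' | 'retry-later' | 'not-found' | 'rethrow';

// Maps a documentAsk failure to a suggested next step.
function classifyAskError(err: unknown): AskFallback {
  const status = (err as { statusCode?: number } | null)?.statusCode;
  switch (status) {
    case 413:
      return 'use-rag';     // too large for single-document ask
    case 409:
      return 'retry-later'; // not indexed yet, failed ingest, or no stored text
    case 404:
      return 'not-found';   // missing or outside the caller's scope
    default:
      return 'rethrow';     // anything else is unexpected here
  }
}
```

Treat `'not-found'` the same way in the UI whether the document is missing or out of scope, since the API does not distinguish the two.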

**Expected result:** a streamed `answer` grounded only on that one document, or a structured
`413` (too large), `409` (not ready to be asked), or `404` (not visible to you).

---

## Make a stateless chat call

**Goal:** a single-turn model completion you manage the context for yourself.

**Prerequisites:** a key with `inference:r`; a positive inference balance on balance-mode
plans.

### Steps

`chatInference` takes a `messages` array. A `system` message sets the system prompt; the
rest is passed through. It streams `content_delta` events then a `done`:

```ts
const stream = await client.inference.chatInference({
  messages: [
    { role: 'system', content: 'You are a concise clinical scribe.' },
    { role: 'user',   content: 'Summarize this visit note in three bullets: ...' },
  ],
  model: 'claude-haiku-4-5',  // optional
  maxTokens: 512,             // output cap 8192
});

let reply = '';
for await (const event of stream) {
  if (event.event === 'content_delta') reply += event.delta;
  if (event.event === 'done') {
    console.log('model:', event.model);
    console.log('tokens:', event.inputTokens, '→', event.outputTokens);
  }
}
```

**Multi-turn is your responsibility.** Chat stores nothing. To carry a conversation,
append the model's reply to your `messages` array and re-send the whole array on the next
turn:

```ts
const history = [
  { role: 'user',      content: 'My care plan emphasizes lifestyle changes.' },
  { role: 'assistant', content: 'Noted — lifestyle-first care plan.' },
  { role: 'user',      content: 'What did I just say my care plan emphasized?' },
];
const stream = await client.inference.chatInference({ messages: history, maxTokens: 64 });
```
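
The append-and-resend loop can be captured in a helper that returns the updated history after each turn. The sketch below is an illustration, not an SDK feature: `sendTurn`, `ChatFn`, and the simplified event type are hypothetical, and the chat call is injected so the helper can run against a fake.

```ts
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Simplified event type: only the delta payload matters to this helper.
type ChatEvent = { event: 'content_delta'; delta: string } | { event: 'done' };

type ChatFn = (req: { messages: ChatMessage[]; maxTokens: number }) =>
  Promise<AsyncIterable<ChatEvent>>;

// Sends the full history plus one new user message, then returns a new
// history array that also contains the assistant's reply.
async function sendTurn(
  chat: ChatFn,
  history: ChatMessage[],
  userText: string,
  maxTokens = 512,
): Promise<ChatMessage[]> {
  const messages: ChatMessage[] = [...history, { role: 'user', content: userText }];
  const stream = await chat({ messages, maxTokens });
  let reply = '';
  for await (const ev of stream) {
    if (ev.event === 'content_delta') reply += ev.delta;
  }
  return [...messages, { role: 'assistant', content: reply }];
}
```

Each call returns a fresh array, so the caller decides where the conversation is stored: `history = await sendTurn(chat, history, 'next question')`.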

**Expected result:** the streamed `reply` and a `done` event with token counts and the
resolved model id.

---

## List the models your key can reach

**Goal:** discover which inference models the calling key's plan permits, before you call.

```ts
const catalog = await client.inference.listInferenceModels();

console.log('default model:', catalog.defaultModel);
for (const m of catalog.models) {
  console.log(m.id, '— context window', m.contextWindow, '— plans:', m.availableOn);
}
```

Each entry carries the alias `id` (e.g. `claude-haiku-4-5`), a display `name`, the
`provider`, the `contextWindow`, per-1k-token credit rates, and `availableOn` (the plan
tiers that may call it). `defaultModel` is what an inference call resolves to when you omit
`model` — and it is reachable on the free plan, so a brand-new key can make a working call
immediately.
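
With the catalog in hand, model selection can be a pure function. The sketch below picks the largest-context model a given plan may call and falls back to `defaultModel`. It assumes `availableOn` is an array of plan-tier strings; `pickModel` is a hypothetical helper, and the plan names in the usage note are placeholders.

```ts
// Structural view of the catalog fields this helper reads.
interface ModelEntry {
  id: string;
  contextWindow: number;
  availableOn: string[]; // assumed: plan-tier names
}

interface ModelCatalog {
  defaultModel: string;
  models: ModelEntry[];
}

// Returns the id of the largest-context model available on `plan` that
// meets `minContext`, or the catalog default when none qualifies.
function pickModel(catalog: ModelCatalog, plan: string, minContext = 0): string {
  const candidates = catalog.models
    .filter((m) => m.availableOn.includes(plan) && m.contextWindow >= minContext)
    .sort((a, b) => b.contextWindow - a.contextWindow);
  return candidates[0]?.id ?? catalog.defaultModel;
}
```

For example, `pickModel(catalog, 'pro', 100_000)` would return a model id to pass as `model` on the next inference call, where `'pro'` stands in for whatever tier names your catalog reports.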

---

## Drive search and RAG from an AI agent (MCP)

If you are building with an agent over the Model Context Protocol rather than calling the
SDK directly, the Vectros MCP server exposes the same capabilities as tools:

- **`hybrid_search`**: wraps content search. Same modes, filters, ownership scope, folder
  scope, and `uniqueDocuments` / `minSimilarity` knobs. (The MCP tool caps results lower
  than the API, default **3** and max 10, to protect the agent's context window; paginate
  with `offset` for more.)
- **`rag_ask`**: wraps corpus RAG. The agent gets the assembled answer plus the citations
  and usage in one tool result; progress notifications keep the call alive during
  generation. (Its retrieval defaults to **5** results, max 10, also lower than the API.)
- **`document_ask`**: wraps single-document Q&A, including the structured oversize signal.

> **Why agent results can look sparse.** The MCP tools deliberately default to far fewer
> results than the API (`hybrid_search` and `record_query` default to 3, `rag_ask`
> retrieval to 5; the API defaults are 10–100). This protects the agent's context window;
> it is **not** a bug. Raise each tool's `limit` (up to its max of 10) when an agent needs
> wider recall. The full per-tool cap table is in [clients/mcp.md](../clients/mcp.md).

The agent never sees content outside the scoped key the MCP server is configured with —
the same scope enforcement applies. See the blueprint walkthroughs for the end-to-end
no-code agent path.

## Where to go next

- [reference.md](reference.md) — every parameter, field, limit, and error code for search
  and the three inference surfaces.
- [explanation.md](explanation.md) — the concepts: why the three modes exist, grounding
  context, and how search and RAG are the same machinery.
- [../data-model/how-to.md](../data-model/how-to.md) — defining schemas with searchable and
  sensitive fields, and writing the records you search over.
- [../identity-access/how-to.md](../identity-access/how-to.md) — minting scoped tokens and
  the data-scope / null-sentinel rules the RAG scoping example relies on.