# Rate limits

Vectros applies two independent rate limits, each tuned for a different purpose:

1. **A per-account business limit** at the application layer — a fixed number of requests per
   minute, selected from your plan. This is the limit you pace against day to day, and the one
   this page is about.
2. **A per-IP ceiling** at the network edge — a high, plan-independent flood-protection limit.
   A legitimate server-to-server backend never reaches it; it exists to cut off a single-source
   flood before it reaches application code.

The two are enforced at different layers and return slightly different responses (see
[Two layers, two responses](#two-layers-two-responses) below).

---

## The per-account limit

- **Window.** A fixed **one-minute** window. Your request count resets at the top of each
  minute (it is not a rolling window).
- **Scope.** The limit is **per account, shared across every key**. All of your API keys and
  scoped tokens draw from the **same** per-minute budget — issuing more keys does not raise it.
- **What counts.** **Writes, searches, and inference** count against the limit. **Reads
  (`GET` requests) do not** — listing and fetching are not throttled by this limit.
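
A client can stay under the budget by spacing out the requests that count. The following is a minimal single-process sketch, not an official client: `MinutePacer` and its parameters are hypothetical names, and it treats every non-`GET` method as counted, which is an assumption drawn from the rule above. Because the budget is shared across all keys, workers in separate processes would each need a share of the budget or a shared pacer.

```python
import time
from typing import Callable


class MinutePacer:
    """Spaces counted requests evenly so one process stays under a per-minute budget."""

    def __init__(
        self,
        per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if per_minute <= 0:
            raise ValueError("per_minute must be positive")
        self.interval = 60.0 / per_minute  # seconds between counted requests
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()  # earliest time the next counted request may go

    def wait(self, method: str) -> float:
        """Block until the next free slot and return the seconds slept.

        Assumption: only GET requests are exempt from the per-account limit.
        """
        if method.upper() == "GET":
            return 0.0
        now = self._clock()
        delay = max(0.0, self._next_slot - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the following slot one interval after this request's slot.
        self._next_slot = max(now, self._next_slot) + self.interval
        return delay
```

Calling `pacer.wait("POST")` before each write keeps a Free-plan client at roughly one counted request per second, while `pacer.wait("GET")` returns immediately.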

### Limits by plan

| Plan | Requests per minute |
|---|---|
| Free | 60 |
| Starter | 300 |
| Pro | 600 |
| Scale | 1,200 |
| Enterprise | 1,200 (negotiable) |

An unrecognized or unset plan is treated as **Free (60/min)**. Enterprise agreements can raise
the ceiling beyond the Scale default by arrangement.
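
The table and its fallback rule can be mirrored client-side when sizing a worker pool. This is an illustrative sketch only: the lowercase plan keys, `limit_for_plan`, and `per_worker_budget` are hypothetical names, and the case-insensitive matching is an assumption rather than documented behavior.

```python
from typing import Optional

# Requests-per-minute values copied from the table above.
PLAN_LIMITS = {
    "free": 60,
    "starter": 300,
    "pro": 600,
    "scale": 1200,
    "enterprise": 1200,  # default ceiling; an agreement may raise it
}


def limit_for_plan(plan: Optional[str]) -> int:
    """Return the per-minute budget, treating unknown or unset plans as Free."""
    if not plan:
        return PLAN_LIMITS["free"]
    return PLAN_LIMITS.get(plan.strip().lower(), PLAN_LIMITS["free"])


def per_worker_budget(plan: Optional[str], workers: int) -> int:
    """Split the shared account budget evenly across `workers` (rounded down)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return limit_for_plan(plan) // workers
```

For example, four workers on the Pro plan would each get `per_worker_budget("Pro", 4)`, which is 150 counted requests per minute.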

---

## When you exceed the limit

A request over the limit returns **HTTP `429`** with a JSON body and a set of headers that tell
you exactly how to recover:

```json
{
  "message": "Rate limit exceeded. Please try again later.",
  "errorCode": "RATE_LIMITED",
  "requestId": "…"
}
```

| Header | Meaning |
|---|---|
| `Retry-After` | Seconds to wait before retrying. Once they elapse, the window has reset and a retry succeeds, provided the account's other keys have not already spent the new window's budget. |
| `X-RateLimit-Limit` | Your plan's maximum requests per minute. |
| `X-RateLimit-Remaining` | Requests remaining in the current window — always `0` on a `429`. |

`Retry-After` is the value to honor: it is the number of seconds until the current minute window
resets, so a client that sleeps for `Retry-After` seconds lands in a fresh window with a full
budget.
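
Reading the header defensively might look like the sketch below. `retry_delay_seconds` is a hypothetical helper, and the 60-second fallback (one full window) for a missing or malformed header is an assumption, not documented behavior.

```python
import math
from typing import Mapping


def retry_delay_seconds(headers: Mapping[str, str], default: float = 60.0) -> float:
    """Return the seconds to sleep after a 429, taken from Retry-After.

    Falls back to `default` when the header is missing or is not a finite,
    non-negative number.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return default
    try:
        delay = float(raw.strip())
    except ValueError:
        return default
    if not math.isfinite(delay) or delay < 0:
        return default
    return delay
```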

### Handling 429s

- **Honor `Retry-After`.** Sleep for the advertised number of seconds, then retry. This is the
  single most effective thing a client can do.
- **Back off exponentially** if you retry without reading `Retry-After`, and add jitter so a
  fleet of workers doesn't retry in lockstep.
- **Pace bulk work.** A one-shot backfill (for example, ingesting a large corpus) is exactly the
  workload that hits this limit. Spread writes across minutes, or cap concurrency so your steady
  rate stays under your plan's per-minute budget.
- **Branch on the `429` status, not the body.** The status code is the contract; the
  `errorCode` and headers are there to help you recover.
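
The first two points can be combined in one retry loop. This is a sketch under stated assumptions, not a reference client: `call_with_retry` and `send` are hypothetical names, `send` stands in for whatever performs your HTTP request and returns `(status, headers)`, and the fallback is exponential backoff with full jitter, one common variant.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def _parse_retry_after(headers: Mapping[str, str]) -> Optional[float]:
    """Return Retry-After as seconds, or None if it is absent or unusable."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                seconds = float(value)
            except ValueError:
                return None
            return seconds if seconds >= 0 else None
    return None


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it returns a non-429 status or attempts run out."""
    for attempt in range(max_attempts):
        status, headers = send()
        # Branch on the status code only; give up after the last attempt.
        if status != 429 or attempt == max_attempts - 1:
            return status, headers
        retry_after = _parse_retry_after(headers)
        if retry_after is not None:
            delay = retry_after  # honor the server's advertised wait
        else:
            # No usable header: exponential backoff with full jitter.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
        sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

The caller still receives the final `429` if every attempt is throttled, so it can decide whether to queue the work or surface the error.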

The limiter **fails open**: if the counter store is briefly unavailable, requests are allowed
rather than rejected, so a limiter outage never blocks your traffic.

---

## Two layers, two responses

Because the per-account limit and the per-IP edge ceiling live at different layers, a `429` can
come from either:

- **Application limit (per-account, per-minute):** the body above, with `Retry-After` and
  `X-RateLimit-*` headers and a `requestId` you can quote to support.
- **Edge limit (per-IP, flood protection):** a `429` from the firewall with the same
  `errorCode: "RATE_LIMITED"` discriminator, but no `requestId` (the edge has no per-request
  correlation id). If you contact support about an edge `429`, quote the `x-amz-cf-id` response
  header instead.

In both cases the **status is `429`** and the body carries **`errorCode: "RATE_LIMITED"`**, so a
single check on the status (and, if you want, the `errorCode`) handles either layer.
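
When logging a `429` for a support ticket, a client can pick the right identifier for whichever layer answered. The sketch below is a heuristic built on the description above: `classify_429` and `RateLimitInfo` are hypothetical names, and using the presence of `requestId` to tell the layers apart is an assumption rather than a documented contract.

```python
import json
from typing import Mapping, NamedTuple, Optional


class RateLimitInfo(NamedTuple):
    layer: str                  # "application" or "edge"
    support_id: Optional[str]   # identifier to quote to support, if any


def classify_429(
    status: int, headers: Mapping[str, str], body: str
) -> Optional[RateLimitInfo]:
    """Return which layer produced a 429, or None for any other status."""
    if status != 429:
        return None
    try:
        payload = json.loads(body) if body else {}
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    request_id = payload.get("requestId")
    if request_id:
        # Application-layer responses carry a requestId in the body.
        return RateLimitInfo("application", str(request_id))
    # Edge responses have no requestId; fall back to the x-amz-cf-id header.
    lowered = {name.lower(): value for name, value in headers.items()}
    return RateLimitInfo("edge", lowered.get("x-amz-cf-id"))
```

Retry logic does not need this distinction; it only changes which identifier ends up in your logs.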

---

## Notes & limits

- The per-minute window is **fixed**, not rolling: a burst just before a minute boundary followed
  by another just after it can pass up to two windows' worth of requests within a few seconds.
  Do not depend on this; pace to your sustained per-minute rate.
- Reads are **not** counted by the per-account limit today. Plan your read-heavy workloads
  against the edge per-IP ceiling, not this limit.
- The limit is looked up from your plan on every request, so a plan upgrade takes effect on the
  next request.
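
The boundary effect in the first note can be reproduced with a toy model. The class below is a generic fixed-window counter written for illustration, not Vectros's implementation; `FixedWindowCounter` and the timestamps are hypothetical, and the limit of 60 matches the Free plan.

```python
from collections import defaultdict
from typing import Dict


class FixedWindowCounter:
    """Toy fixed-window limiter: one counter per window, reset at each boundary."""

    def __init__(self, limit: int, window_seconds: int = 60) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts: Dict[int, int] = defaultdict(int)

    def allow(self, now: float) -> bool:
        """Admit the request at time `now` if its window still has budget."""
        window = int(now // self.window_seconds)
        if self.counts[window] >= self.limit:
            return False
        self.counts[window] += 1
        return True


limiter = FixedWindowCounter(limit=60)
# 60 requests in the last second of one minute...
late = sum(limiter.allow(59.0 + i / 100) for i in range(60))
# ...and 60 more in the first second of the next minute.
early = sum(limiter.allow(60.0 + i / 100) for i in range(60))
# One further request in the same second window is rejected.
extra = limiter.allow(60.9)
```

In this model all 120 requests in the two bursts are admitted within about 1.6 seconds, and only the following request is rejected.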
