> ## Documentation Index
> Fetch the complete documentation index at: https://docs.cogito.decart.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Model catalog

> Available open-source models, with context windows, output caps, throughput, and pricing.

Cogito serves a curated set of frontier open-source models — **Kimi K2.6 · DeepSeek V4 (Flash & Pro) · GPT-OSS · Qwen**.

The full served catalog is listed below. We add models the day they ship, and older models stay supported, with a deprecation notice before any removal.

| Model                       | Family   | Context | Max output | TPS (tokens/sec) | License          |
| --------------------------- | -------- | ------- | ---------- | ---------------- | ---------------- |
| `moonshotai/kimi-k2.6`      | Moonshot | 256K    | 256K       | 70               | Modified MIT     |
| `moonshotai/kimi-k2.6:fast` | Moonshot | 256K    | 256K       | 70               | Modified MIT     |
| `kimi-tt`                   | Moonshot | 256K    | 256K       | 70               | Modified MIT     |
| `deepseek-v4-pro`           | DeepSeek | 1M      | 128K       | 70               | DeepSeek License |
| `deepseek-v4-flash`         | DeepSeek | 1M      | 1M         | 70               | DeepSeek License |
| `gpt-oss-120b`              | OpenAI   | 128K    | 32K        | 70               | Apache 2.0       |
| `qwen-3-235b`               | Alibaba  | 256K    | 32K        | 70               | Apache 2.0       |

**Live pricing** (input, cached input, and output rates per model) is published on the [model catalog](https://cogito.decart.ai/models) and returned by `GET /v1/models`. This is the single source of truth, so the website, the gateway, and your code never drift.
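To read rates from code rather than hard-coding them, something like the following sketch could work. It fetches `GET /v1/models` and indexes the entries by model ID. The base URL placeholder, the bearer-token header, and the OpenAI-style `{"data": [{"id": ...}]}` response envelope are all assumptions that this page does not document; inspect a real response to find the actual pricing field names.

```python
import json
import urllib.request

# Hypothetical placeholder: substitute the API base URL for your Cogito account.
BASE_URL = "https://YOUR-COGITO-BASE-URL"


def index_models(payload: dict) -> dict:
    """Index a models-list payload by model ID.

    Assumes an OpenAI-style envelope: {"data": [{"id": "...", ...}, ...]}.
    """
    return {entry["id"]: entry for entry in payload.get("data", [])}


def fetch_models(api_key: str, base_url: str = BASE_URL) -> dict:
    """Call GET /v1/models and return the entries keyed by model ID."""
    request = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return index_models(json.load(response))
```

Keeping `index_models` separate from the network call makes it easy to exercise against a saved response while the real schema is being confirmed.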

For Kimi K2.6 routes, Moonshot documents the limit as input plus output fitting within the 256K context window. Cogito therefore advertises a 256K max-output cap, but upstream may still reject a request whose prompt leaves too little room for the requested output.
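The shared-window rule can be sketched as a small client-side check. This is only an illustration: it assumes "256K" means 256,000 tokens (the exact figure is not stated here), and the function names are hypothetical. Prompt token counts would have to come from your own tokenizer or from usage data.

```python
# Assumption: "256K" is treated as 256,000 tokens; the exact value is not documented here.
ASSUMED_CONTEXT_TOKENS = 256_000


def max_output_budget(prompt_tokens: int, context_tokens: int = ASSUMED_CONTEXT_TOKENS) -> int:
    """Return the largest output length that still fits alongside the prompt."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(0, context_tokens - prompt_tokens)


def clamp_max_tokens(requested: int, prompt_tokens: int) -> int:
    """Clamp a requested output length to the room left in the context window."""
    return min(requested, max_output_budget(prompt_tokens))


# A 200,000-token prompt leaves room for at most 56,000 output tokens.
print(clamp_max_tokens(requested=64_000, prompt_tokens=200_000))  # 56000
```

Clamping before sending avoids a round trip that upstream would reject anyway.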

**Throughput** is locked at 70 tokens/sec across the fleet for the MVP — we run a uniform serving target while we tune Trainium / GPU autoscaling. Real-world per-request throughput varies with prompt length and concurrent batch saturation.
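As a back-of-the-envelope illustration of the 70 tokens/sec target, the nominal generation time for a response is its output length divided by the rate. The helper name is hypothetical, and the estimate ignores time to first token, queueing, and the per-request variation noted above.

```python
NOMINAL_TPS = 70  # uniform serving target stated above


def nominal_generation_seconds(output_tokens: int, tps: float = NOMINAL_TPS) -> float:
    """Estimate decode time only; excludes time to first token and queueing."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return output_tokens / tps


print(nominal_generation_seconds(7_000))  # 100.0
```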

**Hardware** is managed by Cogito. We route each model to the right silicon for the workload and may transparently shift it between hardware tiers when that lowers P99 latency; output is bit-identical either way.

## Capabilities

All catalog models support:

* Streaming SSE
* Function / tool calling (OpenAI-shape `tools[]`)
* Structured JSON outputs (grammar-constrained decoding)
* Multi-turn chat with system prompts

## Picking a model

* **Default agent / high-volume RAG** → `deepseek-v4-flash`. Cheap, 1M-token window, the live workhorse today.
* **Hardest reasoning + long context** → `deepseek-v4-pro`. 1M-token window, frontier reasoning.
* **Coding agents and tool-heavy workflows** → `moonshotai/kimi-k2.6`. Served from the AWS B300 high-capacity route.
* **Latency-sensitive Kimi requests** → `moonshotai/kimi-k2.6:fast`. CoreWeave B200 low-latency route; lower per-route capacity than the default.
* **Experimental high-concurrency Kimi route** → `kimi-tt`.
* **Cheap general chat / coding** → `gpt-oss-120b`. Cheapest path to GPT-4-class output quality.
* **Multilingual + tool use** → `qwen-3-235b`. Strong non-English coverage.
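The guidance above can be captured in a small lookup so that model IDs live in one place in your code. The workload labels are hypothetical names chosen for this sketch; only the model IDs come from the catalog.

```python
# Workload labels are illustrative; the model IDs are taken from the catalog table.
MODEL_BY_WORKLOAD = {
    "default_agent": "deepseek-v4-flash",
    "hard_reasoning": "deepseek-v4-pro",
    "coding_agent": "moonshotai/kimi-k2.6",
    "low_latency_kimi": "moonshotai/kimi-k2.6:fast",
    "experimental_kimi": "kimi-tt",
    "cheap_chat": "gpt-oss-120b",
    "multilingual": "qwen-3-235b",
}


def pick_model(workload: str) -> str:
    """Return the suggested model ID, falling back to the default agent model."""
    return MODEL_BY_WORKLOAD.get(workload, MODEL_BY_WORKLOAD["default_agent"])


print(pick_model("coding_agent"))  # moonshotai/kimi-k2.6
```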

For more on each model, see the per-model pages in the sidebar.
