| Model | Family | Context | Max output | TPS | License |
|---|---|---|---|---|---|
moonshotai/kimi-k2.6 | Moonshot | 256K | 256K | 70 | Modified MIT |
moonshotai/kimi-k2.6:fast | Moonshot | 256K | 256K | 70 | Modified MIT |
kimi-tt | Moonshot | 256K | 256K | 70 | Modified MIT |
deepseek-v4-pro | DeepSeek | 1M | 128K | 70 | DeepSeek License |
deepseek-v4-flash | DeepSeek | 1M | 1M | 70 | DeepSeek License |
gpt-oss-120b | OpenAI | 128K | 32K | 70 | Apache 2.0 |
qwen-3-235b | Alibaba | 256K | 32K | 70 | Apache 2.0 |
GET /v1/models. Single source of truth so the website, the gateway, and your code never drift.
For Kimi K2.6 routes, Moonshot documents the limit as input plus output fitting within the 256K context window; Cogito therefore advertises a 256K max-output cap while upstream may still reject requests whose prompt leaves insufficient room.
Throughput is locked at 70 tokens/sec across the fleet for the MVP — we run a uniform serving target while we tune Trainium / GPU autoscaling. Real-world per-request throughput varies with prompt length and concurrent batch saturation.
Hardware is managed by Cogito. We route each model to the right silicon for the workload and may transparently shift between tiers if it produces lower P99 — output is bit-identical either way.
Capabilities
All catalog models support:- Streaming SSE
- Function / tool calling (OpenAI-shape
tools[]) - Structured JSON outputs (grammar-constrained decoding)
- Multi-turn chat with system prompts
Picking a model
- Default agent / high-volume RAG →
deepseek-v4-flash. Cheap, 1M-token window, the live workhorse today. - Hardest reasoning + long context →
deepseek-v4-pro. 1M-token window, frontier reasoning. - Coding agents and tool-heavy workflows →
moonshotai/kimi-k2.6. Served from the AWS B300 high-capacity route. - Latency-sensitive Kimi requests →
moonshotai/kimi-k2.6:fast. CoreWeave B200 low-latency route; lower per-route capacity than the default. - Experimental high-concurrency Kimi route →
kimi-tt. - Cheap general chat / coding →
gpt-oss-120b. Cheapest path to GPT-4-class output quality. - Multilingual + tool use →
qwen-3-235b. Strong non-English coverage.