Stop watching your agent type.Cogito runs your agents 14× faster than typical inference providers — on the open-source models you already use. Calibrated against the ~70 tok/s baseline that Together publishes for Llama 70B class models; Cogito’s premium tier sustains 1000+ tok/s on supported workloads. Cogito is the LLM inference platform from Decart, an efficiency-first research lab. It’s built on DOS — the Decart Optimization Stack — which extracts frontier performance from any silicon. The same stack that runs Lucy 2.0 (a large diffusion model) at sub-50ms on AWS Trainium now serves open-source LLMs at 14× the speed of typical providers. Now serving: Kimi K2.6 · DeepSeek V4 (Flash & Pro) · GPT-OSS · Qwen. Inference runs on a hybrid fleet: AWS Trainium for cost-efficient throughput, NVIDIA Blackwell for the largest configurations. Cogito picks the right silicon per workload; you call one API.
Built on DOS
Cogito is built on DOS — the Decart Optimization Stack — a multi-silicon performance layer that routes workloads to optimal silicon and extracts frontier throughput. The same stack runs Lucy 2.0, a large diffusion model, at sub-50ms latency on AWS Trainium. For LLM inference, DOS lets Cogito serve open-source models at 14× the speed of typical providers — particularly for agentic workloads, where multi-step inference latency compounds.Drop-in OpenAI compatible
The whole API matches the OpenAI spec — chat completions, legacy text completions, streaming SSE, function calling, structured JSON outputs. Swap two values in your existing code and you’re on Cogito:Three-step quickstart
Create an API key
Sign up at cogito.decart.ai/sign-up. You start with free credits — no card required.
Pick a model
Browse the model catalog. Each model lists context window, throughput, and per-million-token pricing.
Send a request
Copy the quickstart snippet. Five-minute time-to-first-token.
Why Cogito
Trainium + Blackwell
Hybrid fleet, automatic routing. Frontier-tier throughput on the open-source models you actually use.
Built for operators
Token-aware rate limits, deterministic errors with
request_id, hard spend caps, zero retention by default.Cogito, ergo ship.
Five-minute TTFT. Same API for production and your fine-tunes. Get back to shipping.