Overview - Cogito

Stop watching your agent type.

Cogito runs your agents 14× faster than typical inference providers — on the open-source models you already use. Calibrated against the ~70 tok/s baseline that Together publishes for Llama 70B class models; Cogito’s premium tier sustains 1000+ tok/s on supported workloads. Cogito is the LLM inference platform from Decart, an efficiency-first research lab. It’s built on DOS — the Decart Optimization Stack — which extracts frontier performance from any silicon. The same stack that runs Lucy 2.0 (a large diffusion model) at sub-50ms on AWS Trainium now serves open-source LLMs at 14× the speed of typical providers. Now serving: Kimi K2.6 · DeepSeek V4 (Flash & Pro) · GPT-OSS · Qwen. Inference runs on a hybrid fleet: AWS Trainium for cost-efficient throughput, NVIDIA Blackwell for the largest configurations. Cogito picks the right silicon per workload; you call one API.

Built on DOS

Cogito is built on DOS — the Decart Optimization Stack — a multi-silicon performance layer that routes workloads to optimal silicon and extracts frontier throughput. The same stack runs Lucy 2.0, a large diffusion model, at sub-50ms latency on AWS Trainium. For LLM inference, DOS lets Cogito serve open-source models at 14× the speed of typical providers — particularly for agentic workloads, where multi-step inference latency compounds.

Drop-in OpenAI compatible

The whole API matches the OpenAI spec — chat completions, legacy text completions, streaming SSE, function calling, structured JSON outputs. Swap two values in your existing code and you’re on Cogito:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cogito.decart.ai/v1",  # was: https://api.openai.com/v1
    api_key=os.environ["COGITO_API_KEY"],
)

Three-step quickstart

Create an API key

Pick a model

Browse the model catalog. Each model lists context window, throughput, and per-million-token pricing.

Send a request

Copy the quickstart snippet. Five-minute time-to-first-token.

Why Cogito

Trainium + Blackwell

Hybrid fleet, automatic routing. Frontier-tier throughput on the open-source models you actually use.

Built for operators

Token-aware rate limits, deterministic errors with request_id, hard spend caps, zero retention by default.

Cogito, ergo ship.

Five-minute TTFT. Same API for production and your fine-tunes. Get back to shipping.

Quickstart

​Built on DOS

​Drop-in OpenAI compatible

​Three-step quickstart

​Why Cogito