Create chat completion
POST /v1/chat/completions
OpenAI-compatible chat completions endpoint. Streaming and non-streaming.
Request body
messages: Conversation history. Each entry is { role: "system" | "user" | "assistant" | "tool", content: string }.
stream: When true, responses are streamed as Server-Sent Events. See the streaming guide.
temperature: Sampling temperature, 0 to 2. Lower values give more deterministic output.
top_p: Nucleus sampling. Use either temperature or top_p, not both.
max_tokens: Maximum output tokens. Capped per tier (see pricing). Clamped to the model’s max_output_length (visible on /v1/models).
max_completion_tokens: Same semantics as max_tokens; OpenAI’s canonical field for o1/o3 reasoning models. Either field is accepted; if both are sent, max_completion_tokens wins. Clamped to the model’s max_output_length.
tools: Available tools the model may call. See function calling.
tool_choice: "auto" (default), "none", or { type: "function", function: { name: ... } } to force a specific function.
response_format: { type: "json_object" } or { type: "json_schema", json_schema: {...} }. See structured outputs.
stop: Up to 4 stop sequences.
frequency_penalty: -2.0 to 2.0. Penalizes tokens by their frequency in the response so far.
presence_penalty: -2.0 to 2.0. Penalizes tokens that have appeared at all.
reasoning_effort: Reasoning effort hint for models that emit a chain of thought, accepted as the standard OpenAI top-level field. For DeepSeek-V4, the gateway mirrors this value into chat_template_kwargs.reasoning_effort and strips the top-level field before forwarding, because DeepSeek-V4’s chat template only consumes the engine-specific form. Without this mirror, top-level reasoning_effort is silently a no-op on V4, and it can also interfere with SamplingParams when the value is outside OpenAI’s enum. The OpenRouter-style alias "xhigh" is mapped to DeepSeek’s "max" (“Think Max” mode). DeepSeek-V4 documents "high" and "max"; the other OpenAI tiers (minimal | low | medium) are forwarded literally but fall back to the encoder’s default branch on this build, so they don’t return a 400 but may produce reasoning depth indistinguishable from sending no hint. To disable thinking entirely, set chat_template_kwargs.enable_thinking: false. If you set chat_template_kwargs.reasoning_effort explicitly, the gateway honors your value and leaves the top-level field alone.
chat_template_kwargs: Engine-specific chat-template knobs forwarded verbatim to the upstream. On DeepSeek-V4:
  { enable_thinking: false } disables reasoning and routes output directly to content.
  { drop_thinking: true } drops prior assistant reasoning_content from the encoded prompt. Cogito defaults drop_thinking to false for reasoning-capable models, so multi-turn requests preserve prior reasoning traces unless you explicitly opt out.
  { reasoning_effort: "high" | "max" } selects the model’s Think-High or Think-Max mode.
The gateway injects enable_thinking: false when response_format is set (so JSON-mode and structured outputs land in content, not reasoning_content); any caller-supplied value here always wins.
Response (non-streaming)
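The response schema is not reproduced in this section. As a rough sketch of a non-streaming round trip, the Python below builds a request body that respects the limits listed under Request body and reads the reply. Several things here are assumptions rather than documented behavior: the `model` field and the placeholder name `example-model`, the OpenAI-style `choices[0].message` reply shape, and the placement of `reasoning_content` next to `content` on that message. The helper names `build_request` and `read_reply` are hypothetical, and rejecting `temperature` together with `top_p` is a client-side choice, not something the gateway is stated to enforce.

```python
from typing import Any, Optional


def build_request(
    model: str,  # assumption: a model field as in the OpenAI API
    messages: list[dict[str, str]],
    *,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    max_tokens: Optional[int] = None,
    stop: Optional[list[str]] = None,
    chat_template_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build a non-streaming request body, checking the documented limits."""
    if temperature is not None and top_p is not None:
        raise ValueError("use either temperature or top_p, not both")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if stop is not None and len(stop) > 4:
        raise ValueError("at most 4 stop sequences are allowed")

    body: dict[str, Any] = {"model": model, "messages": messages, "stream": False}
    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stop": stop,
        "chat_template_kwargs": chat_template_kwargs,
    }
    # Only send fields the caller actually set.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


def read_reply(response: dict[str, Any]) -> tuple[Optional[str], Optional[str]]:
    """Return (content, reasoning_content) from the first choice.

    Assumes an OpenAI-style reply: response["choices"][0]["message"].
    """
    message = response["choices"][0]["message"]
    return message.get("content"), message.get("reasoning_content")
```

With `chat_template_kwargs={"enable_thinking": False}`, the second element returned by `read_reply` would be expected to be empty, since output is then routed directly to `content`.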
Response (streaming)
Content-Type: text/event-stream. Each event is data: <chat.completion.chunk JSON>. Stream ends with data: [DONE]. See the streaming guide.
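The event framing described above can be parsed with a few lines of Python. This is a minimal sketch, not a full Server-Sent Events client: it assumes each event carries a single `data:` line, and the `choices[].delta.content` layout used by `collect_text` is assumed from the OpenAI chunk format rather than stated in this section. The helper names `iter_chunks` and `collect_text` are hypothetical.

```python
import json
from typing import Any, Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one parsed chunk per `data:` line until `data: [DONE]`."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel; ignore anything after it
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate streamed text (assumed location: choices[].delta.content)."""
    parts: list[str] = []
    for chunk in iter_chunks(lines):
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)
```

In practice `lines` would be the decoded line iterator of whatever HTTP client is in use; the parser itself does not depend on a particular library.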
Headers on every response
x-request-id: opaque ID. Log it. We trace it through every layer.
x-tokens-used: billed total for this request (omitted on errors that didn’t consume tokens).
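One way to capture both headers for logging is sketched below. It assumes `x-tokens-used` is a base-10 integer string, which this section does not state, and `read_gateway_headers` is a hypothetical helper name. Header names are matched case-insensitively because HTTP clients differ in how they expose them.

```python
from typing import Mapping, Optional


def read_gateway_headers(headers: Mapping[str, str]) -> tuple[Optional[str], Optional[int]]:
    """Return (request_id, tokens_used); tokens_used is None when the header is omitted."""
    lowered = {name.lower(): value for name, value in headers.items()}
    request_id = lowered.get("x-request-id")
    raw_tokens = lowered.get("x-tokens-used")
    # Assumption: the value is an integer token count.
    tokens_used = int(raw_tokens) if raw_tokens is not None else None
    return request_id, tokens_used
```

Treating a missing `x-tokens-used` as `None` rather than `0` keeps "no tokens billed" distinguishable from "header not sent" in logs.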