# Create chat completion

> POST /v1/chat/completions

OpenAI-compatible chat completions endpoint. Supports both streaming and non-streaming responses.

## Request body

<ParamField body="model" type="string" required>
  Model slug from the [catalog](/getting-started/models). Example: `gpt-oss-120b`.
</ParamField>

<ParamField body="messages" type="array" required>
  Conversation history. Each entry is `{ role: "system" | "user" | "assistant" | "tool", content: string }`.
</ParamField>
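
As an illustration of the two required fields, the sketch below builds a request without sending it. The base URL, the API key placeholder, and the Bearer authorization scheme are assumptions borrowed from other OpenAI-compatible APIs, not values stated on this page.

```python
import json
import urllib.request

# Hypothetical base URL and key: substitute the values for your account.
BASE_URL = "https://api.example.com"
API_KEY = "YOUR_API_KEY"


def build_chat_request(model: str, messages: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    body = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Bearer auth is an assumption, not documented on this page.
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


request = build_chat_request(
    "gpt-oss-120b",
    [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name one prime number."},
    ],
)
# To send it: urllib.request.urlopen(request)
```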

<ParamField body="stream" type="boolean" default="false">
  When `true`, responses are streamed as Server-Sent Events. See the [streaming guide](/guides/streaming).
</ParamField>

<ParamField body="temperature" type="number" default="1">
  Sampling temperature from `0` to `2`. Lower values make output more deterministic.
</ParamField>

<ParamField body="top_p" type="number" default="1">
  Nucleus sampling. Use either `temperature` or `top_p`, not both.
</ParamField>

<ParamField body="max_tokens" type="integer">
  Maximum output tokens. Capped per tier — see [pricing](/getting-started/pricing).
  Clamped to the model's `max_output_length` (visible on `/v1/models`).
</ParamField>

<ParamField body="max_completion_tokens" type="integer">
  Same semantics as `max_tokens`; OpenAI's canonical field for o1/o3
  reasoning models. Either field is accepted; if both are sent,
  `max_completion_tokens` wins. Clamped to the model's
  `max_output_length`.
</ParamField>
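
The interaction between `max_tokens` and `max_completion_tokens` can be read as a small precedence rule. The function below is a sketch of that rule as described above, not the gateway's implementation. `effective_output_limit` is a hypothetical name, and the per-tier cap is deliberately left out.

```python
from typing import Optional


def effective_output_limit(
    max_tokens: Optional[int],
    max_completion_tokens: Optional[int],
    max_output_length: int,
) -> Optional[int]:
    """Return the output-token limit implied by the documented rules."""
    # max_completion_tokens wins when both fields are sent.
    requested = max_completion_tokens if max_completion_tokens is not None else max_tokens
    if requested is None:
        # Neither field was sent; the default is not documented here.
        return None
    # Either field is clamped to the model's max_output_length.
    return min(requested, max_output_length)
```

For example, sending `max_tokens: 500` together with `max_completion_tokens: 200` to a model whose `max_output_length` is 4096 yields an effective limit of 200 under this reading.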

<ParamField body="tools" type="array">
  Available tools the model may call. See [function calling](/guides/function-calling).
</ParamField>

<ParamField body="tool_choice" type="string | object" default="auto">
  `"auto"` (default), `"none"`, or `{ type: "function", function: { name: ... } }` to force a call to a specific function.
</ParamField>
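
A request body that forces a particular tool might look like the sketch below. The `get_weather` tool is hypothetical, and its definition assumes the OpenAI function-tool shape; the function calling guide is the authority on what this endpoint accepts.

```python
import json

# Hypothetical tool definition in the OpenAI function-tool shape (an assumption).
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is the weather in Oslo?"}],
    "tools": [get_weather_tool],
    # Force a call to get_weather instead of letting the model decide.
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}

print(json.dumps(payload, indent=2))
```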

<ParamField body="response_format" type="object">
  `{ type: "json_object" }` or `{ type: "json_schema", json_schema: {...} }`. See [structured outputs](/guides/structured-outputs).
</ParamField>

<ParamField body="stop" type="string | string[]">
  Up to 4 stop sequences.
</ParamField>

<ParamField body="frequency_penalty" type="number" default="0">
  `-2.0` to `2.0`. Penalize tokens by their frequency in the response so far.
</ParamField>

<ParamField body="presence_penalty" type="number" default="0">
  `-2.0` to `2.0`. Penalize tokens that have appeared at all.
</ParamField>
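
The ranges listed for the sampling fields can be checked on the client before a request is sent. This is a minimal sketch with a hypothetical helper name. How the gateway itself responds to out-of-range values is not stated here, so the messages are illustrative only.

```python
def check_sampling_params(body: dict) -> list[str]:
    """Return a list of problems found in the sampling-related fields."""
    problems = []

    temperature = body.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be between 0 and 2")

    # The reference recommends setting one of these, not both.
    if "temperature" in body and "top_p" in body:
        problems.append("set temperature or top_p, not both")

    for name in ("frequency_penalty", "presence_penalty"):
        value = body.get(name)
        if value is not None and not -2.0 <= value <= 2.0:
            problems.append(f"{name} must be between -2.0 and 2.0")

    stop = body.get("stop")
    # stop may be a single string or a list of strings.
    sequences = [stop] if isinstance(stop, str) else (stop or [])
    if len(sequences) > 4:
        problems.append("stop accepts at most 4 sequences")

    return problems
```

An empty list means the body passed every check in this sketch; it does not guarantee the gateway will accept the request.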

<ParamField body="reasoning_effort" type="string">
  Reasoning effort hint for models that emit a chain of thought,
  accepted as the standard OpenAI top-level field.

  For DeepSeek-V4, the gateway mirrors this value into
  `chat_template_kwargs.reasoning_effort` and strips the top-level
  field before forwarding, because DeepSeek-V4's chat template only
  consumes the engine-specific form. Without this mirror, top-level
  `reasoning_effort` would be a silent no-op on V4, and a value
  outside OpenAI's enum could interfere with SamplingParams.

  * `"high"` and `"max"` are the values DeepSeek-V4 documents.
  * `"xhigh"`, the OpenRouter-style alias, is mapped to DeepSeek's
    `"max"` ("Think Max" mode).
  * `"minimal"`, `"low"`, and `"medium"` are forwarded literally but
    fall back to the encoder's default branch on this build. They do
    not return a 400, but the reasoning depth may be indistinguishable
    from sending no hint.

  To disable thinking entirely, set
  `chat_template_kwargs.enable_thinking: false`. If you set
  `chat_template_kwargs.reasoning_effort` explicitly, the gateway
  honors your value and leaves the top-level field alone.
</ParamField>
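
To make the DeepSeek-V4 rewrite concrete, the sketch below models it as a pure function over the request body. It is an illustration of the behavior described above, not the gateway's code. `mirror_reasoning_effort` is a hypothetical name, and treating "leaves the top-level field alone" as "returns the request unchanged" is our reading of the text.

```python
import copy


def mirror_reasoning_effort(body: dict) -> dict:
    """Model the described DeepSeek-V4 handling of reasoning_effort."""
    out = copy.deepcopy(body)
    kwargs = out.get("chat_template_kwargs") or {}

    # Nothing to mirror, or the caller already set the engine-specific
    # value explicitly: return the request as it came in (assumed reading).
    if "reasoning_effort" not in out or "reasoning_effort" in kwargs:
        return out

    effort = out.pop("reasoning_effort")  # strip the top-level field
    if effort == "xhigh":
        effort = "max"  # OpenRouter-style alias for Think Max
    out["chat_template_kwargs"] = {**kwargs, "reasoning_effort": effort}
    return out
```

Under this sketch, a body containing only top-level `reasoning_effort: "xhigh"` is forwarded with `chat_template_kwargs: { reasoning_effort: "max" }` and no top-level field.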

<ParamField body="chat_template_kwargs" type="object">
  Engine-specific chat-template knobs forwarded verbatim to the
  upstream. On DeepSeek-V4:

  * `{ enable_thinking: false }` disables reasoning and routes output
    directly to `content`.
  * `{ drop_thinking: true }` drops prior assistant `reasoning_content`
    from the encoded prompt. Cogito defaults `drop_thinking` to `false`
    for reasoning-capable models, so multi-turn requests preserve prior
    reasoning traces unless you explicitly opt out.
  * `{ reasoning_effort: "high" | "max" }` selects the model's
    Think-High or Think-Max mode.

  When `response_format` is set, the gateway injects
  `enable_thinking: false` by default so that JSON-mode and structured
  outputs land in `content`, not `reasoning_content`. Any value you
  supply here always wins.
</ParamField>
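
The precedence between gateway defaults and caller-supplied `chat_template_kwargs` can be sketched as a dictionary merge. This is an illustration under assumptions: `apply_thinking_defaults` is a hypothetical name, the target model is assumed to be reasoning-capable, and the real gateway may apply its defaults differently.

```python
def apply_thinking_defaults(body: dict) -> dict:
    """Merge the documented defaults under any caller-supplied values."""
    # Default for reasoning-capable models: keep prior reasoning traces.
    defaults = {"drop_thinking": False}
    if "response_format" in body:
        # Structured output should land in content, not reasoning_content.
        defaults["enable_thinking"] = False

    caller = body.get("chat_template_kwargs") or {}
    # Later keys override earlier ones, so the caller's values take precedence.
    return {**body, "chat_template_kwargs": {**defaults, **caller}}
```

A caller who wants reasoning alongside JSON mode would send `chat_template_kwargs: { enable_thinking: true }`, which survives the merge in this sketch.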

## Response (non-streaming)

```json
{
  "id": "req_...",
  "object": "chat.completion",
  "created": 1714521600,
  "model": "gpt-oss-120b",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "..." },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 71,
    "total_tokens": 103,
    "prompt_tokens_details": { "cached_tokens": 0 },
    "completion_tokens_details": { "reasoning_tokens": 0 }
  }
}
```
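
Reading the fields most callers need from this object could look like the following sketch. `summarize_completion` is a hypothetical helper, and the `.get` fallbacks assume the two `*_details` objects may be absent, which this page does not confirm either way.

```python
def summarize_completion(response: dict) -> dict:
    """Pull the assistant text and token accounting out of a chat.completion."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage["total_tokens"],
        "cached_tokens": usage.get("prompt_tokens_details", {}).get("cached_tokens", 0),
        "reasoning_tokens": usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0),
    }


# The example response above, as a Python dict.
example = {
    "id": "req_...",
    "object": "chat.completion",
    "created": 1714521600,
    "model": "gpt-oss-120b",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {
        "prompt_tokens": 32,
        "completion_tokens": 71,
        "total_tokens": 103,
        "prompt_tokens_details": {"cached_tokens": 0},
        "completion_tokens_details": {"reasoning_tokens": 0},
    },
}

summary = summarize_completion(example)
```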

## Response (streaming)

The response has `Content-Type: text/event-stream`. Each event is `data: <chat.completion.chunk JSON>`, and the stream ends with `data: [DONE]`. See the [streaming guide](/guides/streaming).
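
A minimal sketch of consuming that event format follows. It works on an iterable of already-decoded lines so it can run without a network connection. The `choices[].delta.content` chunk layout is an assumption based on OpenAI compatibility, and the helper names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed chat.completion.chunk objects from SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content across chunks (OpenAI chunk shape assumed)."""
    parts = []
    for chunk in iter_chunks(lines):
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


# Hand-written sample stream for illustration; not captured from the API.
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print(collect_text(sample))  # prints "Hello"
```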

## Headers on every response

* `x-request-id` — opaque ID. Log it. We trace it through every layer.
* `x-tokens-used` — billed total for this request (omitted on errors that didn't consume tokens).
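
Both headers can be captured for logging with a few lines. The sketch below uses a hypothetical helper name and assumes `x-tokens-used` carries a plain integer, which this page implies but does not state.

```python
from typing import Mapping, Optional, Tuple


def read_usage_headers(headers: Mapping[str, str]) -> Tuple[Optional[str], Optional[int]]:
    """Return (request_id, tokens_used); tokens_used is None when the header is absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    request_id = lowered.get("x-request-id")
    raw_tokens = lowered.get("x-tokens-used")
    # Assumption: the header value is a plain integer string.
    tokens_used = int(raw_tokens) if raw_tokens is not None else None
    return request_id, tokens_used
```

Handling the missing-header case matters because `x-tokens-used` is omitted on errors that did not consume tokens.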
