POST /v1/chat/completions
Create chat completion
curl --request POST \
  --url https://api.cogito.decart.ai/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "gpt-oss-120b",
  "messages": [
    { "role": "user", "content": "Say hello." }
  ],
  "stream": false,
  "temperature": 1,
  "max_tokens": 256,
  "stop": [
    "END"
  ],
  "frequency_penalty": 0,
  "presence_penalty": 0
}
'
OpenAI-compatible chat completions endpoint. Supports both streaming and non-streaming responses.
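For callers who prefer Python over curl, the request can be assembled with the standard library alone. This is a minimal sketch, not an official client: the helper name build_chat_request is hypothetical, the URL is copied from the curl example, and no authentication header is added because this page does not show one. Add whatever credentials your account requires before sending.

```python
import json
import urllib.request

# URL copied from the curl example on this page.
API_URL = "https://api.cogito.decart.ai/v1/chat/completions"


def build_chat_request(model, messages, **options):
    """Build (but do not send) a POST request for the chat completions endpoint.

    `options` holds optional body fields such as temperature or max_tokens.
    Entries set to None are left out so the documented defaults apply.
    """
    body = {"model": model, "messages": messages}
    body.update({key: value for key, value in options.items() if value is not None})
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "gpt-oss-120b",
    [{"role": "user", "content": "Say hello."}],
    temperature=0.2,
    max_tokens=64,
    top_p=None,  # None means "omit this field from the body"
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Dropping None-valued options keeps the body small and lets the server-side defaults take effect.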

Request body

model
string
required
Model slug from the catalog. Example: gpt-oss-120b.
messages
array
required
Conversation history. Each entry is { role: "system" | "user" | "assistant" | "tool", content: string }.
stream
boolean
default:"false"
When true, responses are streamed as Server-Sent Events. See the streaming guide.
temperature
number
default:"1"
Sampling temperature, from 0 to 2. Lower values make the output more deterministic.
top_p
number
default:"1"
Nucleus sampling. Use either temperature or top_p, not both.
max_tokens
integer
Maximum output tokens. Capped per tier — see pricing. Clamped to the model’s max_output_length (visible on /v1/models).
max_completion_tokens
integer
Same semantics as max_tokens; OpenAI’s canonical field for o1/o3 reasoning models. Either field is accepted; if both are sent, max_completion_tokens wins. Clamped to the model’s max_output_length.
tools
array
Available tools the model may call. See function calling.
tool_choice
string | object
default:"auto"
"auto" (default) lets the model decide, "none" disables tool calls, and { type: "function", function: { name: ... } } forces a call to the named function.
response_format
object
{ type: "json_object" } or { type: "json_schema", json_schema: {...} }. See structured outputs.
stop
string | string[]
Up to 4 stop sequences.
frequency_penalty
number
default:"0"
-2.0 to 2.0. Penalize tokens by their frequency in the response so far.
presence_penalty
number
default:"0"
-2.0 to 2.0. Penalize tokens that have already appeared in the response, regardless of how often.
reasoning_effort
string
Reasoning-effort hint for models that emit a chain of thought, accepted as the standard OpenAI top-level field.
For DeepSeek-V4, the gateway mirrors this value into chat_template_kwargs.reasoning_effort and strips the top-level field before forwarding, because DeepSeek-V4’s chat template only consumes the engine-specific form. Without this mirror, a top-level reasoning_effort is silently a no-op on V4, and a value outside OpenAI’s enum can also interfere with SamplingParams.
The OpenRouter-style alias "xhigh" is mapped to DeepSeek’s "max" (“Think Max” mode). DeepSeek-V4 documents only "high" and "max". The other OpenAI tiers (minimal | low | medium) are forwarded literally but fall back to the encoder’s default branch on this build: they do not return a 400, but the resulting reasoning depth may be indistinguishable from sending no hint.
To disable thinking entirely, set chat_template_kwargs.enable_thinking: false. If you set chat_template_kwargs.reasoning_effort explicitly, the gateway honors your value and leaves the top-level field alone.
chat_template_kwargs
object
Engine-specific chat-template knobs forwarded verbatim to the upstream. On DeepSeek-V4:
  • { enable_thinking: false } disables reasoning and routes output directly to content.
  • { drop_thinking: true } drops prior assistant reasoning_content from the encoded prompt. Cogito defaults drop_thinking to false for reasoning-capable models, so multi-turn requests preserve prior reasoning traces unless you explicitly opt out.
  • { reasoning_effort: "high" | "max" } selects the model’s Think-High or Think-Max mode.
The gateway force-injects enable_thinking: false when response_format is set, so that JSON-mode and structured outputs land in content rather than reasoning_content. Any caller-supplied value here always wins.
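The DeepSeek-V4 rewriting rules described for reasoning_effort and chat_template_kwargs can be summarized in code. The sketch below is an illustration of the behavior as documented on this page, not the gateway's actual implementation. The function name normalize_for_deepseek_v4 and the placeholder model slug are hypothetical.

```python
import copy


def normalize_for_deepseek_v4(body):
    """Return a copy of `body` rewritten as this page describes for DeepSeek-V4.

    Illustrative only: the real gateway code is not part of this reference.
    """
    out = copy.deepcopy(body)
    kwargs = dict(out.get("chat_template_kwargs") or {})

    effort = out.get("reasoning_effort")
    if effort is not None and "reasoning_effort" not in kwargs:
        # Mirror the top-level hint into the engine-specific form,
        # mapping the OpenRouter-style alias "xhigh" to "max".
        kwargs["reasoning_effort"] = "max" if effort == "xhigh" else effort
        # Strip the top-level field before forwarding.
        del out["reasoning_effort"]
    # If the caller set chat_template_kwargs.reasoning_effort explicitly,
    # that value is kept and the top-level field is left untouched.

    if out.get("response_format") is not None:
        # Inject enable_thinking: false for JSON mode and structured outputs,
        # unless the caller already supplied a value (the caller always wins).
        kwargs.setdefault("enable_thinking", False)

    if kwargs:
        out["chat_template_kwargs"] = kwargs
    return out


# "MODEL_SLUG" is a placeholder, not a real catalog entry.
forwarded = normalize_for_deepseek_v4(
    {"model": "MODEL_SLUG", "reasoning_effort": "xhigh"}
)
```

In this example the forwarded body carries chat_template_kwargs.reasoning_effort set to "max" and no top-level reasoning_effort.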

Response (non-streaming)

{
  "id": "req_...",
  "object": "chat.completion",
  "created": 1714521600,
  "model": "gpt-oss-120b",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "..." },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 71,
    "total_tokens": 103,
    "prompt_tokens_details": { "cached_tokens": 0 },
    "completion_tokens_details": { "reasoning_tokens": 0 }
  }
}
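
As a rough illustration of reading this payload, the sketch below pulls out the assistant text and the token accounting. The helper name summarize_completion is hypothetical, and the input is the sample response above with its placeholder values left as they are.

```python
import json

# The sample response from this page, verbatim (placeholders included).
SAMPLE_RESPONSE = """
{
  "id": "req_...",
  "object": "chat.completion",
  "created": 1714521600,
  "model": "gpt-oss-120b",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "..." },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 71,
    "total_tokens": 103,
    "prompt_tokens_details": { "cached_tokens": 0 },
    "completion_tokens_details": { "reasoning_tokens": 0 }
  }
}
"""


def summarize_completion(payload):
    """Collect the fields most callers need from a chat.completion object."""
    choice = payload["choices"][0]
    usage = payload.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage.get("total_tokens"),
        "cached_tokens": usage.get("prompt_tokens_details", {}).get("cached_tokens", 0),
        "reasoning_tokens": usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0),
    }


summary = summarize_completion(json.loads(SAMPLE_RESPONSE))
```

In the sample, total_tokens (103) is the sum of prompt_tokens (32) and completion_tokens (71).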

Response (streaming)

Streamed responses use Content-Type: text/event-stream. Each event has the form data: <chat.completion.chunk JSON>, and the stream ends with data: [DONE]. See the streaming guide.
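
A minimal sketch of consuming such a stream follows. It works on an iterable of already-decoded text lines, so it is independent of any HTTP library. The helper names are hypothetical, and the choices[].delta.content chunk layout is assumed from the OpenAI chunk format, since this page does not spell out the chunk schema.

```python
import json


def iter_chunks(lines):
    """Yield decoded chunk objects from an iterable of SSE text lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(data)


def collect_text(lines):
    """Concatenate streamed content pieces (assumes OpenAI-style delta chunks)."""
    parts = []
    for chunk in iter_chunks(lines):
        for choice in chunk.get("choices", []):
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)


# Hand-written example lines, not captured from the live API.
example_stream = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
text = collect_text(example_stream)
```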

Headers on every response

  • x-request-id — opaque ID. Log it. We trace it through every layer.
  • x-tokens-used — billed total for this request (omitted on errors that didn’t consume tokens).
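
One way to capture both headers for logging is sketched below. The helper name read_usage_headers is hypothetical, and parsing x-tokens-used as a decimal integer is an assumption, since this page does not state the header's format.

```python
def read_usage_headers(headers):
    """Extract the documented response headers from a header mapping.

    Header names are matched case-insensitively. x-tokens-used may be
    absent on errors that consumed no tokens, in which case None is returned.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    tokens = lowered.get("x-tokens-used")
    return {
        "request_id": lowered.get("x-request-id"),
        # Assumption: the value is a plain decimal integer.
        "tokens_used": int(tokens) if tokens is not None else None,
    }


# Hand-written example values, not real API output.
ok = read_usage_headers({"X-Request-Id": "req_example", "X-Tokens-Used": "103"})
failed = read_usage_headers({"x-request-id": "req_example"})
```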