ComputeBoard

Introduction

Last updated June 29, 2026

ComputeBoard is a single, OpenAI-compatible endpoint that gives you access to every major AI model through one API key. Instead of integrating each provider separately and guessing which model to use, you send one request and our router picks the best model for it in real time — scoring on latency, cost, availability, and performance — then returns an OpenAI-shaped response.

One endpoint. One key. One bill. ComputeBoard sits between your application and the underlying model providers, so you can ship against a stable interface while the routing layer keeps your traffic on the fastest, cheapest, and most reliable model that meets the quality bar for each request.

What is ComputeBoard

ComputeBoard is an intelligent AI gateway. Every request to https://api.computeboard.xyz/v1/chat/completions is evaluated by a smart router that selects a model on a per-request basis. When you send model: "auto" (the default), the router weighs four criteria for the specific prompt you sent:

  • Latency — measured time-to-first-token and total response time across providers, so interactive workloads stay fast.
  • Cost — live per-token pricing for each candidate model, used to avoid overpaying for requests that a cheaper model can answer just as well.
  • Availability — real-time provider health and capacity, with automatic failover away from rate-limited or degraded endpoints.
  • Performance — task-quality signals (reasoning, coding, vision, long-context) that match the prompt to a model strong enough to handle it.

Why ComputeBoard

  • One API for every model — integrate once and reach 37+ models from leading providers without writing a new client for each.
  • No vendor lock-in — the interface is the standard OpenAI shape. You can pin a specific model, route to a class, or fall back to direct provider access at any time. Nothing about ComputeBoard is proprietary in your code.
  • Automatic savings — by routing cheaper-but-capable models when quality allows, ComputeBoard reduces spend on requests that do not need a frontier model. Each response reports how much you saved versus a fixed-frontier baseline.
  • Built-in reliability — when a provider is slow or down, the router fails over to the next-best model instead of surfacing an error to your users.

How it works

Each request flows through the same path. You send a standard chat completion; the router scores every eligible model against your prompt and current conditions; the winning model serves the request; and you receive an OpenAI-shaped response with a small computeboard metadata block that tells you exactly which model handled it and what it saved.

Step            What happens
1 · Request     Your app POSTs an OpenAI-shaped chat completion with model: "auto".
2 · Router      Eligible models are filtered by required capabilities (vision, tools, context length).
3 · Scoring     Each candidate is scored on latency, cost, availability, and performance.
4 · Dispatch    The highest-scoring healthy model serves the request; failover is automatic.
5 · Response    An OpenAI-shaped result returns with a computeboard meta block (routed_to, baseline, saved_pct).
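
The filter, score, and dispatch steps in the table can be sketched in a few lines. This is only an illustration of the idea, not ComputeBoard's actual router: the candidate fields (latencyMs, costPer1k, quality, healthy, capabilities), the weights, the slugs, and the function name pickModel are all assumptions made for the example.

```javascript
// Hypothetical sketch of filter -> score -> dispatch.
// Field names and weights are illustrative assumptions, not ComputeBoard internals.
function pickModel(candidates, required = [], weights = { latency: 0.3, cost: 0.3, quality: 0.4 }) {
  // Step 2: keep healthy models that offer every required capability.
  const eligible = candidates.filter(
    (m) => m.healthy && required.every((cap) => m.capabilities.includes(cap))
  );
  if (eligible.length === 0) return null;

  // Normalize latency and cost so that lower values score closer to 1.
  const maxLatency = Math.max(...eligible.map((m) => m.latencyMs)) || 1;
  const maxCost = Math.max(...eligible.map((m) => m.costPer1k)) || 1;

  // Step 3: one weighted score per candidate.
  const scored = eligible.map((m) => ({
    slug: m.slug,
    score:
      weights.latency * (1 - m.latencyMs / maxLatency) +
      weights.cost * (1 - m.costPer1k / maxCost) +
      weights.quality * m.quality,
  }));

  // Step 4: the highest-scoring healthy model wins.
  scored.sort((a, b) => b.score - a.score);
  return scored[0].slug;
}
```

Calling pickModel(candidates, ["vision"]) returns the slug of the best vision-capable candidate, or null when nothing eligible is healthy. A real router would presumably use live measurements rather than static numbers.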

Drop-in compatible

ComputeBoard speaks the OpenAI API. If you already use an OpenAI SDK, the only change you need is the base URL and your ComputeBoard key — your existing code keeps working.

client.js
    import OpenAI from "openai";

    const client = new OpenAI({
      apiKey: process.env.COMPUTEBOARD_API_KEY, // ck_live_...
      baseURL: "https://api.computeboard.xyz/v1", // the only change vs OpenAI
    });

    const res = await client.chat.completions.create({
      model: "auto",
      messages: [{ role: "user", content: "Hello from ComputeBoard" }],
    });

    console.log(res.choices[0].message.content);

OpenAI SDK compatible
ComputeBoard is fully compatible with the official OpenAI SDKs and any OpenAI-compatible client. Point the base URL at https://api.computeboard.xyz/v1, use your ck_live_ key, and every method you already call — chat completions, streaming, tools — works unchanged.

Quick Start

Last updated June 29, 2026

Go from zero to your first routed completion in under five minutes. ComputeBoard is OpenAI-compatible, so you can use the official SDKs and only change the base URL and key.

1. Create an API key

Open the dashboard and go to the API Keys page. Click Create key, give it a name (for example production), and copy the key that is shown. It begins with ck_live_ and is displayed only once, so store it somewhere safe, such as a secret manager or your deployment's environment variables.

2. Install the SDK

ComputeBoard works with the official OpenAI SDK. Install it for your language:

npm
npm install openai

3. Make your first request

Point the client at https://api.computeboard.xyz/v1, pass your key, and send a chat completion with model: "auto" to let the router choose the best model.

first-request.js
    import OpenAI from "openai";

    const client = new OpenAI({
      apiKey: process.env.COMPUTEBOARD_API_KEY, // ck_live_...
      baseURL: "https://api.computeboard.xyz/v1",
    });

    const res = await client.chat.completions.create({
      model: "auto",
      messages: [
        { role: "user", content: "Explain what an AI gateway does in one sentence." },
      ],
    });

    console.log(res.choices[0].message.content);
    console.log("routed to:", res.computeboard.routed_to);

4. Receive the response

You get back a standard OpenAI chat completion. ComputeBoard adds a single extra field — computeboard — describing which model served the request and how much it saved versus always using a frontier model.

response.json
    {
      "id": "chatcmpl_8x2pQ1vK4mZ",
      "object": "chat.completion",
      "created": 1751212800,
      "model": "claude-haiku-4.5",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "An AI gateway is a single API that routes each request to the best available model so you don't have to integrate or choose providers yourself."
          },
          "finish_reason": "stop"
        }
      ],
      "usage": {

Full example

Prefer to test without an SDK? This single curl command is the complete request — copy it, paste your key, and run it.

first-request.sh
    curl https://api.computeboard.xyz/v1/chat/completions \
      -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "auto",
        "messages": [
          { "role": "user", "content": "Write a haiku about smart routing." }
        ]
      }'

Start with auto
Leave model as "auto" unless you have a reason not to. The router picks the best model per request and reports its choice in the computeboard meta. You can always pin a specific model — or use "cheapest", "fastest", or "best" — once you know your workload.

Authentication

Last updated June 29, 2026

ComputeBoard authenticates every request with a secret API key passed as a Bearer token. There are no sessions, cookies, or signatures to manage — one header authorizes your call.

API keys

Authentication uses the standard Authorization: Bearer scheme. Create a key in the dashboard, then send it on every request to https://api.computeboard.xyz/v1. Live keys are prefixed ck_live_. Requests without a valid key are rejected with 401 Unauthorized.

auth.sh
    curl https://api.computeboard.xyz/v1/chat/completions \
      -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "auto",
        "messages": [{ "role": "user", "content": "ping" }]
      }'

When you use an OpenAI SDK, the client sets this header for you — just pass your key as apiKey (JavaScript) or api_key (Python) and point baseURL at ComputeBoard.

Keeping keys secure

An API key is a credential that can spend money and read your usage. Treat it like a password and follow these practices:

  • Never expose keys in client code. Browsers, mobile apps, and any shipped frontend can be inspected — a key embedded there is effectively public.
  • Call ComputeBoard from your server. Proxy requests through your own backend so the key never leaves an environment you control.
  • Load keys from environment variables (for example COMPUTEBOARD_API_KEY) or a secret manager — never hard-code them in source.
  • Keep keys out of version control. Add .env files to .gitignore and scan commits for accidental secrets.
  • Scope one key per environment. Use separate keys for development, staging, and production so a leak in one place cannot affect the others.
  • Rotate regularly, and immediately on any suspected exposure.
Leaked keys can be abused
Anyone with your ck_live_ key can make requests billed to your account. If a key is committed to a repository, posted in a chat, or shipped to a browser, revoke it in the dashboard immediately and issue a replacement. Never log full keys or include them in error reports.
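
Two of the practices above, loading the key from the environment and never logging it in full, can be sketched as small helpers. The function names loadApiKey and maskKey are hypothetical and not part of any SDK. The masked form mirrors the prefix style the dashboard shows (ck_live_4f8c…).

```javascript
// Read the key from an environment-like object and fail early if it is
// missing or does not look like a live key.
function loadApiKey(env) {
  const key = env.COMPUTEBOARD_API_KEY;
  if (typeof key !== "string" || !key.startsWith("ck_live_")) {
    throw new Error("COMPUTEBOARD_API_KEY is missing or is not a ck_live_ key");
  }
  return key;
}

// Produce a log-safe form: the ck_live_ prefix plus a few characters,
// never the full secret.
function maskKey(key, visible = 4) {
  const prefix = "ck_live_";
  return key.slice(0, prefix.length + visible) + "…";
}
```

On a server you would call loadApiKey(process.env) once at startup and pass only maskKey(key) to logs or error reports.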

Example request

A complete authenticated request in JavaScript. The key is read from an environment variable so it never appears in your source code.

request.js
    import OpenAI from "openai";

    const client = new OpenAI({
      apiKey: process.env.COMPUTEBOARD_API_KEY, // ck_live_... from your environment
      baseURL: "https://api.computeboard.xyz/v1",
    });

    const res = await client.chat.completions.create({
      model: "auto",
      messages: [{ role: "user", content: "Authenticated request OK?" }],
    });

    console.log(res.choices[0].message.content);

API Keys

Last updated June 29, 2026

API keys authorize your requests to ComputeBoard. You create, rotate, and revoke them from the dashboard, and you can hold as many as you need — one per service or environment is a good default.

Creating a key

Open the dashboard and go to API Keys, then click Create key. Give it a descriptive name so you can tell keys apart later (for example web-prod or worker-staging). The full secret is shown once, at creation time — copy it immediately into your secret manager or environment, because ComputeBoard stores only a hashed version and cannot display it again.

Key format
    # Live key: shown once at creation, store it securely
    ck_live_4f8c2a9d1e7b6034a5c9f12d8e0b7a36

    # Reference it from an environment variable, never inline
    export COMPUTEBOARD_API_KEY="ck_live_4f8c2a9d1e7b6034a5c9f12d8e0b7a36"

After creation the dashboard only ever shows the key's prefix (for example ck_live_4f8c…) so you can identify it without revealing the secret.

Field        Description
Name         A label you assign to identify the key (e.g. web-prod).
Prefix       First characters of the key, shown for identification (ck_live_4f8c…).
Created      Timestamp the key was issued.
Last used    Timestamp of the most recent request authorized by this key.
Status       Active or Revoked.

Revoking a key

Revoke a key the moment it is no longer needed or you suspect it has leaked. Revocation is immediate and permanent — any request using a revoked key is rejected with 401 Unauthorized.

1. Open the API Keys page

Go to the API Keys page in the dashboard and find the key by its name or prefix.

2. Revoke it

Click Revoke and confirm. The key stops working instantly across all environments.

3. Replace where needed

If the key was in active use, create a replacement and deploy it before revoking the old key, or at the same time, to avoid downtime (see Rotating keys below).

Rotating keys

Rotation replaces a key without an interruption in service. Because ComputeBoard lets you hold multiple active keys at once, you can run the old and new keys side by side during the switch — the classic create-deploy-revoke flow, with zero downtime.

1. Create a new key

Issue a fresh key in the dashboard and copy its secret into your secret manager.

2. Deploy the new key

Update your environment variables and roll out the change. Both keys are valid at this point, so traffic never breaks mid-deploy.

3. Verify the new key is live

Confirm the new key's Last used timestamp is advancing and the old key has gone quiet in the dashboard.

4. Revoke the old key

Once all traffic is on the new key, revoke the old one. The rotation is complete with no dropped requests.
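
The verification step can be expressed as a simple check on the two Last used timestamps. The sketch below assumes you have copied those timestamps from the dashboard into plain objects. The lastUsed field name, the one-hour quiet window, and the function safeToRevoke are illustrative assumptions, not a documented API.

```javascript
// Decide whether the old key can be revoked: the new key must have been used
// recently, and the old key must have been idle for at least `quietMs`.
function safeToRevoke(oldKey, newKey, now, quietMs = 60 * 60 * 1000) {
  // A key that has never been used is treated as infinitely idle.
  const oldLast = oldKey.lastUsed ? Date.parse(oldKey.lastUsed) : -Infinity;
  const newLast = newKey.lastUsed ? Date.parse(newKey.lastUsed) : -Infinity;

  const newIsLive = now - newLast <= quietMs;
  const oldIsQuiet = now - oldLast >= quietMs;
  return newIsLive && oldIsQuiet;
}
```

Pick a quiet window longer than your slowest deploy or longest-running job, so that a worker still holding the old key has had time to pick up the new one.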

Best practices

  • One key per service and environment. Separate keys for each app and for dev, staging, and production limit the blast radius of a leak and make usage easy to attribute.
  • Minimize exposure. Keep keys server-side, load them from environment variables or a secret manager, and never embed them in client code, logs, or version control.
  • Monitor usage. Watch each key's request volume and last-used time in the dashboard; unexpected activity is an early signal of a leak or a misconfiguration.
  • Rotate on a schedule — and immediately on suspicion. Rotate keys periodically as a matter of hygiene, and revoke and replace any key the instant you think it may be exposed.
Automate rotation
Wire the create-deploy-revoke flow into your secret manager and deployment pipeline so rotation is a routine, non-event. Holding multiple active keys at once means you can rotate as often as your security policy requires without ever taking downtime.

Models

Last updated June 29, 2026

ComputeBoard speaks to every major model provider through one OpenAI-compatible endpoint. Instead of integrating, billing, and maintaining a separate SDK for each vendor, you call a single API and let the router put your request in front of the right model.

The fastest way to use the catalog is not to pick a model at all. Set model: "auto" and the router scores every healthy candidate on latency, cost, availability, and measured quality for your prompt, then dispatches to the best one, typically adding less than a millisecond of overhead. Every response tells you exactly which model answered and how much it saved versus a fixed baseline, so you keep full visibility while the platform does the work.

Prefer to stay in control? You can pin any model by its slug, or steer the router with a high-level policy like cheapest, fastest, or best. Every option shares the same request and response shape, so switching is a one-line change.

Available models

These models are routable today. Capabilities such as vision and tool (function) calling are normalized across providers, so the same request works no matter where it lands.

Model         Provider     Context        Availability    Capabilities
GPT-4.1       OpenAI       1M tokens      99.9%           Vision, Functions, Streaming
Claude 4      Anthropic    200K tokens    99.8%           Vision, Functions, Streaming
Gemini 2.5    Google       1M tokens      99.5%           Vision, Functions, Streaming
DeepSeek      DeepSeek     128K tokens    99.0%           Vision, Functions, Streaming
Llama         Meta         128K tokens    99.2%           Vision, Functions, Streaming
Qwen          Alibaba      131K tokens    98.9%           Vision, Functions, Streaming
Grok          xAI          131K tokens    98.6%           Vision, Functions, Streaming
Mistral       Mistral      128K tokens    99.1%           Vision, Functions, Streaming

Choosing a model

The model field on a chat completion request accepts five kinds of values. Four are routing policies that leave the choice to ComputeBoard, and the fifth is a fixed model slug that pins the request to one specific model.

model         Behavior
"auto"        Balanced default. Scores every healthy model on latency, cost, availability, and prompt-fit, then routes to the best overall trade-off. Recommended for most workloads.
"cheapest"    Optimize for cost. Routes to the lowest-priced model that can satisfy the request, falling back up the price ladder only when a cheaper model is unavailable.
"fastest"     Optimize for latency. Routes to the model with the lowest current time-to-first-token and end-to-end latency. Ideal for interactive UIs.
"best"        Optimize for quality. Routes to the highest-scoring model on capability and measured output quality, regardless of price.
"gpt-4.1"     Pin a specific model by slug (e.g. gpt-4.1, claude-4, gemini-2.5). Routing is bypassed; the request always goes to that model. Use when you need deterministic, reproducible behavior.

When you pin a slug and that model is temporarily unavailable, the request fails fast with a clear error rather than silently switching models — pinning means you get exactly what you asked for. Policies, by contrast, are designed to route around outages automatically.
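
Because policies and pinned slugs fail differently, application code may want to know which kind of value it is sending before deciding how to handle errors. The helper below is a hypothetical sketch (describeModel is not an SDK function). It encodes only what the text above states: the four policy names, and the fact that policies fail over while pinned slugs do not.

```javascript
// The four routing policies named in the table above.
const POLICIES = ["auto", "cheapest", "fastest", "best"];

// Classify a `model` value. Anything that is not a policy is treated as a
// pinned slug, which bypasses routing.
function describeModel(model = "auto") {
  const isPolicy = POLICIES.includes(model);
  return {
    model,
    kind: isPolicy ? "policy" : "pinned",
    // Policies route around outages; pinned slugs fail fast instead.
    failsOver: isPolicy,
  };
}
```

A caller could use failsOver to decide whether to add its own retry or fallback logic around a pinned request.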

Listing models

Fetch the live catalog at runtime with a standard, OpenAI-shaped list endpoint. The response includes every model slug you can pin, so you can build dynamic model pickers without hard-coding names.

list-models.sh
    curl https://api.computeboard.xyz/v1/models \
      -H "Authorization: Bearer ck_live_..."

Note
Model availability, capabilities, and pricing evolve as providers ship new versions. Treat the values above as a snapshot — call GET /v1/models for the authoritative, current list, and rely on model: "auto" to keep adopting better models without code changes.
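
To feed a model picker from the list endpoint, you mainly need the slugs. The sketch below assumes the response follows the OpenAI list shape, an object with a data array whose entries carry an id. This page does not show the actual body, so treat that shape and the helper name modelSlugs as assumptions.

```javascript
// Extract pinnable slugs from an OpenAI-shaped list response, sorted for display.
// Returns an empty array when the response does not have the expected shape.
function modelSlugs(listResponse) {
  if (!listResponse || !Array.isArray(listResponse.data)) return [];
  return listResponse.data
    .map((m) => m.id)
    .filter((id) => typeof id === "string")
    .sort();
}
```

The returned array can be rendered directly as the options of a dropdown, alongside the policy names such as "auto".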

Chat Completions

Last updated June 29, 2026

Chat Completions is the core of the ComputeBoard API. It is fully compatible with the OpenAI Chat Completions schema, so any existing OpenAI SDK or integration works by changing only the base URL and API key. Send a list of messages, get a model reply — and let the router pick the best model for each request.

POST /v1/chat/completions

Every response is the standard OpenAI shape, plus one extra computeboard object that reports which model actually handled the request, the baseline it was compared against, the observed latency, and how much you saved.

Request

Set model to "auto" to let the router choose, and pass an array of messages. The examples below are identical across transports.

request.sh
    curl https://api.computeboard.xyz/v1/chat/completions \
      -H "Authorization: Bearer ck_live_..." \
      -H "Content-Type: application/json" \
      -d '{
        "model": "auto",
        "messages": [
          { "role": "user", "content": "Explain quantum entanglement in one sentence." }
        ]
      }'

Parameters

model (string, optional)
Routing policy or model slug. Defaults to "auto". Accepts the policies "auto", "cheapest", "fastest", "best", or a specific slug such as "gpt-4.1" to bypass routing.

messages (array, required)
The conversation so far, as a list of message objects. Each has a role ("system", "user", "assistant", or "tool") and a content string (or content-part array for vision-capable models).

stream (boolean, optional)
When true, partial deltas are sent as Server-Sent Events instead of a single response. Defaults to false. See the Streaming guide.

max_tokens (integer, optional)
Maximum number of tokens to generate in the completion. The combined prompt and completion length must fit the routed model's context window.

temperature (number, optional)
Sampling temperature between 0 and 2. Higher values (e.g. 0.8) make output more random; lower values (e.g. 0.2) make it more focused and deterministic. Defaults to 1.

top_p (number, optional)
Nucleus sampling: the model considers only the tokens within the top top_p probability mass. For example, 0.1 means only the tokens that make up the top 10% of probability mass are considered. Use this or temperature, not both. Defaults to 1.

stop (string | string[], optional)
Up to four sequences where generation stops. The returned text does not contain the stop sequence.

n (integer, optional)
Number of completion choices to generate for each prompt. Defaults to 1.

presence_penalty (number, optional)
Number between -2 and 2. Positive values penalize tokens that have already appeared, encouraging the model to talk about new topics. Defaults to 0.

frequency_penalty (number, optional)
Number between -2 and 2. Positive values penalize tokens by their existing frequency, reducing verbatim repetition. Defaults to 0.

tools (array, optional)
A list of tools (functions) the model may call. Normalized across providers, so the same tool schema works on every function-capable model. Pair with tool_choice to control invocation.

response_format (object, optional)
Set to { type: "json_object" } to constrain the model to emit valid JSON.

user (string, optional)
A stable identifier for your end user. Useful for abuse monitoring and per-user analytics.
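
Several of these parameters have documented ranges, so it can be useful to check a request body locally before sending it. The following is a minimal sketch covering only the limits stated above. The function name validateChatRequest is hypothetical, and the server's own validation may differ.

```javascript
// Check a chat completion body against the documented ranges and return
// a list of problems (empty when nothing looks wrong).
function validateChatRequest(body) {
  const problems = [];
  const inRange = (v, lo, hi) => typeof v === "number" && v >= lo && v <= hi;

  // messages is the only required field.
  if (!Array.isArray(body.messages)) {
    problems.push("messages is required and must be an array");
  }
  if (body.temperature !== undefined && !inRange(body.temperature, 0, 2)) {
    problems.push("temperature must be between 0 and 2");
  }
  for (const name of ["presence_penalty", "frequency_penalty"]) {
    if (body[name] !== undefined && !inRange(body[name], -2, 2)) {
      problems.push(name + " must be between -2 and 2");
    }
  }
  // stop accepts a single string or an array of up to four strings.
  if (Array.isArray(body.stop) && body.stop.length > 4) {
    problems.push("stop accepts at most four sequences");
  }
  return problems;
}
```

An empty array means none of the checked limits were violated. It does not guarantee the request will be accepted.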

Response

A successful request returns a chat completion object. It matches the OpenAI schema field-for-field, with one addition: the computeboard meta object. Note that model reflects the model the router actually chose — here, gpt-4.1.

response.json
    {
      "id": "chatcmpl_8f3a1c9e2b",
      "object": "chat.completion",
      "created": 1771200000,
      "model": "gpt-4.1",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Quantum entanglement is when two particles share a single state, so measuring one instantly determines the other no matter how far apart they are."
          },
          "finish_reason": "stop"
        }
      ],
      "usage": {

Response fields

id (string, optional)
Unique identifier for the chat completion.

object (string, optional)
Always "chat.completion" for a non-streamed response.

created (integer, optional)
Unix timestamp (seconds) of when the completion was created.

model (string, optional)
The slug of the model the router actually selected and ran. This may differ from the policy you requested (e.g. you sent "auto" and got "gpt-4.1").

choices (array, optional)
The generated completions. Each entry has an index, a message object (with role and content, plus tool_calls when tools are used), and a finish_reason.

choices[].finish_reason (string, optional)
Why generation stopped: "stop" (natural end or stop sequence), "length" (hit max_tokens), "tool_calls", or "content_filter".

usage (object, optional)
Token accounting: prompt_tokens, completion_tokens, and total_tokens.

computeboard (object, optional)
Routing metadata unique to ComputeBoard: routed_to (the chosen model), baseline (the reference model used for savings), latency_ms, cost and baseline_cost in USD, and saved_pct (the percentage saved versus the baseline).
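
The cost fields make it straightforward to track savings across many requests. The sketch below assumes saved_pct is the difference between baseline_cost and cost, taken relative to baseline_cost. That formula is an inference from the field descriptions, not a documented definition, and the helper names are hypothetical.

```javascript
// Assumed formula: percentage saved relative to the baseline cost.
function savedPct(cost, baselineCost) {
  if (!(baselineCost > 0)) return 0; // avoid dividing by zero
  return ((baselineCost - cost) / baselineCost) * 100;
}

// Sum cost and baseline_cost over the computeboard blocks of many responses.
function totalSavings(responses) {
  let cost = 0;
  let baseline = 0;
  for (const r of responses) {
    cost += r.computeboard.cost;
    baseline += r.computeboard.baseline_cost;
  }
  return { cost, baseline, savedPct: savedPct(cost, baseline) };
}
```

Aggregating the raw costs and computing the percentage once gives a spend-weighted figure, which is usually more meaningful than averaging per-request saved_pct values.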

Examples

A multi-turn conversation with a system prompt. The system message sets behavior; the remaining messages are the dialogue history.

multi-turn.js
    const completion = await client.chat.completions.create({
      model: "auto",
      messages: [
        { role: "system", content: "You are a terse assistant. Answer in one line." },
        { role: "user", content: "What is the capital of France?" },
        { role: "assistant", content: "Paris." },
        { role: "user", content: "And its population?" },
      ],
    });

    console.log(completion.choices[0].message.content);
    console.log("Handled by:", completion.computeboard.routed_to);

Forcing a routing policy. Here we ask the router to optimize purely for cost with model: "cheapest".

cheapest.py
    completion = client.chat.completions.create(
        model="cheapest",
        messages=[
            {"role": "user", "content": "Summarize this changelog in 3 bullet points."},
        ],
        max_tokens=200,
    )

    print(completion.choices[0].message.content)
    print("Saved:", completion.computeboard.saved_pct, "%")

Tip
For token-by-token output and lower time-to-first-token, set stream: true and read the response as Server-Sent Events — see the Streaming guide. To understand exactly how a model is chosen for each request, see Routing.

Streaming

Last updated June 29, 2026

Streaming lets you display a model's response as it is generated, token by token, instead of waiting for the full completion. ComputeBoard streams using Server-Sent Events (SSE) in the exact OpenAI chunk format, so the standard SDKs work unchanged — and the first chunk tells you which model the router selected.

Enabling streaming

Add stream: true to any chat completion request. The connection stays open and the server pushes incremental chunks as the model produces them. With the OpenAI SDK you simply iterate the returned async stream.

stream.js
    import OpenAI from "openai";

    const client = new OpenAI({
      apiKey: process.env.COMPUTEBOARD_API_KEY,
      baseURL: "https://api.computeboard.xyz/v1",
    });

    const stream = await client.chat.completions.create({
      model: "auto",
      messages: [{ role: "user", content: "Write a haiku about routing." }],
      stream: true,
    });

    for await (const chunk of stream) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }

SSE format

The raw response is a stream of newline-delimited events. Each event is a line beginning with data: followed by a JSON object of type chat.completion.chunk. The text for each step lives in choices[].delta.content. The stream is terminated by a final sentinel line, data: [DONE].

ComputeBoard sends one extra piece of information: the very first chunk carries a computeboard object with routed_to and baseline, so you know which model is answering before the first token arrives.

raw-stream.sh
    curl https://api.computeboard.xyz/v1/chat/completions \
      -H "Authorization: Bearer ck_live_..." \
      -H "Content-Type: application/json" \
      -d '{ "model": "auto", "stream": true, "messages": [{ "role": "user", "content": "Hi" }] }'

Handling chunks

If you are not using an SDK, read the response body as a stream, split on blank lines, strip the data: prefix, and parse the JSON — stopping when you reach [DONE]. Accumulate delta.content as it arrives.

parse-stream.js
    const res = await fetch("https://api.computeboard.xyz/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.COMPUTEBOARD_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "auto",
        stream: true,
        messages: [{ role: "user", content: "Stream me a sentence." }],
      }),
    });

    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    let buffer = "";
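
The parsing step described above (split on blank lines, strip the data: prefix, parse the JSON, stop at [DONE]) can be isolated in a pure function that is easy to test without a network connection. This is a sketch under stated assumptions: events are separated by "\n\n", each event has a single data: line, and the names parseSseBuffer and collectText are hypothetical.

```javascript
// Parse as many complete SSE events as the buffer currently holds.
// Returns the parsed chunks, whether [DONE] was seen, and the unconsumed tail.
function parseSseBuffer(buffer) {
  const chunks = [];
  let done = false;
  const events = buffer.split("\n\n");
  const rest = events.pop(); // the last piece may be an incomplete event

  for (const event of events) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data:")) continue;
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") {
        done = true;
        break;
      }
      chunks.push(JSON.parse(payload));
    }
    if (done) break;
  }
  return { chunks, done, rest };
}

// Concatenate the text deltas from a list of parsed chunks.
function collectText(chunks) {
  return chunks.map((c) => c.choices?.[0]?.delta?.content ?? "").join("");
}
```

In a read loop you would append each decoded network chunk to buffer, call parseSseBuffer(buffer), handle the returned chunks, and keep rest as the new buffer until done is true.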

When to stream

  • Chat UIs: render the assistant's reply progressively so users see words appear instead of a spinner.
  • Long outputs: for multi-paragraph answers, code, or documents, streaming avoids a long wall-clock wait for the full payload.
  • Lower time-to-first-token: the first visible token arrives far sooner, which makes interactive applications feel much more responsive.

For short, non-interactive calls — classification, extraction, or backend jobs where you only consume the final result — a regular (non-streamed) request is simpler and just as fast end-to-end.

Note
ComputeBoard sends the correct streaming headers and disables response buffering on its edge, so chunks are flushed to you immediately. You do not need to configure proxies or compression on your side — just read the stream as it arrives.

Embeddings

Coming soon
Last updated June 29, 2026

Embeddings turn text into dense numeric vectors that capture meaning, so you can measure how similar two pieces of text are. ComputeBoard's embeddings endpoint will be OpenAI-compatible and routed: you send one request and the router selects the best available embedding model for your input — balancing quality, cost, and dimensionality — without you having to integrate each provider yourself.

Coming soon: POST https://api.computeboard.xyz/v1/embeddings
Not yet available
The embeddings endpoint is in active development and is not live yet. The request and response shapes below describe the planned, OpenAI-compatible interface so you can design your integration ahead of launch. Fields may change before general availability — follow the Changelog for the final specification.

Overview

A single POST /v1/embeddings call will accept one string or an array of strings and return a vector for each. Because the endpoint is OpenAI-shaped, any OpenAI embeddings client will work by only changing the base URL to https://api.computeboard.xyz/v1 and using your ck_live_ key. Set model: "auto" and the router will pick the strongest embedding model that is healthy and cost-effective for your input; you may also pin a specific embedding model by slug when you need stable, reproducible vectors across a corpus.

The router treats embeddings the same way it treats chat: it filters candidates by capability (such as a required output dimension), scores the remainder on latency, cost, and availability, and dispatches to the winner — failing over automatically if a provider is degraded. Each response includes the same computeboard metadata block you get from chat completions, telling you which model produced the vectors.

Planned request & response

Send your text as input. A single request can embed one string or a batch of strings; batching is the most efficient way to embed a large corpus.

embeddings.sh
    curl https://api.computeboard.xyz/v1/embeddings \
      -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "auto",
        "input": "ComputeBoard routes each request to the best model."
      }'

The data array preserves input order, so data[i].embedding always corresponds to the i-th string you sent. The vector length is reported in computeboard.dimensions; store vectors from a single model together, since vectors from different models are not directly comparable.

Use cases

  • Semantic search — embed your documents and queries, then rank results by cosine similarity instead of brittle keyword matching.
  • Retrieval-augmented generation (RAG) — fetch the most relevant chunks for a question and pass them as context to a chat completion, grounding answers in your own data.
  • Clustering — group large sets of text by topic or intent for analytics, deduplication, or dataset curation.
  • Recommendations — surface similar items, articles, or products by finding the nearest-neighbour vectors to a reference embedding.
  • Classification & deduplication — use embedding distance as a fast, cheap signal for near-duplicate detection and lightweight zero-shot labelling.
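
Most of these use cases reduce to comparing vectors. The sketch below shows cosine similarity and a simple ranking over toy vectors. It does not call the embeddings endpoint, which is not yet live, and the document objects with id and embedding fields are illustrative assumptions.

```javascript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank documents by similarity to a query vector, most similar first.
function rankBySimilarity(query, docs) {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score);
}
```

The length check is a cheap guard against mixing vectors from different models, which are not directly comparable.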
Get notified at launch
Want embeddings the day they ship? Join the waitlist from the dashboard and watch the Changelog — both announce the endpoint, the supported models, and any final changes to the request shape.

Image Generation

Coming soon
Last updated June 29, 2026

Image generation turns a text prompt into an image. ComputeBoard's image endpoint will be OpenAI-compatible and routed across the leading image models, so a single request can reach the best generator for your prompt — balancing quality, speed, and cost — through the same key and base URL you already use for chat.

Coming soon: POST https://api.computeboard.xyz/v1/images/generations
Not yet available
Image generation is in active development and is not live yet. The shapes below describe the planned, OpenAI-compatible interface so you can plan your integration ahead of launch. Details may change before general availability — follow the Changelog for the final specification.

Overview

A single POST /v1/images/generations call will accept a text prompt and return one or more generated images. The endpoint is OpenAI-shaped, so any OpenAI images client works by changing the base URL to https://api.computeboard.xyz/v1 and using your ck_live_ key. With model: "auto" the router scores the available image models and dispatches to the one best suited to your prompt and requested size; you can also pin a specific image model by slug, or steer with policies like "fastest" for previews and "best" for final renders.

As with every ComputeBoard endpoint, candidates are filtered by capability, scored on latency, cost, and availability, and served with automatic failover. The response carries the standard computeboard metadata block reporting which model produced the image.

Planned request & response

Provide a prompt, the output size, and how many images to return with n. The response returns a data array of image results.

images.sh
    curl https://api.computeboard.xyz/v1/images/generations \
      -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "auto",
        "prompt": "A neon-pink GPU floating in a dark server room, retro pixel art",
        "size": "1024x1024",
        "n": 1
      }'

Each entry in data contains a url to the generated image (downloadable for a limited time after creation). When supported by the chosen model, a revised_prompt shows how the prompt was interpreted. Requesting more than one image with n returns multiple entries in the same array.

Use cases

  • Marketing & social assets — generate on-brand hero images, thumbnails, and ad creatives from a short description.
  • Product & concept design — explore visual concepts, mockups, and variations quickly before committing to a direction.
  • Illustration & editorial — produce custom artwork for articles, blog posts, and documentation without stock-photo licensing.
  • App & game content — create avatars, icons, textures, and placeholder art on demand.
  • Personalization — render unique imagery per user, prompt, or campaign at scale.
Get notified at launch
Image generation will be announced in the Changelog the moment it ships, including the supported models and any final changes to the request shape. Join the waitlist from the dashboard to hear first.

Responses API

Last updated June 29, 2026

The Responses API is a higher-level way to call ComputeBoard. Instead of assembling a message array yourself, you send a single input and get back a finished result — with the smart router, tools, and optional server-managed conversation state handled for you. It is compatible with the OpenAI Responses API, so existing Responses clients work by pointing at https://api.computeboard.xyz/v1 with your ck_live_ key.

Overview

Where Chat Completions is a stateless, message-in / message-out primitive, the Responses API is a stateful, task-oriented layer on top of it. You provide a single input (a string or a structured list), optionally attach tools, and the API runs the request through the router, selecting the best model when model is "auto", and returns a normalized result. To continue a conversation, pass the previous response's id as previous_response_id and the server reconstructs the context for you; you never have to resend the full history.

POST /v1/responses

Every response includes a flat output_text for the common case where you just want the text, the full structured output array for tool calls and richer content, token usage, and the same computeboard metadata block you get from chat completions — so you always know which model served the request.

Request

Send your task as input. With model: "auto" the router chooses the best model; you can also use a policy ("cheapest", "fastest", "best") or pin a model slug.

responses.sh
curl https://api.computeboard.xyz/v1/responses \
  -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "input": "Summarize what an AI gateway does in two sentences."
  }'

Response

The result is normalized: read output_text for the plain answer, or walk the output array when you need to inspect tool calls and message parts. The id is what you pass as previous_response_id on the next turn.

response.json
{
  "id": "resp_8x2pQ1vK4mZ",
  "object": "response",
  "created": 1751212800,
  "model": "claude-sonnet-4.5",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "An AI gateway is a single API that sits between your app and many model providers. It routes each request to the best available model, so you integrate once instead of wiring up every provider yourself."
        }
      ]
    }
  ]
  // ...output_text, usage, and computeboard fields...
}
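To illustrate walking the output array and chaining turns, here is a minimal Python sketch. It works on plain dictionaries shaped like the response above and sends nothing; the helper names are hypothetical, and the sample response is a shortened stand-in.

```python
def collect_output_text(response):
    """Join every output_text part found in the response's output array."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip tool calls and other non-message items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part["text"])
    return "".join(parts)


def follow_up_request(previous_response, user_input, model="auto"):
    """Build the body for the next turn, chaining on the previous response id."""
    return {
        "model": model,
        "input": user_input,
        "previous_response_id": previous_response["id"],
    }


# Shortened stand-in for a /v1/responses result.
response = {
    "id": "resp_8x2pQ1vK4mZ",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "An AI gateway is a single API."}],
        }
    ],
}

print(collect_output_text(response))
print(follow_up_request(response, "Now explain it to a five-year-old."))
```

In the common case the flat output_text field already holds this text; walking the array is mainly useful when a response mixes messages with tool calls.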

Chat Completions vs Responses

Both endpoints run through the same router and return the computeboard meta. The difference is the level of abstraction: Chat Completions is a stateless primitive you control fully, while Responses manages state and orchestration for you.

Aspect | Chat Completions | Responses
Endpoint | POST /v1/chat/completions | POST /v1/responses
Input | messages[] array you maintain | single input (string or list)
State | Stateless: you resend history each turn | Stateful: pass previous_response_id
Output | choices[].message.content | output[] + flat output_text
Tools | Manual: you loop tool calls yourself | Built-in tool orchestration
Best for | Full control, existing OpenAI chat code, custom agent loops | Multi-turn apps, agents, less plumbing

Which should I use?
Reach for Responses when you want the server to manage conversation state and tool calls — it is the simplest path for multi-turn assistants and agents. Stay on Chat Completions when you need precise control over the message array, are migrating existing OpenAI chat code unchanged, or run your own agent loop. Either way, keep model: "auto" to let the router optimize each request.

Routing

Last updated June 29, 2026

Routing is the core of ComputeBoard. Every request is evaluated by a smart router that picks the best model for that specific prompt — scoring each candidate on latency, cost, quality, and availability — then serves the result and tells you exactly which model handled it and what it saved. You send one request to one endpoint; the router does the rest.

How routing works

Each request flows through the same pipeline. You send a standard, OpenAI-shaped request; the router decides which model should serve it; that model responds; and you receive an OpenAI-shaped result with a computeboard metadata block describing the decision.

1. Request

Your app POSTs a request to https://api.computeboard.xyz/v1/chat/completions (or another routed endpoint) with model: "auto" or a routing policy.

2. Router

The router filters the model catalog down to eligible candidates: those that meet the request's hard requirements, such as vision input, tool/function calling, or a minimum context length.

3. Scoring

Each eligible model is scored in real time on four signals (latency, cost, quality, and availability), weighted according to the routing policy you chose.

4. Model selection

The highest-scoring healthy model wins. If it is unavailable or rate-limited at dispatch time, the router automatically falls back to the next-best candidate.

5. Response

The chosen model serves the request and ComputeBoard returns an OpenAI-shaped response, adding a computeboard block with routed_to, baseline, and saved_pct.

Scoring

For every eligible model, the router computes a live score from four signals. The relative weight of each signal shifts with the routing policy — for example "cheapest" weights cost most heavily, while "best" weights quality.

Signal | Effect on the score
Latency | Measured time-to-first-token and total response time; faster models score higher, keeping interactive workloads responsive.
Cost | Live per-token input/output pricing; cheaper models score higher so you don't overpay for prompts a smaller model handles well.
Quality | Task-fit signals (reasoning, coding, vision, long-context); models strong enough for the prompt score higher.
Availability | Real-time provider health and remaining capacity; degraded or rate-limited models are penalized or excluded.
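The filter-then-score idea can be sketched in a few lines of Python. This is an illustration only: the weights, model names, and signal values are invented for the example, the signals are assumed to be normalized to 0..1 with higher being better, and the real router's formula is not published here.

```python
from dataclasses import dataclass

# Hypothetical policy weights; the real router's weights are not documented.
POLICY_WEIGHTS = {
    "auto":     {"latency": 0.25, "cost": 0.25, "quality": 0.25, "availability": 0.25},
    "cheapest": {"latency": 0.10, "cost": 0.60, "quality": 0.15, "availability": 0.15},
    "fastest":  {"latency": 0.60, "cost": 0.10, "quality": 0.15, "availability": 0.15},
    "best":     {"latency": 0.10, "cost": 0.10, "quality": 0.65, "availability": 0.15},
}


@dataclass
class Candidate:
    slug: str
    capabilities: frozenset  # e.g. {"vision", "tools"}
    signals: dict            # each signal normalized to 0..1, higher is better


def eligible(candidates, required):
    """Keep only models that meet every hard requirement of the request."""
    return [c for c in candidates if required <= c.capabilities]


def score(candidate, policy):
    """Weighted sum of the four signals under the chosen policy."""
    weights = POLICY_WEIGHTS[policy]
    return sum(weights[name] * candidate.signals[name] for name in weights)


def rank(candidates, required, policy="auto"):
    """Return eligible model slugs, best score first."""
    pool = eligible(candidates, required)
    return [c.slug for c in sorted(pool, key=lambda c: score(c, policy), reverse=True)]


# Made-up catalog for the example.
catalog = [
    Candidate("small-model", frozenset({"tools"}),
              {"latency": 0.9, "cost": 0.95, "quality": 0.5, "availability": 0.9}),
    Candidate("frontier-model", frozenset({"tools", "vision"}),
              {"latency": 0.5, "cost": 0.2, "quality": 0.95, "availability": 0.9}),
]

print(rank(catalog, frozenset({"tools"}), "cheapest"))
print(rank(catalog, frozenset({"tools"}), "best"))
print(rank(catalog, frozenset({"vision"}), "cheapest"))
```

With these made-up numbers, "cheapest" ranks the small model first, "best" ranks the frontier model first, and a vision requirement removes the small model from the pool regardless of policy.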

Routing policies

The model field controls how the router weights those signals. Use a policy keyword to optimize for a goal, or pass an exact model slug to pin one model and bypass routing entirely.

model value | Optimizes for
"auto" | Balanced: best overall trade-off of quality, cost, speed, and reliability. The recommended default.
"cheapest" | Lowest cost among models that still clear the quality bar for the request.
"fastest" | Lowest latency: time-to-first-token and total response time.
"best" | Highest quality: the most capable model for the task, cost aside.
"<model-slug>" | Pins one specific model (e.g. claude-sonnet-4.5). No routing; used as-is, with failover only if it is down.
policies.json
// Balanced default: let the router decide
{ "model": "auto", "messages": [/* ... */] }

// Optimize for cost on bulk / background work
{ "model": "cheapest", "messages": [/* ... */] }

// Optimize for latency on interactive UX
{ "model": "fastest", "messages": [/* ... */] }

// Optimize for quality on hard reasoning tasks
{ "model": "best", "messages": [/* ... */] }

// Pin a specific model: bypass routing
{ "model": "claude-sonnet-4.5", "messages": [/* ... */] }

Savings

Every routed response reports how much it saved versus always calling a fixed premium model. The computeboard.baseline is that reference frontier model, and saved_pct is the percentage cheaper the routed model was for this request. When the router answers a simple prompt with a small, capable model, the savings are large; when a request genuinely needs a frontier model, the router uses one and saved_pct approaches zero.

meta.json
{
  // ...standard OpenAI chat completion fields...
  "model": "claude-haiku-4.5",
  "computeboard": {
    "routed_to": "claude-haiku-4.5",  // the model that served the request
    "baseline": "gpt-5",              // the premium model compared against
    "saved_pct": 92.4                 // % cheaper than the baseline for this request
  }
}
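One plausible reading of "percentage cheaper" is the relative cost difference for the same token counts. The sketch below works under that assumption; the formula is an interpretation rather than a documented definition, and the per-million-token prices are invented for the arithmetic.

```python
def request_cost(prompt_tokens, completion_tokens, input_price, output_price):
    """Cost of one request, given prices per million tokens (hypothetical units)."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


def saved_pct(routed_cost, baseline_cost):
    """Percentage cheaper the routed model was than the baseline for one request."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return round((1 - routed_cost / baseline_cost) * 100, 1)


# Invented prices: the routed model costs one tenth of the baseline per token.
routed = request_cost(18, 31, input_price=1.0, output_price=5.0)
baseline = request_cost(18, 31, input_price=10.0, output_price=50.0)

print(saved_pct(routed, baseline))
```

Under this reading, a routed model that costs the same as the baseline yields 0, which matches the statement that saved_pct approaches zero when a request genuinely needs a frontier model.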

Fallbacks

Routing is also your reliability layer. The score already accounts for availability, but if the chosen model becomes unavailable, rate-limited, or errors at dispatch time, the router automatically retries with the next-best eligible model — so a single provider outage does not surface as an error to your users.

Automatic failover
You do not configure fallbacks. When a model is down or degraded, the router picks the next-highest-scoring healthy candidate transparently, and computeboard.routed_to reflects the model that actually served the request. Pinning an exact slug disables routing, but failover still applies if that one model is unreachable.
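Failover happens inside the router, so there is nothing to write on your side. Purely to make the behavior concrete, a simplified sketch of "try the next-best candidate" might look like this; the exception type, function names, and model slugs are all hypothetical.

```python
class ModelUnavailable(Exception):
    """Raised by a dispatch attempt when a model is down or rate-limited."""


def dispatch_with_failover(ranked_slugs, call_model):
    """Try models in score order; return (slug, result) from the first that succeeds."""
    last_error = None
    for slug in ranked_slugs:
        try:
            return slug, call_model(slug)
        except ModelUnavailable as exc:
            last_error = exc  # move on to the next-best candidate
    raise RuntimeError("no healthy model available") from last_error


def fake_call(slug):
    # Stand-in for a provider call: pretend the top-ranked model is down.
    if slug == "model-a":
        raise ModelUnavailable(slug)
    return f"answer from {slug}"


routed_to, result = dispatch_with_failover(["model-a", "model-b", "model-c"], fake_call)
print(routed_to, result)
```

The returned slug plays the role of computeboard.routed_to: it names the model that actually answered, not the one that was tried first.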

Errors

Last updated June 29, 2026

ComputeBoard uses conventional HTTP status codes to signal the result of a request and returns an OpenAI-style JSON error body on every failure. Codes in the 2xx range indicate success, 4xx codes indicate a problem with your request (and usually contain a message explaining how to fix it), and 5xx codes indicate a transient problem on our side that is generally safe to retry.

Because the error shape matches OpenAI's, existing error-handling code written against the OpenAI SDK works unchanged. Every error includes a human-readable message, a machine-readable type, and a stable code you can branch on.

Error codes

The table below lists every status code ComputeBoard can return, what it means, and how to resolve it.

Status | Type | Meaning | How to fix
400 | invalid_request_error | The request was malformed: a missing field, an unknown parameter, or an invalid value (for example an unknown model slug). | Read the message field; it names the offending parameter. Fix the payload and resend.
401 | authentication_error | The API key is missing, malformed, or invalid. | Send a valid key as Authorization: Bearer ck_live_…. Create or rotate keys in the dashboard.
403 | permission_error | The key is valid but not permitted to perform this action (for example a restricted model or a disabled feature). | Check the key's permissions and your plan, or use a key with the required scope.
404 | not_found_error | The requested resource or endpoint does not exist. | Verify the URL and path. Chat completions live at /v1/chat/completions.
429 | rate_limit_error | You exceeded your requests-per-minute, tokens-per-minute, or monthly quota. | Back off and retry after the X-RateLimit-Reset window, or upgrade your plan.
500 | server_error | An unexpected error occurred inside ComputeBoard. The router could not complete the request. | Retry with exponential backoff. If it persists, contact support with the request id.

Error response shape

Every error response is a JSON object with a single top-level error key. The HTTP status code and the error.code field always agree, so you can branch on either.

error.json
{
  "error": {
    "message": "Incorrect API key provided. You can find your key in the dashboard.",
    "type": "authentication_error",
    "code": "invalid_api_key"
  }
}

For requests that pass through the router, a unique request identifier is returned in the x-request-id response header. Include it when contacting support — it lets us trace the exact request through the routing pipeline.

Handling errors

Inspect the HTTP status and the error.code to decide whether to fix the request, re-authenticate, or retry. The OpenAI SDKs throw typed exceptions you can catch directly.

handle-errors.js
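import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

try {
  const res = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: "Hello" }],
  });
  console.log(res.choices[0].message.content);
} catch (err) {
  // The OpenAI SDK surfaces status + the parsed error body.
  if (err.status === 401) {
    // Missing or invalid key: fix the key, do not retry.
    console.error("Authentication failed:", err.message);
  } else if (err.status === 429 || err.status >= 500) {
    // Transient: safe to retry with backoff.
    console.error("Transient error, retry later:", err.message);
  } else {
    // 400, 403, 404: fix the request before resending.
    throw err;
  }
}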
Retry 429 and 5xx with backoff
Treat 429 and 5xx responses as transient. Retry them with exponential backoff and jitter — for example wait 1s, then 2s, then 4s — and give up after a few attempts. Never retry 400, 401, or 403; those will keep failing until you fix the request or key.
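The retry rule above can be captured in two small helpers. This is a sketch with hypothetical names: one function decides whether a status is worth retrying, and the other produces the 1s, 2s, 4s schedule with a little jitter added to each delay.

```python
import random


def is_retryable(status):
    """429 and 5xx are transient; 400, 401, 403, and 404 need a fix first."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(attempts=3, base=1.0, jitter=0.25, rng=random.random):
    """Yield base, 2*base, 4*base, ... seconds, each stretched by up to `jitter` (25%)."""
    for attempt in range(attempts):
        delay = base * (2 ** attempt)
        yield delay * (1 + jitter * rng())


for status in (400, 401, 429, 500, 503):
    print(status, "retry" if is_retryable(status) else "fix the request")

print([round(d, 2) for d in backoff_delays()])
```

Passing rng as a parameter keeps the schedule deterministic in tests while still randomizing delays in production.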

Rate Limits

Last updated June 29, 2026

Rate limits protect the platform and keep latency predictable for everyone. ComputeBoard meters three independent dimensions — requests per minute, tokens per minute, and a monthly token quota — and tells you exactly where you stand on every response through a set of X-RateLimit-* headers.

Limits

Your account is bound by three limits, evaluated together. Whichever you hit first applies:

Limit | What it counts
RPM | Requests per minute: the number of API calls you can make in a rolling 60-second window, regardless of size.
TPM | Tokens per minute: the total prompt + completion tokens you can process in a rolling 60-second window.
Quota | A monthly cap on total tokens (or spend) for the account. Resets at the start of each billing period.

Limits scale with your plan. The figures below are representative starting points — your live limits are always shown on the Usage page in the dashboard.

Plan | RPM | TPM | Monthly quota
Free | 60 | 60,000 | 2,000,000 tokens
Pro | 600 | 1,000,000 | 100,000,000 tokens
Enterprise | Custom | Custom | Unlimited / negotiated
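To see which limit binds first for a given workload, a little arithmetic helps. The sketch below uses the representative Free and Pro figures from the table (your live limits may differ) and assumes a steady average request size; the function names are hypothetical.

```python
# Representative figures from the table above; check the dashboard for live limits.
PLANS = {
    "free": {"rpm": 60, "tpm": 60_000, "monthly_tokens": 2_000_000},
    "pro": {"rpm": 600, "tpm": 1_000_000, "monthly_tokens": 100_000_000},
}


def binding_limit(plan, avg_tokens_per_request):
    """Return (max sustained requests per minute, which limit binds first)."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    limits = PLANS[plan]
    by_tokens = limits["tpm"] // avg_tokens_per_request
    if by_tokens < limits["rpm"]:
        return by_tokens, "TPM"
    return limits["rpm"], "RPM"


def minutes_until_quota(plan, avg_tokens_per_request):
    """Minutes of sustained maximum traffic before the monthly quota runs out."""
    per_minute, _ = binding_limit(plan, avg_tokens_per_request)
    if per_minute == 0:
        return float("inf")  # a single request already exceeds the TPM limit
    return PLANS[plan]["monthly_tokens"] / (per_minute * avg_tokens_per_request)


print(binding_limit("free", 2000))
print(binding_limit("free", 500))
print(round(minutes_until_quota("free", 500), 2))
```

On the Free figures, requests averaging 2,000 tokens are capped by TPM at 30 per minute, while requests averaging 500 tokens are capped by RPM at 60 per minute.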

Rate limit headers

Every response includes headers describing your current limit and remaining budget for the window. Read them to pace your traffic before you hit a 429.

Header | Description
X-RateLimit-Limit | The maximum number of requests permitted in the current window.
X-RateLimit-Remaining | The number of requests remaining in the current window.
X-RateLimit-Reset | When the window resets and your budget refills, given as Unix epoch seconds (or as seconds remaining).
Response headers
HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 598
X-RateLimit-Reset: 1751212860
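A client can read these headers to decide how long to pause. The following is a minimal sketch with hypothetical helper names. Because the reset value may be either an epoch timestamp or a number of seconds remaining, it assumes that any value above one billion is an epoch timestamp.

```python
import time


def parse_rate_limit(headers):
    """Read the three X-RateLimit-* headers into integers (names are case-insensitive)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "limit": int(lowered["x-ratelimit-limit"]),
        "remaining": int(lowered["x-ratelimit-remaining"]),
        "reset": int(lowered["x-ratelimit-reset"]),
    }


def seconds_until_reset(reset, now=None):
    """Treat large values as Unix epoch seconds and small ones as seconds remaining."""
    now = time.time() if now is None else now
    # Assumption: anything above ~1e9 is an epoch timestamp, not a duration.
    wait = reset - now if reset > 1_000_000_000 else reset
    return max(0.0, float(wait))


headers = {
    "Content-Type": "application/json",
    "X-RateLimit-Limit": "600",
    "X-RateLimit-Remaining": "598",
    "X-RateLimit-Reset": "1751212860",
}

info = parse_rate_limit(headers)
print(info)
print(seconds_until_reset(info["reset"], now=1751212800))
```

A pacing loop could sleep for this many seconds whenever remaining reaches zero, rather than waiting for a 429.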

Handling 429

When you exceed a limit, ComputeBoard returns 429 with a rate_limit_error. The correct response is to wait until the reset window and retry with exponential backoff and jitter, so a burst of clients does not all retry at the same instant.

backoff.js
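import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function withBackoff(fn, { retries = 5, base = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable = err.status === 429 || err.status >= 500;
      if (!retryable || attempt >= retries) throw err;
      // Exponential backoff with jitter: base * 2^attempt, plus up to one extra base.
      const delay = base * 2 ** attempt + Math.random() * base;
      await sleep(delay);
    }
  }
}

// Usage: wrap any SDK call.
const res = await withBackoff(() =>
  client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: "Hello" }],
  })
);
console.log(res.choices[0].message.content);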
Need higher limits?
Limits are tied to your plan. To raise your RPM, TPM, or monthly quota, upgrade on the dashboard or contact us about an Enterprise plan for custom and negotiated limits.

SDK

Last updated June 29, 2026

ComputeBoard does not need a bespoke SDK. Because the API is OpenAI-compatible, any OpenAI client library — official or community — works out of the box. Install the SDK for your language, point its base_url at https://api.computeboard.xyz/v1, and use a ck_live_ key. That is the only change.

Install

Install the official OpenAI SDK for your language:

JavaScript
npm install openai

Usage

Configure the client with the ComputeBoard base URL and your key, then call chat completions with model: "auto" to let the router choose. The request and response are the standard OpenAI shape.

client.js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "ck_live_xxxxxxxxxxxxxxxxxxxxxxxx",
  baseURL: "https://api.computeboard.xyz/v1",
});

const res = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Hello from ComputeBoard" }],
});

console.log(res.choices[0].message.content);

Configuration

Only these settings differ from a default OpenAI client. Everything else is the SDK default.

Setting | Value | Notes
base_url | https://api.computeboard.xyz/v1 | Required. Routes all requests through ComputeBoard.
api_key | ck_live_… | Required. Sent as Authorization: Bearer. Create one in the dashboard.
timeout | 60s (recommended) | Raise for long generations or large reasoning prompts; the SDK default may be short.
Same methods you already use
Because the surface is the OpenAI API, the request and response objects are identical — see Chat Completions for the full parameter and response reference, including the extra computeboard routing metadata.

Examples

Last updated June 29, 2026

Practical, copy-pasteable recipes for common workloads. Every example runs against https://api.computeboard.xyz/v1 with a ck_live_ key and uses the smart router — either "auto" or an explicit class like "best" — so you reach the right model without hard-coding one.

Chatbot

Hold a multi-turn conversation by sending the full message history each turn. Keep the running array of messages and append the assistant's reply before the next user turn.

chatbot.js
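import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

// Conversation state: persists across turns.
const messages = [
  { role: "system", content: "You are a concise, friendly support assistant." },
];

async function ask(userText) {
  messages.push({ role: "user", content: userText });

  const res = await client.chat.completions.create({
    model: "auto",
    messages,
  });

  // Append the assistant's reply so the next turn sees the full history.
  const reply = res.choices[0].message.content;
  messages.push({ role: "assistant", content: reply });
  return reply;
}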

Summarization

Condense a long document into a few bullet points. Put the instruction in the system message and the source text in the user message; "auto" will pick a long-context model when the input is large.

summarize.py
from openai import OpenAI

client = OpenAI(
    api_key="ck_live_xxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.computeboard.xyz/v1",
)

with open("report.txt", "r", encoding="utf-8") as f:
    document = f.read()

res = client.chat.completions.create(
    model="auto",
    messages=[
        {
            "role": "system",
            "content": "Summarize the user's document into 5 concise bullet points. "
                       "Preserve key numbers and names.",
        },
        {"role": "user", "content": document},
    ],
)

print(res.choices[0].message.content)

Translation

Use a system prompt to fix the target language and tone, then pass the text to translate. This keeps the instruction separate from user-supplied content.

translate.js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

async function translate(text, targetLanguage) {
  const res = await client.chat.completions.create({
    model: "auto",
    messages: [
      {
        role: "system",
        content:
          `You are a professional translator. Translate the user's text into ${targetLanguage}. ` +
          "Return only the translation, preserving formatting and proper nouns.",
      },
      { role: "user", content: text },
    ],
  });
  return res.choices[0].message.content;
}

console.log(await translate("Smart routing keeps your costs low.", "Japanese"));

Coding

For code generation, route to "best" so the request lands on a frontier coding model. Function and tool calling work exactly as in the OpenAI API.

codegen.py
from openai import OpenAI

client = OpenAI(
    api_key="ck_live_xxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.computeboard.xyz/v1",
)

res = client.chat.completions.create(
    model="best",  # frontier-class for hard code generation
    messages=[
        {
            "role": "system",
            "content": "You are an expert Python engineer. Output a single, complete function "
                       "with type hints and a docstring. No prose.",
        },
        {
            "role": "user",
            "content": "Write a function that merges two sorted lists into one sorted list "
                       "in O(n) time without using sorted().",
        },
    ],
)

print(res.choices[0].message.content)

Reasoning

Hard multi-step problems benefit from a strong model. Route to "best" and ask the model to work through the problem before giving the final answer.

reasoning.js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

const res = await client.chat.completions.create({
  model: "best", // route to a frontier reasoning model
  messages: [
    {
      role: "user",
      content:
        "A train leaves City A at 9:00 traveling 60 km/h. Another leaves City B, " +
        "300 km away, at 9:30 traveling 90 km/h toward A. At what clock time do they meet? " +
        "Reason step by step, then give the final time on its own line.",
    },
  ],
});

console.log(res.choices[0].message.content);

Vision

Send an image alongside text by using the structured content array with an image_url part. The router selects a vision-capable model automatically; only models that support images will be considered.

vision.js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

const res = await client.chat.completions.create({
  model: "auto", // router restricts to vision-capable models for image input
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image? Describe it in one sentence." },
        {
          type: "image_url",
          // Placeholder URL; replace with a publicly reachable image.
          image_url: { url: "https://example.com/photo.jpg" },
        },
      ],
    },
  ],
});

console.log(res.choices[0].message.content);

Webhooks

Coming soon
Last updated June 29, 2026

Webhooks let ComputeBoard push events to your server the moment they happen — usage thresholds, key lifecycle changes, and completed requests — so you can react without polling. This feature is in active development and is not yet available.

Coming soon
Webhooks are on the roadmap and not live yet. The shapes documented below are a preview of the planned API and may change before launch. Follow the Changelog for the release announcement.

Overview

Once available, you will register one or more endpoint URLs in the dashboard and subscribe each to the event types you care about. ComputeBoard will deliver a signed JSON payload over HTTPS for every matching event. Planned event types include:

  • usage.threshold — fired when your spend or token usage crosses a configured percentage of your monthly quota (for example 75% or 90%), so you can alert or throttle before hitting the cap.
  • key.created — fired when a new API key is created on the account, for audit and security automation.
  • request.completed — fired after a completion finishes, carrying routing and usage metadata (which model served it, tokens, latency, and savings) for downstream analytics.

Planned payload

Each delivery will be a JSON object with a stable envelope — an event id, type, created timestamp, and a data object whose contents depend on the event type.

request.completed.json
{
  "id": "evt_3sK1pZ9aVbN",
  "type": "request.completed",
  "created": 1751212800,
  "data": {
    "request_id": "chatcmpl_8x2pQ1vK4mZ",
    "routed_to": "claude-haiku-4.5",
    "baseline": "gpt-5",
    "saved_pct": 92.4,
    "usage": {
      "prompt_tokens": 18,
      "completion_tokens": 31,
      "total_tokens": 49
    },
    "latency_ms": 412
  }
}

Verifying signatures

Every delivery will be signed so you can confirm it genuinely came from ComputeBoard and was not tampered with. The plan is an HMAC-SHA256 signature of the raw request body, keyed with your endpoint's signing secret and sent in an X-ComputeBoard-Signature header. Verify it by recomputing the HMAC over the exact bytes you received and comparing with a constant-time check before trusting the payload.

verify.js
import crypto from "node:crypto";

// Planned verification: header and algorithm subject to change before launch.
function verifyWebhook(rawBody, signatureHeader, signingSecret) {
  const expected = crypto
    .createHmac("sha256", signingSecret)
    .update(rawBody, "utf8")
    .digest("hex");

  // constant-time comparison guards against timing attacks
  const a = Buffer.from(signatureHeader);
  const b = Buffer.from(expected);
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// In an Express handler, use the RAW body (not the parsed JSON):
//   const ok = verifyWebhook(req.rawBody, req.header("X-ComputeBoard-Signature"), SECRET);
//   if (!ok) return res.status(400).send("invalid signature");
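For Python services, the same planned check might look like the sketch below. It relies on the same assumptions as the preview above (a hex-encoded HMAC-SHA256 of the raw body, delivered in the X-ComputeBoard-Signature header), any of which may change before launch. The function names are hypothetical.

```python
import hashlib
import hmac


def sign(raw_body: bytes, signing_secret: str) -> str:
    """Hex HMAC-SHA256 of the raw body, keyed with the endpoint's signing secret."""
    return hmac.new(signing_secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()


def verify_webhook(raw_body: bytes, signature_header: str, signing_secret: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    if not signature_header:
        return False  # missing header: reject without computing anything
    expected = sign(raw_body, signing_secret)
    # Compare as bytes so non-ASCII header values cannot raise a TypeError.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))


body = b'{"id": "evt_3sK1pZ9aVbN", "type": "request.completed"}'
secret = "whsec_example"  # placeholder secret for the demo
signature = sign(body, secret)

print(verify_webhook(body, signature, secret))
print(verify_webhook(body + b" ", signature, secret))
```

As in the Node version, the check must run on the exact bytes received; re-serializing parsed JSON can change whitespace or key order and invalidate the signature.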
Get notified at launch
We will announce webhooks — including the final event catalog, retry policy, and signing scheme — in the Changelog. Until then, you can poll the Usage endpoint in the dashboard for the same data.

FAQ

Last updated June 29, 2026

Answers to the questions we hear most often about ComputeBoard — compatibility, routing, pricing, privacy, and the feature surface. If something is not covered here, reach out from the dashboard.

Is it really OpenAI-compatible?

Yes. ComputeBoard implements the OpenAI Chat Completions API exactly, including streaming, function and tool calling, and the standard request and response objects. You use the official OpenAI SDKs unchanged — the only difference is the base URL (https://api.computeboard.xyz/v1) and your ck_live_ key. The one addition is a small computeboard object on each response describing how the request was routed; it is purely additive and safe to ignore.

How does routing pick a model?

When you send model: "auto", the router first filters to models that can actually serve the request — matching required capabilities like vision, tool calling, and context length — then scores each candidate on four live signals: latency, cost, availability, and performance for the task. The highest-scoring healthy model wins. You can also bias the decision with the class shortcuts "cheapest", "fastest", or "best".

Can I pin a specific model?

Absolutely. Pass an exact model slug (for example claude-haiku-4.5 or gpt-5) as the model parameter and ComputeBoard sends the request straight to that model with no routing. This is useful when you need deterministic behavior, reproducibility, or a model with a specific capability. You can mix pinned and routed requests freely.

How much can I save?

It depends on your traffic. Many requests do not need a frontier model, and routing those to a cheaper-but-capable model can cut spend dramatically — often 50–90% on the eligible portion of traffic. Each response reports a saved_pct versus a fixed-frontier baseline, and the dashboard aggregates total savings over time so you can measure the real number for your workload rather than rely on an estimate.

Do you store my prompts or data?

ComputeBoard does not train on your data or sell it. Prompts and completions are processed to serve the request and to compute usage and routing metadata; we retain only what is needed to operate the service, meter billing, and provide analytics. We do not use your content to improve models. Enterprise plans support custom retention and data-handling terms.

What about latency overhead?

The routing decision is computed from pre-aggregated, continuously updated signals, so it adds only a few milliseconds before dispatch — negligible next to model inference time. In practice ComputeBoard often reduces end-to-end latency, because it steers around slow or degraded providers and can prefer a faster model when you route with "fastest" or "auto".

Which SDKs work?

Any OpenAI-compatible client. That includes the official OpenAI SDKs for JavaScript/TypeScript, Python, and Go, community libraries such as async-openai for Rust, and frameworks like LangChain, LlamaIndex, and the Vercel AI SDK that accept a custom base URL. If a tool can talk to OpenAI, it can talk to ComputeBoard.

How are tokens and billing counted?

Billing is metered on prompt and completion tokens, the same usage object the OpenAI API returns. Because routing may select a cheaper model, your effective cost per request is frequently lower than always using a frontier model. Every response includes a usage block, and the dashboard shows per-day, per-model, and per-key breakdowns so you can attribute spend precisely.

What happens if a model is down?

The router tracks provider health in real time. If a model is rate-limited, slow, or unavailable, it is scored down or excluded, and the request automatically fails over to the next-best healthy model instead of returning an error. This built-in redundancy is one of the main reasons teams put ComputeBoard in front of their model calls.

Do you support streaming, function calling, and vision?

Yes to all three. Streaming works via Server-Sent Events exactly as in the OpenAI API (set stream: true). Function and tool calling are passed through to any model that supports them. Vision works by sending image content parts; the router restricts the candidate set to vision-capable models for those requests.
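If you consume the stream without an SDK, each event arrives as a data: line in the OpenAI streaming format, ending with data: [DONE]. The sketch below parses such lines from any iterable of strings; the helper name is hypothetical and the sample lines are hand-written stand-ins rather than captured output.

```python
import json


def iter_stream_text(lines):
    """Yield text deltas from OpenAI-style Server-Sent Events lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written stand-in for a streamed response.
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]

print("".join(iter_stream_text(sample_stream)))
```

The official SDKs do this parsing for you when stream is enabled, so a helper like this is only needed for raw HTTP clients.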

How do rate limits work?

Each account has a requests-per-minute (RPM) limit, a tokens-per-minute (TPM) limit, and a monthly token quota, all scaled by your plan. Every response carries X-RateLimit-* headers so you can pace traffic, and exceeding a limit returns a 429 you should retry with backoff. See the Rate Limits page for details.

Is there a free tier?

Yes. The Free plan lets you try ComputeBoard with a modest RPM/TPM and a monthly token allowance — enough to integrate, test routing, and validate savings before you upgrade. When you need higher limits, the Pro and Enterprise plans raise your RPM, TPM, and quota.


Changelog

Last updated June 29, 2026

Notable changes to the ComputeBoard API and platform. We follow semantic versioning for the API surface and announce backward-incompatible changes here in advance.

v1.0.0 — Initial Release June 29, 2026

The first public release of ComputeBoard: one OpenAI-compatible endpoint, intelligent routing across every major model, and a full dashboard. Live at https://api.computeboard.xyz.

  • OpenAI-compatible Chat Completions: POST /v1/chat/completions with the standard request and response shape, plus Server-Sent Events streaming via stream: true.
  • Smart routing — send model: "auto" to route per request on latency, cost, availability, and performance, or use the class shortcuts "cheapest", "fastest", and "best".
  • 8+ models — frontier and efficient models from leading providers, all reachable through one key, with automatic failover when a provider is degraded.
  • API keys & dashboard — create, name, rotate, and revoke ck_live_ keys; manage everything from the web dashboard.
  • Usage analytics — per-day, per-model, and per-key token and cost breakdowns, plus realized savings versus a fixed-frontier baseline.
  • GPU marketplace — the foundation for renting and offering compute capacity that backs the routing network.

Coming soon

On the near-term roadmap. Dates are not yet committed; follow this page for announcements.

  • Embeddings: POST /v1/embeddings for vector search and retrieval.
  • Image Generation — text-to-image through the same routed endpoint.
  • Responses API (GA) — the stateful Responses interface promoted to general availability.
  • Webhooks — push events for usage thresholds, key lifecycle, and completed requests.
  • More models & SDKs — an expanding model catalog and first-class SDK helpers.