ComputeBoard is a single, OpenAI-compatible endpoint that gives you access to every major AI model through one API key. Instead of integrating each provider separately and guessing which model to use, you send one request and our router picks the best model for it in real time — scoring on latency, cost, availability, and performance — then returns an OpenAI-shaped response.
One endpoint. One key. One bill. ComputeBoard sits between your application and the underlying model providers, so you can ship against a stable interface while the routing layer keeps your traffic on the fastest, cheapest, and most reliable model that meets the quality bar for each request.
What is ComputeBoard
ComputeBoard is an intelligent AI gateway. Every request to https://api.computeboard.xyz/v1/chat/completions is evaluated by a smart router that selects a model on a per-request basis. When you send model: "auto" (the default), the router weighs four criteria for the specific prompt you sent:
- Latency — measured time-to-first-token and total response time across providers, so interactive workloads stay fast.
- Cost — live per-token pricing for each candidate model, used to avoid overpaying for requests that a cheaper model can answer just as well.
- Availability — real-time provider health and capacity, with automatic failover away from rate-limited or degraded endpoints.
- Performance — task-quality signals (reasoning, coding, vision, long-context) that match the prompt to a model strong enough to handle it.
Why ComputeBoard
- One API for every model — integrate once and reach 37+ models from leading providers without writing a new client for each.
- No vendor lock-in — the interface is the standard OpenAI shape. You can pin a specific model, route to a class, or fall back to direct provider access at any time. Nothing about ComputeBoard is proprietary in your code.
- Automatic savings — by routing cheaper-but-capable models when quality allows, ComputeBoard reduces spend on requests that do not need a frontier model. Each response reports how much you saved versus a fixed-frontier baseline.
- Built-in reliability — when a provider is slow or down, the router fails over to the next-best model instead of surfacing an error to your users.
How it works
Each request flows through the same path. You send a standard chat completion; the router scores every eligible model against your prompt and current conditions; the winning model serves the request; and you receive an OpenAI-shaped response with a small computeboard metadata block that tells you exactly which model handled it and what it saved.
| Step | What happens |
|---|---|
| 1 · Request | Your app POSTs an OpenAI-shaped chat completion with model: "auto". |
| 2 · Router | Eligible models are filtered by required capabilities (vision, tools, context length). |
| 3 · Scoring | Each candidate is scored on latency, cost, availability, and performance. |
| 4 · Dispatch | The highest-scoring healthy model serves the request; failover is automatic. |
| 5 · Response | An OpenAI-shaped result returns with a computeboard meta block (routed_to, baseline, saved_pct). |
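Step 2 of the table can be pictured as a plain capability filter. The sketch below is illustrative only: the candidate objects, their field names (`vision`, `tools`, `contextLength`), the `filterEligible` helper, and the placeholder slugs are assumptions made for the example, not ComputeBoard's catalog or router code.

```javascript
// Hypothetical catalog entries; field names and values are assumptions.
const candidates = [
  { slug: "model-a", vision: true, tools: true, contextLength: 200000 },
  { slug: "model-b", vision: false, tools: true, contextLength: 128000 },
  { slug: "model-c", vision: false, tools: false, contextLength: 32000 },
];

// Keep only the models that satisfy every capability the request needs.
function filterEligible(models, needs) {
  return models.filter(
    (m) =>
      (!needs.vision || m.vision) &&
      (!needs.tools || m.tools) &&
      m.contextLength >= (needs.minContext ?? 0)
  );
}

const eligible = filterEligible(candidates, { tools: true, minContext: 100000 });
console.log(eligible.map((m) => m.slug)); // [ 'model-a', 'model-b' ]
```

Only the models that survive this filter would go on to be scored in step 3.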
Drop-in compatible
ComputeBoard speaks the OpenAI API. If you already use an OpenAI SDK, the only change you need is the base URL and your ComputeBoard key — your existing code keeps working.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY, // ck_live_...
  baseURL: "https://api.computeboard.xyz/v1", // the only change vs OpenAI
});

const res = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Hello from ComputeBoard" }],
});

console.log(res.choices[0].message.content);
```
Point your client at https://api.computeboard.xyz/v1, use your ck_live_ key, and every method you already call (chat completions, streaming, tools) works unchanged.
Go from zero to your first routed completion in under five minutes. ComputeBoard is OpenAI-compatible, so you can use the official SDKs and only change the base URL and key.
Create an API key
Open the dashboard and go to the API Keys page. Click Create key, give it a name (for example production), and copy the key that is shown. It begins with ck_live_ and is only displayed once, so store it somewhere safe, such as a secret manager or your deployment's environment variables.
Install the SDK
ComputeBoard works with the official OpenAI SDK. Install it for your language:
```bash
npm install openai
```
Make your first request
Point the client at https://api.computeboard.xyz/v1, pass your key, and send a chat completion with model: "auto" to let the router choose the best model.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY, // ck_live_...
  baseURL: "https://api.computeboard.xyz/v1",
});

const res = await client.chat.completions.create({
  model: "auto",
  messages: [
    { role: "user", content: "Explain what an AI gateway does in one sentence." },
  ],
});

console.log(res.choices[0].message.content);
console.log("routed to:", res.computeboard.routed_to);
```
Receive the response
You get back a standard OpenAI chat completion. ComputeBoard adds a single extra field, computeboard, describing which model served the request and how much it saved versus always using a frontier model.
```json
{
  "id": "chatcmpl_8x2pQ1vK4mZ",
  "object": "chat.completion",
  "created": 1751212800,
  "model": "claude-haiku-4.5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "An AI gateway is a single API that routes each request to the best available model so you don't have to integrate or choose providers yourself."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
```
Full example
Prefer to test without an SDK? This single curl command is the complete request — copy it, paste your key, and run it.
```bash
curl https://api.computeboard.xyz/v1/chat/completions \
  -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      { "role": "user", "content": "Write a haiku about smart routing." }
    ]
  }'
```
Keep model as "auto" unless you have a reason not to. The router picks the best model per request and reports its choice in the computeboard meta. You can always pin a specific model, or use "cheapest", "fastest", or "best", once you know your workload.
ComputeBoard authenticates every request with a secret API key passed as a Bearer token. There are no sessions, cookies, or signatures to manage: one header authorizes your call.
API keys
Authentication uses the standard Authorization: Bearer scheme. Create a key in the dashboard, then send it on every request to https://api.computeboard.xyz/v1. Live keys are prefixed ck_live_. Requests without a valid key are rejected with 401 Unauthorized.
```bash
curl https://api.computeboard.xyz/v1/chat/completions \
  -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{ "model": "auto", "messages": [{ "role": "user", "content": "ping" }] }'
```
When you use an OpenAI SDK, the client sets this header for you. Pass your key as apiKey (JavaScript) or api_key (Python) and point baseURL (base_url in Python) at ComputeBoard.
Keeping keys secure
An API key is a credential that can spend money and read your usage. Treat it like a password and follow these practices:
- Never expose keys in client code. Browsers, mobile apps, and any shipped frontend can be inspected — a key embedded there is effectively public.
- Call ComputeBoard from your server. Proxy requests through your own backend so the key never leaves an environment you control.
- Load keys from environment variables (for example COMPUTEBOARD_API_KEY) or a secret manager; never hard-code them in source.
- Keep keys out of version control. Add .env files to .gitignore and scan commits for accidental secrets.
- Scope one key per environment. Use separate keys for development, staging, and production so a leak in one place cannot affect the others.
- Rotate regularly, and immediately on any suspected exposure.
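Two of the practices above, loading the key from the environment and never logging it in full, can be sketched as small helpers. The names `loadApiKey` and `maskKey` are hypothetical, and the 12-character mask length is an assumption chosen to mirror the prefix style the dashboard shows; in a real server you would pass `process.env` instead of the literal object used here.

```javascript
// Read the key from an environment-like object and fail early if it is
// missing or does not look like a live ComputeBoard key.
function loadApiKey(env) {
  const key = env.COMPUTEBOARD_API_KEY;
  if (!key || !key.startsWith("ck_live_")) {
    throw new Error("COMPUTEBOARD_API_KEY is missing or does not start with ck_live_");
  }
  return key;
}

// Produce a log-safe form that keeps only the first 12 characters.
function maskKey(key) {
  return key.slice(0, 12) + "…";
}

const key = loadApiKey({
  COMPUTEBOARD_API_KEY: "ck_live_4f8c2a9d1e7b6034a5c9f12d8e0b7a36",
});
console.log(maskKey(key)); // ck_live_4f8c…
```

Failing at startup when the variable is absent tends to surface misconfiguration sooner than a 401 on the first request would.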
A ck_live_ key can make requests billed to your account. If a key is committed to a repository, posted in a chat, or shipped to a browser, revoke it in the dashboard immediately and issue a replacement. Never log full keys or include them in error reports.
Example request
A complete authenticated request in JavaScript and Python. The key is read from an environment variable so it never appears in your source code.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY, // ck_live_... from your environment
  baseURL: "https://api.computeboard.xyz/v1",
});

const res = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Authenticated request OK?" }],
});

console.log(res.choices[0].message.content);
```
API keys authorize your requests to ComputeBoard. You create, rotate, and revoke them from the dashboard, and you can hold as many as you need; one per service or environment is a good default.
Creating a key
Open the dashboard and go to API Keys, then click Create key. Give it a descriptive name so you can tell keys apart later (for example web-prod or worker-staging). The full secret is shown once, at creation time — copy it immediately into your secret manager or environment, because ComputeBoard stores only a hashed version and cannot display it again.
```bash
# Live key, shown once at creation; store it securely:
#   ck_live_4f8c2a9d1e7b6034a5c9f12d8e0b7a36

# Reference it from an environment variable, never inline
export COMPUTEBOARD_API_KEY="ck_live_4f8c2a9d1e7b6034a5c9f12d8e0b7a36"
```
After creation the dashboard only ever shows the key's prefix (for example ck_live_4f8c…) so you can identify it without revealing the secret.
| Field | Description |
|---|---|
| Name | A label you assign to identify the key (e.g. web-prod). |
| Prefix | First characters of the key, shown for identification (ck_live_4f8c…). |
| Created | Timestamp the key was issued. |
| Last used | Timestamp of the most recent request authorized by this key. |
| Status | Active or Revoked. |
Revoking a key
Revoke a key the moment it is no longer needed or you suspect it has leaked. Revocation is immediate and permanent — any request using a revoked key is rejected with 401 Unauthorized.
1. Open the API Keys page.
2. Revoke it.
3. Replace where needed.
Rotating keys
Rotation replaces a key without an interruption in service. Because ComputeBoard lets you hold multiple active keys at once, you can run the old and new keys side by side during the switch — the classic create-deploy-revoke flow, with zero downtime.
1. Create a new key.
2. Deploy the new key.
3. Verify the new key is live.
4. Revoke the old key.
Best practices
- One key per service and environment. Separate keys for each app and for dev, staging, and production limit the blast radius of a leak and make usage easy to attribute.
- Minimize exposure. Keep keys server-side, load them from environment variables or a secret manager, and never embed them in client code, logs, or version control.
- Monitor usage. Watch each key's request volume and last-used time in the dashboard; unexpected activity is an early signal of a leak or a misconfiguration.
- Rotate on a schedule — and immediately on suspicion. Rotate keys periodically as a matter of hygiene, and revoke and replace any key the instant you think it may be exposed.
ComputeBoard speaks to every major model provider through one OpenAI-compatible endpoint. Instead of integrating, billing, and maintaining a separate SDK for each vendor, you call a single API and let the router put your request in front of the right model.
The fastest way to use the catalog is not to pick a model at all. Set model: "auto" and the router scores every healthy candidate on latency, cost, availability, and measured quality for your prompt, then dispatches to the best one — typically in under a millisecond of overhead. Every response tells you exactly which model answered and how much it saved versus a fixed baseline, so you keep full visibility while the platform does the work.
Prefer to stay in control? You can pin any model by its slug, or steer the router with a high-level policy like cheapest, fastest, or best. All four behaviors share the same request and response shape, so switching is a one-line change.
Available models
These models are routable today. Capabilities such as vision and tool (function) calling are normalized across providers, so the same request works no matter where it lands.
Choosing a model
The model field on a chat completion request accepts five kinds of values. Four are routing policies that leave the choice to ComputeBoard, and the fifth is a fixed model slug that pins the request to one specific model.
| model | Behavior |
|---|---|
| "auto" | Balanced default. Scores every healthy model on latency, cost, availability, and prompt-fit, then routes to the best overall trade-off. Recommended for most workloads. |
| "cheapest" | Optimize for cost. Routes to the lowest-priced model that can satisfy the request, falling back up the price ladder only when a cheaper model is unavailable. |
| "fastest" | Optimize for latency. Routes to the model with the lowest current time-to-first-token and end-to-end latency. Ideal for interactive UIs. |
| "best" | Optimize for quality. Routes to the highest-scoring model on capability and measured output quality, regardless of price. |
| "gpt-4.1" | Pin a specific model by slug (e.g. gpt-4.1, claude-4, gemini-2.5). Routing is bypassed; the request always goes to that model. Use when you need deterministic, reproducible behavior. |
When you pin a slug and that model is temporarily unavailable, the request fails fast with a clear error rather than silently switching models — pinning means you get exactly what you asked for. Policies, by contrast, are designed to route around outages automatically.
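On the client side, the distinction between a policy keyword and a pinned slug comes down to a set lookup. The helper below is a hypothetical sketch (`describeModelValue` is not part of any SDK); it only encodes the four policy keywords listed in the table and treats every other string as a slug.

```javascript
// The four routing policies named in the table above.
const POLICIES = new Set(["auto", "cheapest", "fastest", "best"]);

// Describe how a given model value would be handled: routed or pinned.
function describeModelValue(model = "auto") {
  return POLICIES.has(model)
    ? { kind: "policy", routed: true }
    : { kind: "pinned", routed: false, slug: model };
}

console.log(describeModelValue("cheapest")); // { kind: 'policy', routed: true }
console.log(describeModelValue("gpt-4.1")); // { kind: 'pinned', routed: false, slug: 'gpt-4.1' }
```

A check like this can be handy for logging or for warning when a pinned slug is used somewhere that automatic routing was intended.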
Listing models
Fetch the live catalog at runtime with a standard, OpenAI-shaped list endpoint. The response includes every model slug you can pin, so you can build dynamic model pickers without hard-coding names.
```bash
curl https://api.computeboard.xyz/v1/models \
  -H "Authorization: Bearer ck_live_..."
```
Call GET /v1/models for the authoritative, current list, and rely on model: "auto" to keep adopting better models without code changes.
Chat Completions is the core of the ComputeBoard API. It is fully compatible with the OpenAI Chat Completions schema, so any existing OpenAI SDK or integration works by changing only the base URL and API key. Send a list of messages, get a model reply, and let the router pick the best model for each request.
Every response is the standard OpenAI shape, plus one extra computeboard object that reports which model actually handled the request, the baseline it was compared against, the observed latency, and how much you saved.
Request
Set model to "auto" to let the router choose, and pass an array of messages. The examples below are identical across transports.
```bash
curl https://api.computeboard.xyz/v1/chat/completions \
  -H "Authorization: Bearer ck_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      { "role": "user", "content": "Explain quantum entanglement in one sentence." }
    ]
  }'
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | optional | Defaults to "auto". Accepts the policies "auto", "cheapest", "fastest", "best", or a specific slug such as "gpt-4.1" to bypass routing. |
| messages | array | required | Each message has a role ("system", "user", "assistant", or "tool") and a content string (or content-part array for vision-capable models). |
| stream | boolean | optional | If true, partial deltas are sent as Server-Sent Events instead of a single response. Defaults to false. See the Streaming guide. |
| max_tokens | integer | optional | |
| temperature | number | optional | |
| top_p | number | optional | |
| stop | string \| string[] | optional | |
| n | integer | optional | |
| presence_penalty | number | optional | |
| frequency_penalty | number | optional | |
| tools | array | optional | Use tool_choice to control invocation. |
| response_format | object | optional | Use { type: "json_object" } to constrain the model to emit valid JSON. |
| user | string | optional | |
Response
A successful request returns a chat completion object. It matches the OpenAI schema field-for-field, with one addition: the computeboard meta object. Note that model reflects the model the router actually chose (here, gpt-4.1).
```json
{
  "id": "chatcmpl_8f3a1c9e2b",
  "object": "chat.completion",
  "created": 1771200000,
  "model": "gpt-4.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum entanglement is when two particles share a single state, so measuring one instantly determines the other no matter how far apart they are."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
```
Response fields
| Field | Type | Description |
|---|---|---|
| id | string | |
| object | string | "chat.completion" for a non-streamed response. |
| created | integer | |
| model | string | |
| choices | array | Each choice has an index, a message object (with role and content, plus tool_calls when tools are used), and a finish_reason. |
| choices[].finish_reason | string | "stop" (natural end or stop sequence), "length" (hit max_tokens), "tool_calls", or "content_filter". |
| usage | object | prompt_tokens, completion_tokens, and total_tokens. |
| computeboard | object | routed_to (the chosen model), baseline (the reference model used for savings), latency_ms, cost and baseline_cost in USD, and saved_pct (the percentage saved versus the baseline). |
Examples
A multi-turn conversation with a system prompt. The system message sets behavior; the remaining messages are the dialogue history.
```javascript
const completion = await client.chat.completions.create({
  model: "auto",
  messages: [
    { role: "system", content: "You are a terse assistant. Answer in one line." },
    { role: "user", content: "What is the capital of France?" },
    { role: "assistant", content: "Paris." },
    { role: "user", content: "And its population?" },
  ],
});

console.log(completion.choices[0].message.content);
console.log("Handled by:", completion.computeboard.routed_to);
```
Forcing a routing policy. Here we ask the router to optimize purely for cost with model: "cheapest".
```python
completion = client.chat.completions.create(
    model="cheapest",
    messages=[
        {"role": "user", "content": "Summarize this changelog in 3 bullet points."},
    ],
    max_tokens=200,
)

print(completion.choices[0].message.content)
print("Saved:", completion.computeboard.saved_pct, "%")
```
Set stream: true and read the response as Server-Sent Events; see the Streaming guide. To understand exactly how a model is chosen for each request, see Routing.
Streaming lets you display a model's response as it is generated, token by token, instead of waiting for the full completion. ComputeBoard streams using Server-Sent Events (SSE) in the exact OpenAI chunk format, so the standard SDKs work unchanged, and the first chunk tells you which model the router selected.
Enabling streaming
Add stream: true to any chat completion request. The connection stays open and the server pushes incremental chunks as the model produces them. With the OpenAI SDK you simply iterate the returned async stream.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

const stream = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Write a haiku about routing." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```
SSE format
The raw response is a stream of newline-delimited events. Each event is a line beginning with data: followed by a JSON object of type chat.completion.chunk. The text for each step lives in choices[].delta.content. The stream is terminated by a final sentinel line, data: [DONE].
ComputeBoard sends one extra piece of information: the very first chunk carries a computeboard object with routed_to and baseline, so you know which model is answering before the first token arrives.
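Once the data: lines have been parsed into objects, assembling the reply is a fold over the chunks. The sketch below works on an in-memory array so it runs without a network call; the chunk contents are invented for illustration, and `collectStream` is a hypothetical helper name rather than part of any SDK.

```javascript
// Example parsed chunks; the contents are made up for illustration.
const chunks = [
  {
    object: "chat.completion.chunk",
    computeboard: { routed_to: "claude-haiku-4.5", baseline: "gpt-5" },
    choices: [{ index: 0, delta: { role: "assistant" } }],
  },
  { object: "chat.completion.chunk", choices: [{ index: 0, delta: { content: "Hel" } }] },
  { object: "chat.completion.chunk", choices: [{ index: 0, delta: { content: "lo" } }] },
];

// Fold parsed chunks into the final text plus the routing metadata
// carried by the first chunk that has a computeboard object.
function collectStream(parsedChunks) {
  let text = "";
  let meta = null;
  for (const chunk of parsedChunks) {
    if (chunk.computeboard && !meta) meta = chunk.computeboard;
    text += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return { text, routedTo: meta?.routed_to ?? null };
}

console.log(collectStream(chunks)); // { text: 'Hello', routedTo: 'claude-haiku-4.5' }
```

Chunks whose delta carries no content, such as the role-only first chunk here, contribute an empty string, so the same loop handles them without a special case.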
```bash
curl https://api.computeboard.xyz/v1/chat/completions \
  -H "Authorization: Bearer ck_live_..." \
  -H "Content-Type: application/json" \
  -d '{ "model": "auto", "stream": true, "messages": [{ "role": "user", "content": "Hi" }] }'
```
Handling chunks
If you are not using an SDK, read the response body as a stream, split on blank lines, strip the data: prefix, and parse the JSON — stopping when you reach [DONE]. Accumulate delta.content as it arrives.
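```javascript
const res = await fetch("https://api.computeboard.xyz/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.COMPUTEBOARD_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "auto",
    stream: true,
    messages: [{ role: "user", content: "Stream me a sentence." }],
  }),
});

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let text = "";

// A minimal read loop following the steps described above: split complete
// events on blank lines, strip the data: prefix, and stop at [DONE].
read: while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split("\n\n");
  buffer = events.pop(); // keep any incomplete trailing event for the next read
  for (const event of events) {
    const data = event.replace(/^data: ?/, "").trim();
    if (data === "[DONE]") break read;
    if (!data) continue;
    text += JSON.parse(data).choices?.[0]?.delta?.content ?? "";
  }
}

console.log(text);
```
When to stream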
- Chat UIs: render the assistant's reply progressively so users see words appear instead of a spinner.
- Long outputs: for multi-paragraph answers, code, or documents, streaming avoids a long wall-clock wait for the full payload.
- Lower time-to-first-token: the first visible token arrives far sooner, which makes interactive applications feel dramatically more responsive.
For short, non-interactive calls — classification, extraction, or backend jobs where you only consume the final result — a regular (non-streamed) request is simpler and just as fast end-to-end.
Embeddings turn text into dense numeric vectors that capture meaning, so you can measure how similar two pieces of text are. ComputeBoard's embeddings endpoint will be OpenAI-compatible and routed: you send one request and the router selects the best available embedding model for your input — balancing quality, cost, and dimensionality — without you having to integrate each provider yourself.
Overview
A single POST /v1/embeddings call will accept one string or an array of strings and return a vector for each. Because the endpoint is OpenAI-shaped, any OpenAI embeddings client will work by only changing the base URL to https://api.computeboard.xyz/v1 and using your ck_live_ key. Set model: "auto" and the router will pick the strongest embedding model that is healthy and cost-effective for your input; you may also pin a specific embedding model by slug when you need stable, reproducible vectors across a corpus.
The router treats embeddings the same way it treats chat: it filters candidates by capability (such as a required output dimension), scores the remainder on latency, cost, and availability, and dispatches to the winner — failing over automatically if a provider is degraded. Each response includes the same computeboard metadata block you get from chat completions, telling you which model produced the vectors.
Planned request & response
Send your text as input. A single request can embed one string or a batch of strings; batching is the most efficient way to embed a large corpus.
```bash
curl https://api.computeboard.xyz/v1/embeddings \
  -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "input": "ComputeBoard routes each request to the best model."
  }'
```
The data array preserves input order, so data[i].embedding always corresponds to the i-th string you sent. The vector length is reported in computeboard.dimensions; store vectors from a single model together, since vectors from different models are not directly comparable.
Use cases
- Semantic search — embed your documents and queries, then rank results by cosine similarity instead of brittle keyword matching.
- Retrieval-augmented generation (RAG) — fetch the most relevant chunks for a question and pass them as context to a chat completion, grounding answers in your own data.
- Clustering — group large sets of text by topic or intent for analytics, deduplication, or dataset curation.
- Recommendations — surface similar items, articles, or products by finding the nearest-neighbour vectors to a reference embedding.
- Classification & deduplication — use embedding distance as a fast, cheap signal for near-duplicate detection and lightweight zero-shot labelling.
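The semantic search use case in the list above relies on cosine similarity between vectors. As a rough sketch, the helpers below rank a few documents against a query; the three-dimensional vectors and the names `cosineSimilarity` and `rankBySimilarity` are made up for the example, and real embeddings would have far more dimensions.

```javascript
// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document against the query and sort from most to least similar.
function rankBySimilarity(queryVector, docs) {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(queryVector, d.embedding) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors standing in for stored document embeddings.
const docs = [
  { id: "a", embedding: [1, 0, 0] },
  { id: "b", embedding: [0.6, 0.8, 0] },
  { id: "c", embedding: [0, 0, 1] },
];

console.log(rankBySimilarity([1, 0, 0], docs).map((r) => r.id)); // [ 'a', 'b', 'c' ]
```

For a large corpus a vector index would normally replace this linear scan, but the scoring idea is the same.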
Image Generation
Coming soon
Image generation turns a text prompt into an image. ComputeBoard's image endpoint will be OpenAI-compatible and routed across the leading image models, so a single request can reach the best generator for your prompt (balancing quality, speed, and cost) through the same key and base URL you already use for chat.
Overview
A single POST /v1/images/generations call will accept a text prompt and return one or more generated images. The endpoint is OpenAI-shaped, so any OpenAI images client works by changing the base URL to https://api.computeboard.xyz/v1 and using your ck_live_ key. With model: "auto" the router scores the available image models and dispatches to the one best suited to your prompt and requested size; you can also pin a specific image model by slug, or steer with policies like "fastest" for previews and "best" for final renders.
As with every ComputeBoard endpoint, candidates are filtered by capability, scored on latency, cost, and availability, and served with automatic failover. The response carries the standard computeboard metadata block reporting which model produced the image.
Planned request & response
Provide a prompt, the output size, and how many images to return with n. The response returns a data array of image results.
```bash
curl https://api.computeboard.xyz/v1/images/generations \
  -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "prompt": "A neon-pink GPU floating in a dark server room, retro pixel art",
    "size": "1024x1024",
    "n": 1
  }'
```
Each entry in data contains a url to the generated image (downloadable for a limited time after creation). When supported by the chosen model, a revised_prompt shows how the prompt was interpreted. Requesting more than one image with n returns multiple entries in the same array.
Use cases
- Marketing & social assets — generate on-brand hero images, thumbnails, and ad creatives from a short description.
- Product & concept design — explore visual concepts, mockups, and variations quickly before committing to a direction.
- Illustration & editorial — produce custom artwork for articles, blog posts, and documentation without stock-photo licensing.
- App & game content — create avatars, icons, textures, and placeholder art on demand.
- Personalization — render unique imagery per user, prompt, or campaign at scale.
The Responses API is a higher-level way to call ComputeBoard. Instead of assembling a message array yourself, you send a single input and get back a finished result — with the smart router, tools, and optional server-managed conversation state handled for you. It is compatible with the OpenAI Responses API, so existing Responses clients work by pointing at https://api.computeboard.xyz/v1 with your ck_live_ key.
Overview
Where Chat Completions is a stateless, message-in / message-out primitive, the Responses API is a stateful, task-oriented layer on top of it. You provide a single input (a string or a structured list), optionally attach tools, and the API runs the request through the router (selecting the best model when model is "auto") and returns a normalized result. To continue a conversation, pass the previous response's id as previous_response_id and the server reconstructs the context for you; you never have to resend the full history.
Every response includes a flat output_text for the common case where you just want the text, the full structured output array for tool calls and richer content, token usage, and the same computeboard metadata block you get from chat completions — so you always know which model served the request.
Request
Send your task as input. With model: "auto" the router chooses the best model; you can also use a policy ("cheapest", "fastest", "best") or pin a model slug.
```bash
curl https://api.computeboard.xyz/v1/responses \
  -H "Authorization: Bearer ck_live_xxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "input": "Summarize what an AI gateway does in two sentences."
  }'
```
Response
The result is normalized: read output_text for the plain answer, or walk the output array when you need to inspect tool calls and message parts. The id is what you pass as previous_response_id on the next turn.
```json
{
  "id": "resp_8x2pQ1vK4mZ",
  "object": "response",
  "created": 1751212800,
  "model": "claude-sonnet-4.5",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "An AI gateway is a single API that sits between your app and many model providers. It routes each request to the best available model, so you integrate once instead of wiring up every provider yourself."
        }
      ]
    }
```
Chat Completions vs Responses
Both endpoints run through the same router and return the computeboard meta. The difference is the level of abstraction: Chat Completions is a stateless primitive you control fully, while Responses manages state and orchestration for you.
| Aspect | Chat Completions | Responses |
|---|---|---|
| Endpoint | POST /v1/chat/completions | POST /v1/responses |
| Input | messages[] array you maintain | single input (string or list) |
| State | Stateless — you resend history each turn | Stateful — pass previous_response_id |
| Output | choices[].message.content | output[] + flat output_text |
| Tools | Manual: you loop tool calls yourself | Built-in tool orchestration |
| Best for | Full control, existing OpenAI chat code, custom agent loops | Multi-turn apps, agents, less plumbing |
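The State row of the table is the difference most visible in client code. The sketch below builds the next-turn payload both ways; the helper names `nextChatTurn` and `nextResponsesTurn` are hypothetical, and only the fields named in the table (messages, input, previous_response_id) are taken from the text.

```javascript
// Chat Completions: the client keeps the history and resends all of it.
function nextChatTurn(history, assistantReply, userMessage, model = "auto") {
  return {
    model,
    messages: [
      ...history,
      { role: "assistant", content: assistantReply },
      { role: "user", content: userMessage },
    ],
  };
}

// Responses: the client sends only the new input plus the previous response id.
function nextResponsesTurn(previousResponse, input, model = "auto") {
  return { model, input, previous_response_id: previousResponse.id };
}

const history = [{ role: "user", content: "What does an AI gateway do?" }];
const chatBody = nextChatTurn(history, "It routes requests.", "Give an example.");
const responsesBody = nextResponsesTurn({ id: "resp_8x2pQ1vK4mZ" }, "Give an example.");

console.log(chatBody.messages.length); // 3
console.log(responsesBody.previous_response_id); // resp_8x2pQ1vK4mZ
```

The chat payload grows with every turn, while the Responses payload stays the same size because the server holds the context.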
Set model: "auto" to let the router optimize each request.
Routing is the core of ComputeBoard. Every request is evaluated by a smart router that picks the best model for that specific prompt, scoring each candidate on latency, cost, quality, and availability, then serves the result and tells you exactly which model handled it and what it saved. You send one request to one endpoint; the router does the rest.
How routing works
Each request flows through the same pipeline. You send a standard, OpenAI-shaped request; the router decides which model should serve it; that model responds; and you receive an OpenAI-shaped result with a computeboard metadata block describing the decision.
1. Request: you send a request to https://api.computeboard.xyz/v1/chat/completions (or another routed endpoint) with model: "auto" or a routing policy.
2. Router
3. Scoring
4. Model selection
5. Response: you receive an OpenAI-shaped result with a computeboard block with routed_to, baseline, and saved_pct.
Scoring
For every eligible model, the router computes a live score from four signals. The relative weight of each signal shifts with the routing policy — for example "cheapest" weights cost most heavily, while "best" weights quality.
| Signal | Effect on the score |
|---|---|
| Latency | Measured time-to-first-token and total response time; faster models score higher, keeping interactive workloads responsive. |
| Cost | Live per-token input/output pricing; cheaper models score higher so you don't overpay for prompts a smaller model handles well. |
| Quality | Task-fit signals (reasoning, coding, vision, long-context); models strong enough for the prompt score higher. |
| Availability | Real-time provider health and remaining capacity; degraded or rate-limited models are penalized or excluded. |
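One way to picture policy-dependent weighting is a weighted sum over the four signals. The sketch below is an assumption-heavy illustration: the weight values, the pre-normalized 0 to 1 signal scale (higher is better), the placeholder model slugs, and the `score` and `pick` helpers are all invented for the example and are not ComputeBoard's actual formula.

```javascript
// Hypothetical weights per policy; the real router's weights are not documented here.
const WEIGHTS = {
  auto: { latency: 0.25, cost: 0.25, quality: 0.25, availability: 0.25 },
  cheapest: { latency: 0.1, cost: 0.6, quality: 0.2, availability: 0.1 },
  fastest: { latency: 0.6, cost: 0.1, quality: 0.2, availability: 0.1 },
  best: { latency: 0.1, cost: 0.1, quality: 0.7, availability: 0.1 },
};

// Weighted sum of the four signals, each assumed normalized to 0..1.
function score(model, policy) {
  const w = WEIGHTS[policy];
  return (
    w.latency * model.latency +
    w.cost * model.cost +
    w.quality * model.quality +
    w.availability * model.availability
  );
}

// Return the highest-scoring model for the given policy.
function pick(models, policy) {
  return models.reduce((best, m) => (score(m, policy) > score(best, policy) ? m : best));
}

// Placeholder candidates with made-up signal values.
const models = [
  { slug: "small-model", latency: 0.9, cost: 0.95, quality: 0.6, availability: 1.0 },
  { slug: "frontier-model", latency: 0.5, cost: 0.2, quality: 0.98, availability: 1.0 },
];

console.log(pick(models, "cheapest").slug); // small-model
console.log(pick(models, "best").slug); // frontier-model
```

The same two candidates win under different policies, which is the behavior the weighting is meant to produce.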
Routing policies
The model field controls how the router weights those signals. Use a policy keyword to optimize for a goal, or pass an exact model slug to pin one model and bypass routing entirely.
| model value | Optimizes for |
|---|---|
| "auto" | Balanced: best overall trade-off of quality, cost, speed, and reliability. The recommended default. |
| "cheapest" | Lowest cost among models that still clear the quality bar for the request. |
| "fastest" | Lowest latency: time-to-first-token and total response time. |
| "best" | Highest quality: the most capable model for the task, cost aside. |
| "<model-slug>" | Pins one specific model (e.g. claude-sonnet-4.5). No routing: used as-is, with failover only if it is down. |
```jsonc
// Balanced default: let the router decide
{ "model": "auto", "messages": [/* ... */] }

// Optimize for cost on bulk / background work
{ "model": "cheapest", "messages": [/* ... */] }

// Optimize for latency on interactive UX
{ "model": "fastest", "messages": [/* ... */] }

// Optimize for quality on hard reasoning tasks
{ "model": "best", "messages": [/* ... */] }

// Pin a specific model: bypass routing
{ "model": "claude-sonnet-4.5", "messages": [/* ... */] }
```
Savings
Every routed response reports how much it saved versus always calling a fixed premium model. The computeboard.baseline is that reference frontier model, and saved_pct is the percentage cheaper the routed model was for this request. When the router answers a simple prompt with a small, capable model, the savings are large; when a request genuinely needs a frontier model, the router uses one and saved_pct approaches zero.
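One plausible way to read saved_pct is as the relative cost difference between the routed model and the baseline for the same request, i.e. `(1 - routed_cost / baseline_cost) * 100`. The sketch below works through that arithmetic; the formula and the dollar figures are assumptions for illustration, not a documented billing rule.

```python
def saved_pct(routed_cost, baseline_cost):
    """Percentage cheaper the routed model was than the baseline (assumed formula)."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return round((1 - routed_cost / baseline_cost) * 100, 1)


# Hypothetical per-request costs in USD.
print(saved_pct(routed_cost=0.00038, baseline_cost=0.005))  # 92.4
print(saved_pct(routed_cost=0.005, baseline_cost=0.005))    # 0.0
```

The second call shows the frontier case: when the routed model is the baseline itself, the saving is zero.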
```jsonc
{
  // ...standard OpenAI chat completion fields...
  "model": "claude-haiku-4.5",
  "computeboard": {
    "routed_to": "claude-haiku-4.5", // the model that served the request
    "baseline": "gpt-5",             // the premium model compared against
    "saved_pct": 92.4                // % cheaper than the baseline for this request
  }
}
```
Fallbacks
Routing is also your reliability layer. The score already accounts for availability, but if the chosen model becomes unavailable, rate-limited, or errors at dispatch time, the router automatically retries with the next-best eligible model — so a single provider outage does not surface as an error to your users.
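The retry-with-next-best behavior can be pictured as a loop over candidates in score order. This is a minimal sketch of that idea, not the router's implementation: `ModelUnavailable`, `dispatch_with_failover`, `fake_send`, and the slugs `model-a` and `model-b` are hypothetical names.

```python
class ModelUnavailable(Exception):
    """Raised by a dispatcher when a model is down, rate-limited, or errors."""


def dispatch_with_failover(ranked_slugs, send):
    """Try each model in score order; return (slug, response) from the first success."""
    last_error = None
    for slug in ranked_slugs:
        try:
            return slug, send(slug)
        except ModelUnavailable as exc:
            last_error = exc  # fall through to the next-best model
    raise RuntimeError("all eligible models failed") from last_error


def fake_send(slug):
    """Stand-in for a provider call: the top-ranked model is rate-limited."""
    if slug == "model-a":
        raise ModelUnavailable("model-a is rate-limited")
    return f"response from {slug}"


routed_to, response = dispatch_with_failover(["model-a", "model-b"], fake_send)
print(routed_to)  # model-b
```

The caller only sees an error when every eligible model has failed, which matches the description above.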
computeboard.routed_to reflects the model that actually served the request. Pinning an exact slug disables routing, but failover still applies if that one model is unreachable.
ComputeBoard uses conventional HTTP status codes to signal the result of a request and returns an OpenAI-style JSON error body on every failure. Codes in the 2xx range indicate success, 4xx codes indicate a problem with your request (and usually contain a message explaining how to fix it), and 5xx codes indicate a transient problem on our side that is generally safe to retry.
Because the error shape matches OpenAI's, existing error-handling code written against the OpenAI SDK works unchanged. Every error includes a human-readable message, a machine-readable type, and a stable code you can branch on.
Error codes
The table below lists every status code ComputeBoard can return, what it means, and how to resolve it.
| Status | Type | Meaning | How to fix |
|---|---|---|---|
| 400 | invalid_request_error | The request was malformed — a missing field, an unknown parameter, or an invalid value (for example an unknown model slug). | Read the message field; it names the offending parameter. Fix the payload and resend. |
| 401 | authentication_error | The API key is missing, malformed, or invalid. | Send a valid key as Authorization: Bearer ck_live_…. Create or rotate keys in the dashboard. |
| 403 | permission_error | The key is valid but not permitted to perform this action (for example a restricted model or a disabled feature). | Check the key's permissions and your plan, or use a key with the required scope. |
| 404 | not_found_error | The requested resource or endpoint does not exist. | Verify the URL and path. Chat completions live at /v1/chat/completions. |
| 429 | rate_limit_error | You exceeded your requests-per-minute, tokens-per-minute, or monthly quota. | Back off and retry after the X-RateLimit-Reset window, or upgrade your plan. |
| 500 | server_error | An unexpected error occurred inside ComputeBoard. The router could not complete the request. | Retry with exponential backoff. If it persists, contact support with the request id. |
Error response shape
Every error response is a JSON object with a single top-level error key. The HTTP status code and the error.code field always agree, so you can branch on either.
```json
{
  "error": {
    "message": "Incorrect API key provided. You can find your key in the dashboard.",
    "type": "authentication_error",
    "code": "invalid_api_key"
  }
}
```
For requests that pass through the router, a unique request identifier is returned in the x-request-id response header. Include it when contacting support — it lets us trace the exact request through the routing pipeline.
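If you are not using an SDK, the error body can be parsed directly. The sketch below maps a status code and error body to a next step; the helper name `next_action` and the three action labels are assumptions chosen for the example.

```python
import json


def next_action(status, body):
    """Return (action, code) for an error response with the documented shape."""
    error = json.loads(body).get("error", {})
    code = error.get("code", "")
    if status == 401:
        action = "reauthenticate"   # key missing, malformed, or invalid
    elif status == 429 or status >= 500:
        action = "retry"            # transient: back off and try again
    else:
        action = "fix_request"      # remaining 4xx: change the request first
    return action, code


body = json.dumps({
    "error": {
        "message": "Incorrect API key provided. You can find your key in the dashboard.",
        "type": "authentication_error",
        "code": "invalid_api_key",
    }
})

print(next_action(401, body))  # ('reauthenticate', 'invalid_api_key')
```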
Handling errors
Inspect the HTTP status and the error.code to decide whether to fix the request, re-authenticate, or retry. The OpenAI SDKs throw typed exceptions you can catch directly.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

try {
  const res = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: "Hello" }],
  });
  console.log(res.choices[0].message.content);
} catch (err) {
  // The OpenAI SDK surfaces status + the parsed error body.
  if (err.status === 401) {
    // Illustrative handling: the key is missing or invalid, so retrying will not help.
    console.error("Check your API key:", err.message);
  } else if (err.status === 429 || err.status >= 500) {
    // Illustrative handling: transient, safe to retry with backoff.
    console.error("Transient error, retry with backoff:", err.message);
  } else {
    throw err;
  }
}
```
Treat 429 and 5xx responses as transient. Retry them with exponential backoff and jitter — for example wait 1s, then 2s, then 4s — and give up after a few attempts. Never retry 400, 401, or 403; those will keep failing until you fix the request or key.
Rate limits protect the platform and keep latency predictable for everyone. ComputeBoard meters three independent dimensions — requests per minute, tokens per minute, and a monthly token quota — and tells you exactly where you stand on every response through a set of X-RateLimit-* headers.
Limits
Your account is bound by three limits, evaluated together. Whichever you hit first applies:
| Limit | What it counts |
|---|---|
| RPM | Requests per minute — the number of API calls you can make in a rolling 60-second window, regardless of size. |
| TPM | Tokens per minute — the total prompt + completion tokens you can process in a rolling 60-second window. |
| Quota | A monthly cap on total tokens (or spend) for the account. Resets at the start of each billing period. |
Limits scale with your plan. The figures below are representative starting points — your live limits are always shown on the Usage page in the dashboard.
| Plan | RPM | TPM | Monthly quota |
|---|---|---|---|
| Free | 60 | 60,000 | 2,000,000 tokens |
| Pro | 600 | 1,000,000 | 100,000,000 tokens |
| Enterprise | Custom | Custom | Unlimited / negotiated |
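To make "whichever you hit first applies" concrete, here is a small client-side sketch that checks a pending request against all three limits. It uses the representative Free-plan figures from the table; the function name, the check order, and the idea of tracking these counters yourself are assumptions, and the server remains the source of truth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    rpm: int
    tpm: int
    monthly_tokens: int


# Representative Free-plan figures from the table above.
FREE = Limits(rpm=60, tpm=60_000, monthly_tokens=2_000_000)


def first_limit_hit(limits, requests_last_min, tokens_last_min,
                    tokens_this_month, next_request_tokens):
    """Return the first limit the next request would exceed, or None if it fits."""
    if requests_last_min + 1 > limits.rpm:
        return "rpm"
    if tokens_last_min + next_request_tokens > limits.tpm:
        return "tpm"
    if tokens_this_month + next_request_tokens > limits.monthly_tokens:
        return "quota"
    return None


print(first_limit_hit(FREE, 60, 0, 0, 100))       # rpm
print(first_limit_hit(FREE, 10, 59_950, 0, 100))  # tpm
print(first_limit_hit(FREE, 10, 0, 0, 100))       # None
```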
Rate limit headers
Every response includes headers describing your current limit and remaining budget for the window. Read them to pace your traffic before you hit a 429.
| Header | Description |
|---|---|
| X-RateLimit-Limit | The maximum number of requests permitted in the current window. |
| X-RateLimit-Remaining | The number of requests remaining in the current window. |
| X-RateLimit-Reset | When the window resets and your budget refills, as Unix epoch seconds (or as seconds remaining). |
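A client can read these headers to decide whether to pause and for how long. The sketch below assumes the headers arrive as a plain dict with the exact names shown, and uses a simple heuristic (an assumption, not documented behavior) to tell an epoch timestamp from a seconds-remaining value.

```python
import time


def seconds_until_reset(headers, now=None):
    """Seconds to wait until the window resets, never negative."""
    now = time.time() if now is None else now
    reset = float(headers["X-RateLimit-Reset"])
    # Heuristic (assumption): values this large are epoch timestamps,
    # smaller values are already "seconds remaining".
    if reset > 1_000_000_000:
        return max(0.0, reset - now)
    return max(0.0, reset)


def should_pause(headers):
    """True when no requests remain in the current window."""
    return int(headers["X-RateLimit-Remaining"]) <= 0


headers = {
    "X-RateLimit-Limit": "600",
    "X-RateLimit-Remaining": "598",
    "X-RateLimit-Reset": "1751212860",
}

print(should_pause(headers))                         # False
print(seconds_until_reset(headers, now=1751212800))  # 60.0
```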
```http
HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 598
X-RateLimit-Reset: 1751212860
```
Handling 429
When you exceed a limit, ComputeBoard returns 429 with a rate_limit_error. The correct response is to wait until the reset window and retry with exponential backoff and jitter, so a burst of clients does not all retry at the same instant.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function withBackoff(fn, { retries = 5, base = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable = err.status === 429 || err.status >= 500;
      if (!retryable || attempt >= retries) throw err;
      // Illustrative completion: exponential backoff with full jitter,
      // waiting a random time up to base * 2^attempt milliseconds.
      const delay = Math.random() * base * 2 ** attempt;
      await sleep(delay);
    }
  }
}
```
ComputeBoard does not need a bespoke SDK. Because the API is OpenAI-compatible, any OpenAI client library — official or community — works out of the box. Install the SDK for your language, point its base_url at https://api.computeboard.xyz/v1, and use a ck_live_ key. That is the only change.
Install
Install the official OpenAI SDK for your language:
```bash
npm install openai
```
Usage
Configure the client with the ComputeBoard base URL and your key, then call chat completions with model: "auto" to let the router choose. The request and response are the standard OpenAI shape.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "ck_live_xxxxxxxxxxxxxxxxxxxxxxxx",
  baseURL: "https://api.computeboard.xyz/v1",
});

const res = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Hello from ComputeBoard" }],
});

console.log(res.choices[0].message.content);
```
Configuration
Only these settings differ from a default OpenAI client. Everything else is the SDK default.
| Setting | Value | Notes |
|---|---|---|
| base_url | https://api.computeboard.xyz/v1 | Required. Routes all requests through ComputeBoard. |
| api_key | ck_live_… | Required. Sent as Authorization: Bearer. Create one in the dashboard. |
| timeout | 60s (recommended) | Set explicitly. The official OpenAI SDKs default to 10 minutes, so 60s fails faster on interactive traffic; raise it for long generations or large reasoning prompts. |
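To show what these settings amount to on the wire, the sketch below builds the equivalent raw HTTP request with only the Python standard library. It constructs the request without sending it; the helper name `build_chat_request` is hypothetical, and the key is the same placeholder used above.

```python
import json
import urllib.request

BASE_URL = "https://api.computeboard.xyz/v1"


def build_chat_request(api_key, content, model="auto"):
    """Build (but do not send) a chat completions request."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("ck_live_xxxxxxxxxxxxxxxxxxxxxxxx", "Hello from ComputeBoard")
print(req.get_method(), req.full_url)
```

Sending it would be a call such as `urllib.request.urlopen(req, timeout=60)`, where the timeout argument plays the role of the timeout row in the table.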
computeboard routing metadata.
Practical, copy-pasteable recipes for common workloads. Every example runs against https://api.computeboard.xyz/v1 with a ck_live_ key and uses the smart router — either "auto" or an explicit class like "best" — so you reach the right model without hard-coding one.
Chatbot
Hold a multi-turn conversation by sending the full message history each turn. Keep the running array of messages and append the assistant's reply before the next user turn.
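```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

// Conversation state: persists across turns.
const messages = [
  { role: "system", content: "You are a concise, friendly support assistant." },
];

async function ask(userText) {
  messages.push({ role: "user", content: userText });
  const res = await client.chat.completions.create({
    model: "auto",
    messages,
  });
  // Illustrative completion: append the assistant's reply before the next user turn.
  const reply = res.choices[0].message;
  messages.push({ role: "assistant", content: reply.content });
  return reply.content;
}
```
Summarization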
Condense a long document into a few bullet points. Put the instruction in the system message and the source text in the user message; "auto" will pick a long-context model when the input is large.
```python
from openai import OpenAI

client = OpenAI(
    api_key="ck_live_xxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.computeboard.xyz/v1",
)

with open("report.txt", "r", encoding="utf-8") as f:
    document = f.read()

res = client.chat.completions.create(
    model="auto",
    messages=[
        {
            "role": "system",
            "content": "Summarize the user's document into 5 concise bullet points. "
            "Preserve key numbers and names.",
        },
        {"role": "user", "content": document},
    ],
)

print(res.choices[0].message.content)
```
Translation
Use a system prompt to fix the target language and tone, then pass the text to translate. This keeps the instruction separate from user-supplied content.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

async function translate(text, targetLanguage) {
  const res = await client.chat.completions.create({
    model: "auto",
    messages: [
      {
        role: "system",
        content:
          `You are a professional translator. Translate the user's text into ${targetLanguage}. ` +
          "Return only the translation, preserving formatting and proper nouns.",
      },
      { role: "user", content: text },
    ],
  });
  return res.choices[0].message.content;
}

console.log(await translate("Smart routing keeps your costs low.", "Japanese"));
```
Coding
For code generation, route to "best" so the request lands on a frontier coding model. Function and tool calling work exactly as in the OpenAI API.
```python
from openai import OpenAI

client = OpenAI(
    api_key="ck_live_xxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.computeboard.xyz/v1",
)

res = client.chat.completions.create(
    model="best",  # frontier-class for hard code generation
    messages=[
        {
            "role": "system",
            "content": "You are an expert Python engineer. Output a single, complete function "
            "with type hints and a docstring. No prose.",
        },
        {
            "role": "user",
            "content": "Write a function that merges two sorted lists into one sorted list "
            "in O(n) time without using sorted().",
        },
    ],
)

print(res.choices[0].message.content)
```
Reasoning
Hard multi-step problems benefit from a strong model. Route to "best" and ask the model to work through the problem before giving the final answer.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

const res = await client.chat.completions.create({
  model: "best", // route to a frontier reasoning model
  messages: [
    {
      role: "user",
      content:
        "A train leaves City A at 9:00 traveling 60 km/h. Another leaves City B, " +
        "300 km away, at 9:30 traveling 90 km/h toward A. At what clock time do they meet? " +
        "Reason step by step, then give the final time on its own line.",
    },
  ],
});

console.log(res.choices[0].message.content);
```
Vision
Send an image alongside text by using the structured content array with an image_url part. The router selects a vision-capable model automatically; only models that support images will be considered.
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.COMPUTEBOARD_API_KEY,
  baseURL: "https://api.computeboard.xyz/v1",
});

const res = await client.chat.completions.create({
  model: "auto", // router restricts to vision-capable models for image input
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image? Describe it in one sentence." },
        {
          type: "image_url",
          // Placeholder URL (assumption); replace with your own image.
          image_url: { url: "https://example.com/photo.jpg" },
        },
      ],
    },
  ],
});

console.log(res.choices[0].message.content);
```
Webhooks let ComputeBoard push events to your server the moment they happen — usage thresholds, key lifecycle changes, and completed requests — so you can react without polling. This feature is in active development and is not yet available.
Overview
Once available, you will register one or more endpoint URLs in the dashboard and subscribe each to the event types you care about. ComputeBoard will deliver a signed JSON payload over HTTPS for every matching event. Planned event types include:
- usage.threshold — fired when your spend or token usage crosses a configured percentage of your monthly quota (for example 75% or 90%), so you can alert or throttle before hitting the cap.
- key.created — fired when a new API key is created on the account, for audit and security automation.
- request.completed — fired after a completion finishes, carrying routing and usage metadata (which model served it, tokens, latency, and savings) for downstream analytics.
Planned payload
Each delivery will be a JSON object with a stable envelope — an event id, type, created timestamp, and a data object whose contents depend on the event type.
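Because the envelope is stable and only data varies, a receiver can dispatch on the type field. The sketch below is one way that might look once webhooks ship; the registry, the `on` decorator, and the handler are hypothetical, and the field names follow the planned payload, which may change before launch.

```python
import json

HANDLERS = {}


def on(event_type):
    """Register a handler function for one event type."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register


@on("request.completed")
def handle_completed(data):
    return f"{data['routed_to']} saved {data['saved_pct']}%"


def handle_delivery(raw_body):
    """Parse the envelope and route data to the matching handler."""
    event = json.loads(raw_body)
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return None  # ignore event types this endpoint does not handle
    return handler(event["data"])


delivery = json.dumps({
    "id": "evt_3sK1pZ9aVbN",
    "type": "request.completed",
    "created": 1751212800,
    "data": {"routed_to": "claude-haiku-4.5", "saved_pct": 92.4},
})

print(handle_delivery(delivery))  # claude-haiku-4.5 saved 92.4%
```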
```json
{
  "id": "evt_3sK1pZ9aVbN",
  "type": "request.completed",
  "created": 1751212800,
  "data": {
    "request_id": "chatcmpl_8x2pQ1vK4mZ",
    "routed_to": "claude-haiku-4.5",
    "baseline": "gpt-5",
    "saved_pct": 92.4,
    "usage": {
      "prompt_tokens": 18,
      "completion_tokens": 31,
      "total_tokens": 49
    },
    "latency_ms": 412
  }
}
```
Verifying signatures
Every delivery will be signed so you can confirm it genuinely came from ComputeBoard and was not tampered with. The plan is an HMAC-SHA256 signature of the raw request body, keyed with your endpoint's signing secret and sent in an X-ComputeBoard-Signature header. Verify it by recomputing the HMAC over the exact bytes you received and comparing with a constant-time check before trusting the payload.
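For Python receivers, the same check can be sketched with the standard library. This assumes the signature is the hex-encoded HMAC-SHA256 digest of the raw body; the encoding, like the header name, is part of the plan and could change before launch. The secret `whsec_example` is a made-up value.

```python
import hashlib
import hmac


def verify_webhook(raw_body, signature_header, signing_secret):
    """Recompute the HMAC over the raw bytes and compare in constant time."""
    expected = hmac.new(
        signing_secret.encode("utf-8"), raw_body, hashlib.sha256
    ).hexdigest()
    received = (signature_header or "").encode("utf-8")
    return hmac.compare_digest(expected.encode("utf-8"), received)


secret = "whsec_example"  # hypothetical signing secret
body = b'{"id": "evt_3sK1pZ9aVbN", "type": "request.completed"}'
good_signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()

print(verify_webhook(body, good_signature, secret))         # True
print(verify_webhook(body + b" ", good_signature, secret))  # False
```

Passing the raw bytes matters: re-serializing parsed JSON can change whitespace or key order and invalidate the signature.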
```javascript
import crypto from "node:crypto";

// Planned verification — header and algorithm subject to change before launch.
function verifyWebhook(rawBody, signatureHeader, signingSecret) {
  const expected = crypto
    .createHmac("sha256", signingSecret)
    .update(rawBody, "utf8")
    .digest("hex");
  // constant-time comparison guards against timing attacks
  const a = Buffer.from(signatureHeader);
  const b = Buffer.from(expected);
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// In an Express handler, use the RAW body (not the parsed JSON):
// const ok = verifyWebhook(req.rawBody, req.header("X-ComputeBoard-Signature"), SECRET);
// if (!ok) return res.status(400).send("invalid signature");
```
Answers to the questions we hear most often about ComputeBoard — compatibility, routing, pricing, privacy, and the feature surface. If something is not covered here, reach out from the dashboard.
Is it really OpenAI-compatible?
Yes. ComputeBoard implements the OpenAI Chat Completions API exactly, including streaming, function and tool calling, and the standard request and response objects. You use the official OpenAI SDKs unchanged — the only difference is the base URL (https://api.computeboard.xyz/v1) and your ck_live_ key. The one addition is a small computeboard object on each response describing how the request was routed; it is purely additive and safe to ignore.
How does routing pick a model?
When you send model: "auto", the router first filters to models that can actually serve the request — matching required capabilities like vision, tool calling, and context length — then scores each candidate on four live signals: latency, cost, availability, and performance for the task. The highest-scoring healthy model wins. You can also bias the decision with the class shortcuts "cheapest", "fastest", or "best".
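The filtering step that precedes scoring can be pictured as a simple capability match. The catalog entries, field names, and slugs below are invented for illustration and do not describe the real model list.

```python
# Hypothetical catalog; real capabilities and context sizes will differ.
CATALOG = [
    {"slug": "text-small", "vision": False, "tools": True, "context": 32_000},
    {"slug": "vision-large", "vision": True, "tools": True, "context": 200_000},
]


def eligible(models, *, needs_vision=False, needs_tools=False, prompt_tokens=0):
    """Keep only models that can actually serve the request."""
    return [
        m["slug"]
        for m in models
        if (m["vision"] or not needs_vision)
        and (m["tools"] or not needs_tools)
        and m["context"] >= prompt_tokens
    ]


print(eligible(CATALOG))                        # ['text-small', 'vision-large']
print(eligible(CATALOG, needs_vision=True))     # ['vision-large']
print(eligible(CATALOG, prompt_tokens=50_000))  # ['vision-large']
```

Only the models that survive this filter would go on to be scored.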
Can I pin a specific model?
Absolutely. Pass an exact model slug (for example claude-haiku-4.5 or gpt-5) as the model parameter and ComputeBoard sends the request straight to that model with no routing. This is useful when you need deterministic behavior, reproducibility, or a model with a specific capability. You can mix pinned and routed requests freely.
How much can I save?
It depends on your traffic. Many requests do not need a frontier model, and routing those to a cheaper-but-capable model can cut spend dramatically — often 50–90% on the eligible portion of traffic. Each response reports a saved_pct versus a fixed-frontier baseline, and the dashboard aggregates total savings over time so you can measure the real number for your workload rather than rely on an estimate.
Do you store my prompts or data?
ComputeBoard does not train on your data or sell it. Prompts and completions are processed to serve the request and to compute usage and routing metadata; we retain only what is needed to operate the service, meter billing, and provide analytics. We do not use your content to improve models. Enterprise plans support custom retention and data-handling terms.
What about latency overhead?
The routing decision is computed from pre-aggregated, continuously updated signals, so it adds only a few milliseconds before dispatch — negligible next to model inference time. In practice ComputeBoard often reduces end-to-end latency, because it steers around slow or degraded providers and can prefer a faster model when you route with "fastest" or "auto".
Which SDKs work?
Any OpenAI-compatible client. That includes the official OpenAI SDKs for JavaScript/TypeScript, Python, and Go, community libraries such as async-openai for Rust, and frameworks like LangChain, LlamaIndex, and the Vercel AI SDK that accept a custom base URL. If a tool can talk to OpenAI, it can talk to ComputeBoard.
How are tokens and billing counted?
Billing is metered on prompt and completion tokens, the same usage object the OpenAI API returns. Because routing may select a cheaper model, your effective cost per request is frequently lower than always using a frontier model. Every response includes a usage block, and the dashboard shows per-day, per-model, and per-key breakdowns so you can attribute spend precisely.
What happens if a model is down?
The router tracks provider health in real time. If a model is rate-limited, slow, or unavailable, it is scored down or excluded, and the request automatically fails over to the next-best healthy model instead of returning an error. This built-in redundancy is one of the main reasons teams put ComputeBoard in front of their model calls.
Do you support streaming, function calling, and vision?
Yes to all three. Streaming works via Server-Sent Events exactly as in the OpenAI API (set stream: true). Function and tool calling are passed through to any model that supports them. Vision works by sending image content parts; the router restricts the candidate set to vision-capable models for those requests.
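For clients that read the stream without an SDK, the sketch below assembles text from OpenAI-style Server-Sent Events lines: each event is a `data:` line carrying a JSON chunk, and the stream ends with `data: [DONE]`. The sample lines are hand-written for the example, not captured output.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from OpenAI-style SSE data lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        parts.append(choices[0].get("delta", {}).get("content") or "")
    return "".join(parts)


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]

print(collect_stream_text(sample))  # Hello
```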
How do rate limits work?
Each account has a requests-per-minute (RPM) limit, a tokens-per-minute (TPM) limit, and a monthly token quota, all scaled by your plan. Every response carries X-RateLimit-* headers so you can pace traffic, and exceeding a limit returns a 429 you should retry with backoff. See the Rate Limits page for details.
Is there a free tier?
Yes. The Free plan lets you try ComputeBoard with a modest RPM/TPM and a monthly token allowance — enough to integrate, test routing, and validate savings before you upgrade. When you need higher limits, the Pro and Enterprise plans raise your RPM, TPM, and quota.
Notable changes to the ComputeBoard API and platform. We follow semantic versioning for the API surface and announce backward-incompatible changes here in advance.
v1.0.0 — Initial Release (June 29, 2026)
The first public release of ComputeBoard: one OpenAI-compatible endpoint, intelligent routing across every major model, and a full dashboard. Live at https://api.computeboard.xyz.
- OpenAI-compatible Chat Completions — POST /v1/chat/completions with the standard request and response shape, plus Server-Sent Events streaming via stream: true.
- Smart routing — send model: "auto" to route per request on latency, cost, availability, and performance, or use the class shortcuts "cheapest", "fastest", and "best".
- 8+ models — frontier and efficient models from leading providers, all reachable through one key, with automatic failover when a provider is degraded.
- API keys & dashboard — create, name, rotate, and revoke ck_live_ keys; manage everything from the web dashboard.
- Usage analytics — per-day, per-model, and per-key token and cost breakdowns, plus realized savings versus a fixed-frontier baseline.
- GPU marketplace — the foundation for renting and offering compute capacity that backs the routing network.
Coming soon
On the near-term roadmap. Dates are not yet committed; follow this page for announcements.
- Embeddings — POST /v1/embeddings for vector search and retrieval.
- Image Generation — text-to-image through the same routed endpoint.
- Responses API (GA) — the stateful Responses interface promoted to general availability.
- Webhooks — push events for usage thresholds, key lifecycle, and completed requests.
- More models & SDKs — an expanding model catalog and first-class SDK helpers.
