In development · Q2 2026

Inference Gateway

Deploy any open-source LLM with a single API call.

Production-grade model serving without the ops burden. Point CogniCloud Inference Gateway at any Hugging Face model ID or private weights, and get an OpenAI-compatible endpoint in seconds. Auto-scales to zero when idle and bursts to thousands of replicas under load.
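As a sketch of what that single call might look like — assuming a hypothetical gateway URL and model ID, with the request shape following the OpenAI Chat Completions convention:

```python
import json

# Hypothetical values -- substitute your own gateway URL and model ID.
GATEWAY_URL = "https://gateway.cognicloud.example/v1"
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"  # any Hugging Face model ID

def build_chat_request(base_url: str, model: str, prompt: str) -> tuple[str, str]:
    """Build the URL and JSON body for an OpenAI-style chat completion call."""
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = build_chat_request(GATEWAY_URL, MODEL_ID, "Hello!")
# Send with any HTTP client (urllib, requests, or an OpenAI SDK pointed at
# base_url); the response shape matches OpenAI's.
```

Because the endpoint is OpenAI-compatible, existing OpenAI client libraries can be pointed at the gateway via their `base_url` option rather than hand-building requests.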

Capabilities

Everything you need, nothing you don't.

1. OpenAI-compatible API

Drop-in replacement for the OpenAI Chat Completions and Embeddings APIs. Switch providers by changing one URL — no SDK changes required.

2. Continuous batching

vLLM-powered continuous batching keeps the GPU saturated by processing tokens from multiple requests in the same forward pass, delivering 3–5× higher throughput than naive request-level batching.
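A toy cost model illustrates where the gain comes from. Assume one forward pass costs one time unit regardless of how many sequences share the batch (up to a capacity limit), and that queued requests join as soon as a slot frees up:

```python
def sequential_passes(lengths: list[int]) -> int:
    """Naive serving: one request at a time, so forward passes add up."""
    return sum(lengths)

def continuous_batching_passes(lengths: list[int], capacity: int) -> int:
    """Continuous batching: each pass advances every active sequence by one
    token, and queued requests are admitted as soon as a slot frees up."""
    queue, active, passes = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < capacity:
            active.append(queue.pop(0))   # admit waiting requests mid-flight
        passes += 1                       # one shared forward pass
        active = [n - 1 for n in active if n > 1]
    return passes

jobs = [8, 8, 8, 8]                       # four requests, 8 decode steps each
naive = sequential_passes(jobs)           # 32 passes
batched = continuous_batching_passes(jobs, capacity=4)   # 8 passes
```

In this idealised model the speedup equals the batch occupancy; real gains depend on request mix, sequence lengths, and memory pressure.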

3. Speculative decoding

Pair a small draft model with your main model to predict multiple tokens per forward pass. Reduces median time to first token (TTFT) by up to 40% with no quality degradation.
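A character-level toy shows the mechanic. Both "models" here are lookup tables over a fixed continuation (a stand-in for real sampling), with the draft wrong at one position; the target verifies k drafted tokens in a single forward pass and emits one extra token of its own:

```python
TARGET = list("the quick brown fox jumps")   # what the large model would emit
DRAFT  = list("the quick brawn fox jumps")   # cheap draft, wrong at one spot

def speculative_decode(k: int = 4) -> tuple[str, int]:
    out: list[str] = []
    target_passes = 0
    while len(out) < len(TARGET):
        i = len(out)
        proposal = DRAFT[i:i + k]            # draft proposes k tokens cheaply
        target_passes += 1                   # one target pass verifies all k
        n = 0
        while n < len(proposal) and proposal[n] == TARGET[i + n]:
            n += 1
        out += TARGET[i:i + n]               # keep the matching prefix
        if len(out) < len(TARGET):
            out.append(TARGET[len(out)])     # same pass yields one more token
    return "".join(out), target_passes

text, passes = speculative_decode()          # identical output, fewer passes
```

The output is token-for-token identical to decoding with the target alone; only the number of expensive target passes shrinks, which is why quality is unaffected.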

4. Scale to zero

Deployments with no traffic incur zero cost. Sub-2-second cold start means the first request after idle still meets most latency SLOs.
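A minimal sketch of such a scaling rule — the per-replica concurrency of 32 and the 60-second idle timeout are illustrative assumptions, not documented defaults:

```python
import math

def desired_replicas(in_flight: int, idle_seconds: float,
                     per_replica: int = 32, idle_timeout: float = 60.0) -> int:
    """Target replica count for an autoscaler that can reach zero.

    With no in-flight requests the deployment drains to zero once the
    (assumed) idle timeout elapses; under load it provisions enough
    replicas to cover in-flight requests at the assumed concurrency.
    """
    if in_flight == 0:
        return 0 if idle_seconds >= idle_timeout else 1
    return math.ceil(in_flight / per_replica)
```

With this rule, a deployment idle for two minutes holds zero replicas, and a burst of 100 concurrent requests provisions four.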

5. Streaming responses

Server-sent events streaming out of the box. Compatible with any SSE client library, enabling real-time progressive text generation in your UI.
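For reference, a stdlib-only parser for an OpenAI-style SSE body; the `data:` field and the `[DONE]` sentinel follow OpenAI's streaming format, and the sample chunks below are made up:

```python
import json

def iter_sse_tokens(raw: str):
    """Yield content deltas from an OpenAI-style SSE stream body."""
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue            # skip blank keep-alives and other fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break               # OpenAI streams end with a [DONE] sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Fabricated two-chunk stream for illustration:
stream = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
text = "".join(iter_sse_tokens(stream))
```

In practice an SSE client library hands you these events incrementally, so each delta can be appended to the UI as it arrives.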

6. Model quantisation

Serve INT4, INT8, and FP8 quantised models to reduce VRAM footprint and increase throughput. AWQ and GPTQ quantisation supported natively.
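The VRAM saving is straightforward arithmetic over bytes per parameter (weights only; KV cache and activation memory come on top):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gib(n_params: float, dtype: str) -> float:
    """Weights-only memory estimate in GiB; excludes KV cache/activations."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3

params_7b = 7e9                                 # a 7B-parameter model
fp16_gib = weight_vram_gib(params_7b, "fp16")   # ~13.0 GiB
int4_gib = weight_vram_gib(params_7b, "int4")   # ~3.3 GiB, a 4x reduction
```

Real AWQ/GPTQ checkpoints also store scales and zero-points, so the observed saving lands slightly under the ideal 4× shown here.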

Technical Specifications

Under the hood.

API format: OpenAI Chat Completions (v1)
Serving engine: vLLM / TensorRT-LLM
Cold start: < 2 seconds
Max context: 128k tokens (model dependent)
Streaming: SSE, token-by-token
Quantisation: FP8, INT8, INT4 (AWQ/GPTQ)
Multi-LoRA: Hot-swap LoRA adapters per request
Min replicas: 0 (scale to zero)
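On the multi-LoRA point, here is how per-request adapter selection might look from the client side — a sketch assuming vLLM-style naming, where the request's `model` field names a registered adapter (the adapter names below are made up):

```python
import json

def request_for_adapter(adapter: str, prompt: str) -> str:
    """Chat request body whose "model" field selects a LoRA adapter."""
    return json.dumps({
        "model": adapter,   # hypothetical adapter name registered at deploy time
        "messages": [{"role": "user", "content": prompt}],
    })

# Two requests served by the same base weights with different adapters:
support = json.loads(request_for_adapter("llama-3-8b-support", "Reset my password"))
legal = json.loads(request_for_adapter("llama-3-8b-legal", "Summarise this clause"))
```

Because only the low-rank adapter weights differ per request, many fine-tunes can share one GPU-resident base model.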

Inference Gateway is currently in development — estimated Q2 2026.

Pricing is not yet published; we currently offer tailored plans only.

Get notified at launch

Be first to shape the future.

CogniCloud is in active development. Join the waitlist to get early access and stay updated on our roadmap. No pricing yet — we'll work with each team to find the right fit.

No spam. No pricing pitches. We reach out personally to discuss your use case.
