Deploy any open-source LLM with a single API call.
Production-grade model serving without the ops burden. Point CogniCloud Inference Gateway at any Hugging Face model ID or private weights, and get an OpenAI-compatible endpoint in seconds. Auto-scales to zero when idle and bursts to thousands of replicas under load.
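The deployment API isn't published yet, so the snippet below is only a sketch: the host, path, and field names are hypothetical placeholders. The part taken from the description above is the idea itself, one call with a Hugging Face model ID and a scale-to-zero floor.

```python
# Hypothetical sketch only: host, path, and field names are illustrative placeholders,
# not a published CogniCloud API. The point is the single deployment call.
import os
import requests

resp = requests.post(
    "https://api.cognicloud.example/v1/deployments",     # placeholder host and path
    headers={"Authorization": f"Bearer {os.environ['COGNICLOUD_API_KEY']}"},
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",   # any Hugging Face model ID
        "min_replicas": 0,                               # scale to zero when idle
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # would include the OpenAI-compatible endpoint URL
```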
Drop-in replacement for the OpenAI Chat Completions and Embeddings APIs. Switch providers by changing one URL — no SDK changes required.
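In practice, the swap with the official OpenAI Python SDK would look something like this; the `base_url` shown is a placeholder, since the gateway's real endpoint URL isn't public yet.

```python
# Same OpenAI SDK, same calls -- only the base_url (placeholder here) and key change.
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.cognicloud.example/v1",  # placeholder endpoint URL
    api_key="YOUR_COGNICLOUD_KEY",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Hugging Face model ID as the model name
    messages=[{"role": "user", "content": "Summarise continuous batching in one line."}],
)
print(resp.choices[0].message.content)
```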
vLLM-powered continuous batching keeps GPUs saturated by interleaving tokens from many in-flight requests in every forward pass, delivering 3–5× higher throughput than naive static batching.
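To make the contrast concrete, here is a toy scheduling sketch (sequence lengths only, no real tokens or GPUs): static batching holds short requests hostage to the longest one in the batch, while continuous batching refills freed slots at every step. This illustrates the idea only, not how vLLM itself is implemented.

```python
# Toy model of the scheduling idea: each request is just a number of tokens to generate.
from collections import deque

def static_batching_steps(lengths, max_batch=4):
    # Naive: fill a batch, then run until the *longest* request in it finishes.
    steps, queue = 0, deque(lengths)
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        steps += max(batch)            # short requests idle while the longest runs
    return steps

def continuous_batching_steps(lengths, max_batch=4):
    # Iteration-level: finished requests free their slot and new ones join mid-flight.
    steps, queue, running = 0, deque(lengths), []
    while queue or running:
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        running = [r - 1 for r in running if r - 1 > 0]  # one token per request per step
        steps += 1
    return steps

lengths = [64, 8, 8, 8, 64, 4, 4, 4]
print(static_batching_steps(lengths), continuous_batching_steps(lengths))
```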
Pair a small draft model with your main model: the draft proposes several tokens and the main model verifies them in a single forward pass. Cuts median generation latency by up to 40% with no quality degradation.
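In outline, the draft-and-verify loop looks like the sketch below (greedy decoding for simplicity); `draft_next` and `target_next` are hypothetical stand-ins for the two models, not part of any published SDK.

```python
# Greedy-decoding sketch of speculative decoding; draft_next and target_next are
# hypothetical callables that each return the next token given a context.
def speculative_step(context, draft_next, target_next, k=4):
    # 1. The small draft model cheaply proposes k tokens.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. The main model verifies them (a real engine scores all k positions in one
    #    forward pass). The agreeing prefix is accepted; the first disagreement is
    #    replaced by the main model's own token, so output matches plain decoding.
    accepted, ctx = [], list(context)
    for tok in proposed:
        verified = target_next(ctx)
        accepted.append(verified)
        ctx.append(verified)
        if verified != tok:
            break
    return accepted  # between 1 and k tokens per main-model step
```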
Deployments with no traffic incur zero cost. Sub-2-second cold start means the first request after idle still meets most latency SLOs.
Server-sent events (SSE) streaming out of the box. Compatible with any SSE client library, enabling real-time progressive text generation in your UI.
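Assuming the endpoint keeps OpenAI semantics, streaming is the SDK's standard `stream=True` flag; the base URL below is again a placeholder.

```python
# Token-by-token streaming over SSE via the OpenAI SDK's standard stream flag.
from openai import OpenAI

client = OpenAI(base_url="https://gateway.cognicloud.example/v1",  # placeholder URL
                api_key="YOUR_COGNICLOUD_KEY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```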
Serve INT4, INT8, and FP8 quantised models to reduce VRAM footprint and increase throughput. AWQ and GPTQ quantisation supported natively.
| Spec | Detail |
| --- | --- |
| API format | OpenAI Chat Completions (v1) |
| Serving engine | vLLM / TensorRT-LLM |
| Cold start | < 2 seconds |
| Max context | 128k tokens (model dependent) |
| Streaming | SSE, token-by-token |
| Quantisation | FP8, INT8, INT4 (AWQ/GPTQ) |
| Multi-LoRA | Hot-swap LoRA adapters per request |
| Min replicas | 0 (scale to zero) |
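On the Multi-LoRA row: vLLM's own OpenAI-compatible server selects a registered LoRA adapter by passing the adapter's name in the `model` field. Assuming the gateway keeps that convention (not confirmed above), per-request hot-swap might look like this, with the adapter name and URL purely illustrative.

```python
# Assumes the gateway follows vLLM's convention of selecting a registered LoRA adapter
# via the model field; adapter name and base_url are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://gateway.cognicloud.example/v1",  # placeholder URL
                api_key="YOUR_COGNICLOUD_KEY")

resp = client.chat.completions.create(
    model="support-bot-lora",  # hypothetical adapter served on top of the base model
    messages=[{"role": "user", "content": "How do I rotate my API key?"}],
)
print(resp.choices[0].message.content)
```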
Inference Gateway is currently in development — estimated Q2 2026.
No published pricing yet. We currently offer tailored plans only.
CogniCloud is in active development. Join the waitlist to get early access and stay updated on our roadmap. No pricing yet — we'll work with each team to find the right fit.
No spam. No pricing pitches. We reach out personally to discuss your use case.