
IonRouter is a high-throughput, low-cost inference platform designed to serve any AI model at a fraction of the market rate. Its drop-in OpenAI-compatible API gives teams access to leading open models for large language model (LLM), vision, video, and text-to-speech (TTS) applications, while its custom inference engine, IonAttention, built specifically for NVIDIA Grace Hopper, reduces both cost and latency.
Teams can deploy agents, multi-modal applications, and fine-tuned models on IonRouter's fleet while the platform handles optimization and scaling in the background. Custom models, LoRAs, and any open-source model are supported, with dedicated GPU streams and per-second billing for greater flexibility and efficiency.
Under the hood, IonRouter is a middleware layer that routes each request to the most suitable model for the workload. IonAttention multiplexes several models on a single Grace Hopper GPU, reducing latency and increasing throughput, and the router adapts dynamically to traffic patterns so resources stay fully utilized; the toy sketch below illustrates the routing idea.
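The following is an illustrative sketch only, not IonRouter's actual dispatcher: the per-modality pools, model groupings, and the in-flight-count load metric are all hypothetical stand-ins for whatever the real routing layer does.

```python
# Illustrative only: a toy dispatcher in the spirit of IonRouter's routing
# layer. Pool contents and the load metric are hypothetical.
from dataclasses import dataclass

@dataclass
class Stream:
    model: str
    in_flight: int = 0  # requests currently being served on this stream

POOLS: dict[str, list[Stream]] = {
    "text":  [Stream("GLM-5"), Stream("Kimi-K2.5")],
    "video": [Stream("Wan2.2")],
    "image": [Stream("Flux Schnell")],
}

def route(modality: str) -> Stream:
    """Send the request to the least-loaded stream for its modality."""
    stream = min(POOLS[modality], key=lambda s: s.in_flight)
    stream.in_flight += 1
    return stream
```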
Because the API is OpenAI-compatible, integrating IonRouter into an existing workflow usually requires nothing more than a simple API change (see the snippet below). Teams can then run complex applications such as robotics perception, real-time video analysis, game asset generation, and AI video pipelines; the table that follows lists representative workloads.
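A minimal sketch of that change, using the official `openai` Python client. The base URL, API key, and exact model identifier are placeholders, not confirmed IonRouter values:

```python
# Hypothetical drop-in swap: point the standard OpenAI client at IonRouter.
# The base_url and model id below are illustrative, not confirmed endpoints.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ionrouter.example/v1",  # assumed endpoint
    api_key="YOUR_IONROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="GPT-OSS-120B",  # any hosted model from the table further down
    messages=[{"role": "user", "content": "Summarize this log line: ..."}],
)
print(response.choices[0].message.content)
```

The same call pattern applies to the other hosted models listed in the pricing table below.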
| Use Case | Description |
|---|---|
| Robotics Perception | High-performance vision-language models for real-time robotic decision-making |
| Surveillance | Multi-stream video analysis for security and monitoring systems |
| Game Asset Generation | On-demand creation of game assets using AI models |
| AI Video Pipelines | Efficient processing of text-to-video and image-to-video content |
| Multi-modal Apps | Integration of LLMs, vision, and audio models in a single application |
| Fine-tuned Models | Deployment of custom models with dedicated GPU resources |
| Low-Cost Inference | Pay-per-token pricing with no idle costs and reduced latency |
Hosted models, with indicative speed and pricing (each is also available in the IonRouter playground):

| Model | Speed | Cost |
|---|---|---|
| GLM-5 | ~220 tok/s | $1.20 in · $3.50 out |
| Kimi-K2.5 | ~120 tok/s | $0.20 in · $1.60 out |
| MiniMax-M2.5 | ~120 tok/s | $0.40 in · $1.50 out |
| Qwen3.5-122B-A10B | ~120 tok/s | $0.20 in · $1.60 out |
| GPT-OSS-120B | ~100 tok/s | $0.020 in · $0.095 out |
| Wan2.2 Text-to-Video | ~8 s/clip | $0.00194 per GPU·sec |
| Flux Schnell | ~3 s/image | ~$0.005 per image |
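As a quick sanity check on the per-GPU-second rate, and assuming a Wan2.2 clip occupies a single GPU for its full ~8 s render, one clip works out to roughly 8 × $0.00194 ≈ $0.016:

```python
# Back-of-the-envelope clip cost (assumes one GPU per Wan2.2 render).
rate_per_gpu_sec = 0.00194  # $ / GPU-second, from the table above
render_seconds = 8          # ~8 s per clip, from the table above
print(f"~${rate_per_gpu_sec * render_seconds:.4f} per clip")  # ~$0.0155
```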