Text-to-Speech · Open weights · 0.6B

Vikasit Voice

Text-to-speech. Natural voice generation, low-latency streaming.

Overview

Vikasit Voice is a compact text-to-speech model with natural prosody and low-latency streaming (~97 ms first-packet latency). It supports zero-shot voice cloning from a 3-second reference clip.

Specifications

Total parameters: 0.6B
Architecture: Multi-codebook TTS, 12Hz tokenizer
Context window: not specified
Modalities: Text in → speech out
License: Apache 2.0
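
To make the 12Hz tokenizer figure concrete, the sketch below estimates how many codec frames, and how many tokens, a clip of a given length produces. It takes the nominal 12 frames per second at face value. The number of codebooks is not stated on this page, so `num_codebooks` is a caller-supplied assumption and the value 8 used in the demo is only a placeholder.

```python
import math

FRAME_RATE_HZ = 12  # nominal tokenizer rate from the specifications above


def codec_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of codec frames needed to cover a clip of the given duration."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * frame_rate_hz)


def codec_tokens(duration_s: float, num_codebooks: int) -> int:
    """Total tokens if every frame carries one token per codebook.

    num_codebooks is an assumption supplied by the caller; this page
    does not state how many codebooks the model uses.
    """
    if num_codebooks < 1:
        raise ValueError("num_codebooks must be at least 1")
    return codec_frames(duration_s) * num_codebooks


if __name__ == "__main__":
    # A 3-second voice-cloning reference at the nominal frame rate.
    print("frames:", codec_frames(3.0))
    # 8 codebooks is a placeholder, not a published figure.
    print("tokens:", codec_tokens(3.0, num_codebooks=8))
```

A low frame rate keeps the token sequence short, which is one reason a tokenizer of this kind suits streaming.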

Capabilities

  • Natural, expressive speech synthesis
  • Zero-shot voice cloning (3s reference)
  • Streaming with ~97 ms first-packet latency
  • 10-language coverage: zh, en, ja, ko, de, fr, ru, pt, es, it (no Indic languages)
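
One way to check the first-packet latency claim on your own hardware is to time how long a streaming response takes to yield its first non-empty audio chunk. The sketch below works on any iterable of byte chunks. `fake_stream` is a hypothetical stand-in for a real streaming response, and the timer starts when iteration begins, so pass in a lazily evaluated stream whose request is sent on first read.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_packet_latency(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Consume a stream; return (seconds to first non-empty chunk, total bytes)."""
    start = time.perf_counter()
    first = None
    total = 0
    for chunk in chunks:
        if chunk and first is None:
            first = time.perf_counter() - start
        total += len(chunk)
    if first is None:
        raise ValueError("stream produced no audio")
    return first, total


def fake_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulated synthesis and network delay
        yield b"\x00" * 1024


if __name__ == "__main__":
    latency, size = first_packet_latency(fake_stream())
    print(f"first packet after {latency * 1000:.0f} ms, {size} bytes total")
```

Measured values depend on the GPU, the server, and the network path, so they will not necessarily match the ~97 ms figure quoted above.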

Benchmarks

Benchmark                   Score
Avg WER (10 langs)          1.84%
WER zh / en                 0.92% / 1.32%
Speaker similarity (SIM)    0.79
First-packet latency        ~97 ms
MOS / CMOS                  N/A
RTF                         N/A

Numbers are taken from the Qwen3-TTS Technical Report (arXiv:2601.15621) and the project's GitHub repository. MOS/CMOS and RTF are not published numerically; quality is instead characterized by WER, speaker similarity, and latency.
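
For readers unfamiliar with the WER metric used above, here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and the transcript recognized from the synthesized audio, divided by the number of reference words. It is not the evaluation pipeline behind the reported numbers; the lowercasing and whitespace tokenization are simplifying assumptions.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction; multiply by 100 for a percentage."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    score = wer("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {score:.2%}")
```

For languages written without spaces, such as Chinese, evaluations commonly score characters rather than whitespace-separated words, so this tokenization would need to change.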

Hardware & deployment

Precision    Memory
bf16         ~1.5 GB
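
As a rough sanity check on the memory figure, the weights alone can be estimated as parameter count times bytes per parameter. The sketch below does that arithmetic. Treating the gap between the result and ~1.5 GB as runtime overhead (activations, caches, audio codec components) is an assumption, and the non-bf16 entries in the table are illustrative byte widths only, not measured or supported configurations.

```python
# Bytes per parameter for common numeric formats (illustrative, not measured).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str = "bf16") -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    weights = weight_memory_gb(0.6e9, "bf16")
    print(f"bf16 weights: {weights:.1f} GB")
    print(f"remaining within ~1.5 GB: {1.5 - weights:.1f} GB")
```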

Quick start

Vikasit Voice is an open-weight model. Self-host it with any OpenAI-compatible inference server and call it with the OpenAI SDK as shown below.

OpenAI-compatible Python (self-hosted, e.g. vLLM)
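# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-local",  # placeholder; many self-hosted servers do not validate it
)

# Vikasit Voice produces audio, so use the speech endpoint, not chat completions.
# Assumes your server exposes the OpenAI-style /v1/audio/speech route; the voice
# name is a placeholder, so check which voices your deployment accepts.
with client.audio.speech.with_streaming_response.create(
    model="vikasit-voice",
    voice="default",
    input="Vikasit Voice is a compact text-to-speech model.",
    response_format="wav",
) as resp:
    resp.stream_to_file("vikasit_voice.wav")  # write audio to disk as it arrives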

Limitations

  • No Indic languages in the base model (10 languages supported)
  • No separately published 'HD' SKU; the upstream 1.7B variant is the top-quality model

Vikasit Voice FAQ

How much does Vikasit Voice cost?

Vikasit Voice is an open-weight model built on Qwen3-TTS (12Hz-0.6B, Apache 2.0). Self-hosting the weights is free under the Apache 2.0 license; you pay only for the hardware or cloud GPUs you run it on. A bf16 deployment needs roughly 1.5 GB of memory, as listed in the hardware section above.

Is Vikasit Voice open weight?

Yes. Vikasit Voice is built on Qwen3-TTS (12Hz-0.6B) and distributed under the Apache 2.0 license, so the weights are openly available for self-hosting, fine-tuning, and commercial use, subject to the upstream license terms.

How do I run Vikasit Voice?

Because Vikasit Voice is open weight, you can self-host it: load the Qwen3-TTS (12Hz-0.6B) weights into an OpenAI-compatible inference server (such as vLLM or SGLang), then call it with the OpenAI SDK by setting the base URL to your own endpoint.

What context window does Vikasit Voice support?

No context window is specified for Vikasit Voice. It is a 0.6B multi-codebook TTS model with a 12Hz tokenizer; full specifications are listed in the Specifications section above.

License & attribution

Apache 2.0

Built on Qwen3-TTS (12Hz-0.6B, Apache 2.0). Upstream copyright, license, and attribution notices are retained.