Voice Technology
Core Technology

Voice API

Text-to-Speech · Speech-to-Text

Native-level speech synthesis and real-time streaming speech recognition.
The core technology powering AI Agents, chatbots, education, conversation, and more.

Core Technology

Why Core Technology?

Voice API is not just voice conversion. It is the foundational infrastructure that determines the user experience of AI services.

🏗

Service Infrastructure

AI Agents, chatbots, conversation training, educational content, browser extensions — every service that needs voice runs on this API. API quality equals service quality.

Real-time Streaming Required

In a 1:1 AI conversation, the dialogue feels unnatural once response latency exceeds 1 second. We target <500 ms latency with WebSocket-based streaming.

🔗

Internal + External API

Beyond internal service integration, this is an independent technology asset that can be monetized by providing APIs to external clients.

Architecture

Service Architecture

💬

talking.how

AI Conversation

🗣

native.how

TTS B2C

🤖

AI Agent

Voice Agent

📚

loa.bot etc.

Chatbot / Education

API Call

native.how / API

REST API + WebSocket Streaming

/api/v1/tts
/api/v1/tts/stream
/api/v1/stt/stream

Wrapping

☁️

Google Cloud TTS / STT API

Seoul Region (Minimal Latency)

Text-to-Speech

TTS Technical Specs

🎙

Neural TTS

Based on WaveNet / Neural2. Naturally reproduces human intonation, emotion, and rhythm.

📡

TTS Streaming

Real-time chunk-based delivery. Even long texts start playing immediately, minimizing wait time.
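A minimal sketch of why chunked delivery matters: the client feeds each audio chunk to the player as it arrives, so playback begins on the first chunk rather than after the full file downloads. The chunk source here is a stand-in for the `/api/v1/tts/stream` response, not the actual client library.

```python
import io

def play_stream(chunks, player):
    """Feed audio chunks to a player as they arrive,
    instead of waiting for the complete file."""
    total = 0
    for chunk in chunks:
        player.write(chunk)  # playback can start on the very first chunk
        total += len(chunk)
    return total

# Fake chunk source standing in for the streaming response:
fake_chunks = [b"\x00" * 320] * 5   # e.g. five short PCM frames
buffer = io.BytesIO()               # stands in for an audio output device
received = play_stream(fake_chunks, buffer)
```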

🌍

100+ Languages

Korean, English, Japanese, Chinese, and 100+ languages. Various voice styles per language.

🎭

Voice Customization

Speed, pitch, volume control. SSML support for emphasis, pauses, and precise pronunciation control.
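As a sketch of the SSML controls above, the helper below wraps plain text in W3C SSML elements (`prosody`, `break`); the helper itself and its defaults are illustrative, though the element names come from the SSML standard that neural TTS engines commonly support.

```python
def wrap_ssml(text, rate="100%", pitch="+0st", pause_ms=None):
    """Wrap plain text in SSML with basic prosody controls.
    Uses standard W3C SSML elements; per-voice support may vary."""
    body = f'<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
    if pause_ms is not None:
        body += f'<break time="{pause_ms}ms"/>'  # insert a pause after the phrase
    return f"<speak>{body}</speak>"

# Slow the phrase slightly, raise pitch two semitones, pause 300 ms after:
ssml = wrap_ssml("Welcome back.", rate="90%", pitch="+2st", pause_ms=300)
```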

📄

Multiple Input Formats

Automatic parsing of text, PDF, and webpage URLs. SSML markup also supported.

🔊

Multiple Output Formats

MP3, WAV, OGG, FLAC, and more. Configurable bitrate and sample rate.

Speech-to-Text

STT Technical Specs

🎤

Real-time STT Streaming

WebSocket-based real-time speech recognition. Text appears instantly as you speak.

🔄

Interim Results

Real-time delivery of intermediate recognition results. Response preparation can begin before the user finishes speaking.
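One way a client can consume interim results: keep a committed (final) transcript plus the latest interim hypothesis, where each new interim result replaces the previous one and a final result is appended permanently. The message shape (`is_final`, `transcript`) is an assumption for illustration, not the published wire format.

```python
def apply_result(state, msg):
    """Merge one streaming STT message into (final_text, interim_text).
    Interim results overwrite each other; final results are committed."""
    final, interim = state
    if msg["is_final"]:
        return (final + msg["transcript"], "")
    return (final, msg["transcript"])  # latest interim replaces the previous one

state = ("", "")
messages = [
    {"is_final": False, "transcript": "hel"},
    {"is_final": False, "transcript": "hello wor"},
    {"is_final": True,  "transcript": "hello world "},
    {"is_final": False, "transcript": "how"},
]
for m in messages:
    state = apply_result(state, m)
```

Because the interim text is available before the utterance is final, a downstream LLM can start preparing a response early.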

🧠

AI Post-processing

Automatic punctuation, word correction, and speaker diarization support.

🔇

VAD (Voice Activity Detection)

Automatic voice segment detection. Maximizes efficiency by reducing unnecessary processing during silent periods.

📊

Confidence Score

A confidence score is provided with each recognition result, enabling re-confirmation logic for low-confidence segments.
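The re-confirmation idea can be sketched as a simple filter: collect the segments whose confidence falls below a threshold so the application can ask the user to repeat or confirm them. The result shape and the 0.80 threshold are assumptions for illustration.

```python
def needs_confirmation(results, threshold=0.80):
    """Return the transcripts whose confidence is below the threshold,
    i.e. the segments worth re-confirming with the user."""
    return [r["transcript"] for r in results if r["confidence"] < threshold]

results = [
    {"transcript": "book a table for two", "confidence": 0.96},
    {"transcript": "at seven pm",          "confidence": 0.62},
]
low = needs_confirmation(results)
```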

🎯

Context Hints

Pre-specify domain terminology and proper nouns to improve recognition accuracy.
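A sketch of how context hints might be attached to a streaming STT configuration. The field names (`speech_contexts`, `phrases`, `boost`) mirror the phrase-hint pattern found in speech-adaptation APIs such as Google Cloud Speech-to-Text, but are not a confirmed schema for this API.

```python
def stt_config(language, hints=(), boost=10.0):
    """Build a streaming STT config with optional context phrases
    to bias recognition toward domain terms and proper nouns."""
    cfg = {"language_code": language}
    if hints:
        cfg["speech_contexts"] = [{"phrases": list(hints), "boost": boost}]
    return cfg

# Bias recognition toward service names the user is likely to say:
cfg = stt_config("ko-KR", hints=["loa.bot", "native.how"])
```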

Streaming Pipeline

Real-time Voice Pipeline

The core of real-time voice interaction for AI conversation, voice agents, and more. Targeting total pipeline latency < 1 second.

🎤

User Voice

Mic Input

📡

STT Stream

Real-time Recognition

🧠

LLM Processing

Response Generation

🗣

TTS Stream

Speech Synthesis

🔊

Speaker Output

AI Response

Total Pipeline Target: < 1 second
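The <1 second target can be framed as a per-stage latency budget across the pipeline above. The stage names and millisecond figures below are illustrative assumptions, not measured numbers; the point is that the stages must sum within the end-to-end target.

```python
# Hypothetical per-stage budget (ms) for the STT → LLM → TTS pipeline.
BUDGET_MS = {
    "stt_stream": 200,       # speech recognized as the user finishes speaking
    "llm_first_token": 400,  # time to the LLM's first response token
    "tts_first_chunk": 300,  # time to the first synthesized audio chunk
    "network": 100,          # round trips and transport overhead
}

def within_target(budget, target_ms=1000):
    """Sum the stage budgets and check them against the end-to-end target."""
    total = sum(budget.values())
    return total, total <= target_ms

total, ok = within_target(BUDGET_MS)
```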
API Endpoints

REST API + WebSocket

POST /api/v1/tts
POST /api/v1/tts/stream
POST /api/v1/stt
WS /api/v1/stt/stream
GET /api/v1/voices
GET /api/v1/languages
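A minimal sketch of what a request body for `POST /api/v1/tts` could look like. The field names, defaults, and the voice identifier are assumptions for illustration, not the published request schema.

```python
import json

def tts_request(text, voice="ko-KR-Neural2-A", fmt="mp3", speed=1.0):
    """Build an illustrative JSON body for POST /api/v1/tts.
    Field names and the voice id are assumed, not the actual schema."""
    return json.dumps({
        "text": text,
        "voice": voice,
        "audio_format": fmt,     # e.g. mp3, wav, ogg, flac
        "speaking_rate": speed,  # 1.0 = normal speed
    })

body = tts_request("Hello, world")
parsed = json.loads(body)
```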

Want to learn more about Voice API?

We provide consultation on API integration, custom development, and technology partnerships.

Contact Us