🍓 Ichigo: Rethinking Speech and Language Processing

A unified model that processes voice and text in a shared token space, eliminating the need for separate automatic speech recognition (ASR) or text-to-speech (TTS) pipelines. Ichigo handles speech natively to deliver seamless, real-time interactions.

Why It Matters

By representing speech and text in one shared token space, Ichigo reduces the latency and errors that accumulate across multiple conversion steps. This integrated approach enables more fluid, natural exchanges between spoken and written inputs.

Feature Highlights

Tokenized speech. Converts audio into tokens that LLMs can understand natively.
Early fusion. Inspired by Meta’s Chameleon, this approach lets voice and text interact immediately, reducing errors from sequential pipelines.
Lightweight design. Just 22M ASR parameters—optimized for real-time applications on edge devices.
End-to-end conversational AI. We’ve built a foundational system that listens and speaks as naturally as it reads and writes.
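
To make the shared token space concrete, here is a minimal sketch of how speech tokens and text tokens could live in one vocabulary and be fused into a single input sequence. It is an illustration only, not Ichigo's actual implementation: the `SharedVocab` class, the `fuse` function, the sound start/end marker tokens, and the vocabulary sizes are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SharedVocab:
    """Hypothetical vocabulary layout: text ids first, then speech codes,
    then two marker tokens that delimit a speech segment."""
    text_vocab_size: int       # assumed size of the text tokenizer's vocabulary
    speech_codebook_size: int  # assumed number of discrete audio codes

    @property
    def sound_start(self) -> int:
        return self.text_vocab_size + self.speech_codebook_size

    @property
    def sound_end(self) -> int:
        return self.sound_start + 1

    @property
    def size(self) -> int:
        return self.sound_end + 1

    def speech_id(self, code: int) -> int:
        """Map an audio codebook index to its id in the shared vocabulary."""
        if not 0 <= code < self.speech_codebook_size:
            raise ValueError(f"speech code out of range: {code}")
        return self.text_vocab_size + code


def fuse(vocab: SharedVocab, text_ids: Sequence[int],
         speech_codes: Sequence[int]) -> List[int]:
    """Early fusion: build one token sequence holding text and speech,
    so a single model attends over both from its first layer."""
    for t in text_ids:
        if not 0 <= t < vocab.text_vocab_size:
            raise ValueError(f"text id out of range: {t}")
    speech_ids = [vocab.speech_id(c) for c in speech_codes]
    return [*text_ids, vocab.sound_start, *speech_ids, vocab.sound_end]


vocab = SharedVocab(text_vocab_size=1000, speech_codebook_size=512)
sequence = fuse(vocab, text_ids=[5, 42], speech_codes=[0, 511])
print(sequence)  # [5, 42, 1512, 1000, 1511, 1513]
```

Because every id falls inside one vocabulary of `vocab.size` entries, a single embedding table can cover both modalities, which is the property that removes the hand-off between a separate ASR stage and the language model.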

By streamlining speech-language processing, Ichigo aims to reduce friction and open new possibilities for human-computer communication.

Want to see how it works under the hood? Read the research blog →
