StepFun ships StepAudio 2.5 Realtime
- StepFun released StepAudio 2.5 Realtime on May 24, describing it as an end-to-end real-time speech model for live conversational interactions. (marktechpost.com)
- StepFun’s documentation highlights customizable personas, paralinguistic perception, WebSocket delivery, and pricing of 10 yuan per 1 million uncached input tokens. (platform.stepfun.com)
- Developers can access the model through StepFun’s realtime endpoint and platform documentation, which includes session configuration and streaming audio examples. (platform.stepfun.com)
StepFun released StepAudio 2.5 Realtime on May 24, adding a new real-time speech model to its platform that is designed for live voice interactions rather than batch transcription or one-shot synthesis. The Shanghai-based company describes the system as an end-to-end speech large language model, meaning audio is processed and generated within one system instead of being split across separate speech recognition, reasoning and text-to-speech stages. (marktechpost.com) MarkTechPost, which reviewed the launch, said the model supports Chinese and English and is delivered through a WebSocket API. (platform.stepfun.com)

StepFun’s own documentation pitches the model around “persona” control and paralinguistic understanding. The platform page says the model is built for two-way realtime speech, supports custom voice cloning, and can incorporate cues such as hesitation, laughter and sighs into both understanding and response generation. (platform.stepfun.com)

### Why is StepFun emphasizing persona instead of just speech quality?

StepFun’s platform page says StepAudio 2.5 Realtime offers “fully customizable” personas, including personality traits, verbal habits and emotional boundaries. The company frames that as part of making the model feel more like a live conversational partner than a neutral assistant.

MarkTechPost reported that StepFun tied the model to roleplay-specific reinforcement learning from human feedback, or RLHF, aimed at reducing “out-of-character” drift during conversations. (marktechpost.com) The report said StepFun started with more than 10,000 authored personas and used algorithmic augmentation to expand that into a million-scale persona feature set for training. (platform.stepfun.com)

### What does “paralinguistic” understanding mean here?

StepFun says the model is designed to detect non-verbal acoustic signals, including hesitation and light laughter, without requiring the user to state emotions directly.
The documentation says those signals are used to produce responses that better match the speaker’s tone and conversational context. (platform.stepfun.com)

MarkTechPost described paralinguistics as speech information such as tone, pace, pauses, sighs and laughter. In its account of the launch, the publication said StepFun positioned that capability as a way to infer mood and intent from how something is said, not only from the words themselves. (marktechpost.com)

### How is the product delivered to developers?

StepFun’s documentation says the model is available through a bidirectional WebSocket endpoint at `/v1/realtime`, with session configuration handled through `session.update` events. The example workflow shows developers streaming audio frames into an input buffer while server-side voice activity detection triggers inference and returns audio incrementally through `response.audio.delta` events. (platform.stepfun.com)

The same page lists pricing of 10 yuan per 1 million uncached input tokens, 2 yuan per 1 million cached input tokens and 70 yuan per 1 million output tokens. StepFun also says the model can be used through its Step Plan subscription path. (marktechpost.com)

### How does this fit with StepFun’s earlier audio work?

StepFun’s documentation says StepAudio 2.5 Realtime inherits capabilities from StepAudio 2.5 TTS, its contextual text-to-speech model. That earlier product is described as combining global context controls for overall delivery with inline controls for sentence-level emotional and prosodic detail, plus zero-shot voice cloning from short reference audio.

StepFun’s GitHub repository for Step-Audio 2 shows the company has also been developing broader audio models and benchmarks around paralinguistic understanding and tool calling. (platform.stepfun.com) The repository describes Step-Audio 2 as an end-to-end multimodal model for audio understanding and speech conversation.

### What is the practical takeaway for voice products?
StepFun’s product page lists use cases including emotional companionship, daily conversation, question answering and task assistants. (platform.stepfun.com) Those examples place the model in interactive voice surfaces where timing, tone and persona are part of the product, not just add-ons to recognition accuracy.

The next step for developers is already spelled out in StepFun’s documentation: the realtime endpoint, session schema and sample code are live on the company’s platform page, alongside links to the broader voice-model catalog and pricing details. (platform.stepfun.com) (github.com)
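
For readers who want to see how the pieces named in the delivery section fit together, the Python sketch below builds a `session.update` message, gathers audio from `response.audio.delta` events, and estimates cost from the quoted prices. It is a sketch under stated assumptions, not StepFun’s sample code: the event names and prices come from the reporting above, but the `session` and `instructions` payload layout and the base64 `delta` field are hypothetical placeholders, and no network connection is made.

```python
import base64
import json

# Prices quoted in the article, in yuan per 1 million tokens.
PRICE_PER_MILLION = {"input_uncached": 10.0, "input_cached": 2.0, "output": 70.0}


def session_update(instructions: str) -> str:
    """Serialize a `session.update` event.

    Assumption: the nested `session` / `instructions` keys are illustrative;
    the article only names the event type.
    """
    return json.dumps(
        {"type": "session.update", "session": {"instructions": instructions}}
    )


def collect_audio(events) -> bytes:
    """Concatenate audio carried by `response.audio.delta` events.

    Assumption: each such event holds base64-encoded audio in a `delta`
    field. Events of any other type are skipped.
    """
    chunks = []
    for raw in events:
        event = json.loads(raw)
        if event.get("type") == "response.audio.delta":
            chunks.append(base64.b64decode(event["delta"]))
    return b"".join(chunks)


def estimate_cost_yuan(uncached_in: int, cached_in: int, out: int) -> float:
    """Estimate cost in yuan from token counts, using the quoted prices."""
    total = (
        uncached_in * PRICE_PER_MILLION["input_uncached"]
        + cached_in * PRICE_PER_MILLION["input_cached"]
        + out * PRICE_PER_MILLION["output"]
    )
    return total / 1_000_000
```

Under these assumptions, a session that consumes 1 million uncached input tokens, 500,000 cached input tokens and 200,000 output tokens would cost 10 + 1 + 14 = 25 yuan. Field names should be checked against StepFun’s documentation before use.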