Real-Time Voice Conversation

Last updated: 2025-04-18

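Goal

Build a real-time voice conversation feature that supports multiple voices. You can use the cookbook code as a reference and integrate real-time voice into your own platform or application through AppBuilder-SDK.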

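How it works

The program runs in a loop: it captures the user's speech, converts the speech to text, sends the text to the Agent, and finally plays the reply aloud through TTS.

Use the large model's ASR for speech-to-text.
Use an Agent that you create yourself for the conversation, so that it fits your application scenario and understands context across turns.
Use the large model's TTS for text-to-speech and playback.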
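
The three stages above can be sketched as one conversation turn with each stage passed in as a callable. This is only an illustrative outline: `run_turn` and its parameter names are hypothetical and are not part of AppBuilder-SDK, and the stand-in lambdas replace the real microphone, ASR, Agent, and TTS calls.

```python
from typing import Callable, Optional


def run_turn(
    record: Callable[[], bytes],
    transcribe: Callable[[bytes], str],
    chat: Callable[[str], str],
    speak: Callable[[str], None],
) -> Optional[str]:
    """Run one record -> ASR -> Agent -> TTS turn; return the answer or None."""
    audio = record()
    query = transcribe(audio)
    if not query:
        # Nothing was recognized, so skip the Agent and TTS for this turn.
        return None
    answer = chat(query)
    speak(answer)
    return answer


if __name__ == "__main__":
    spoken = []
    answer = run_turn(
        record=lambda: b"\x00\x00",           # stand-in for microphone capture
        transcribe=lambda audio: "hello",      # stand-in for ASR
        chat=lambda query: f"echo: {query}",   # stand-in for the Agent
        speak=spoken.append,                   # stand-in for TTS playback
    )
    print(answer, spoken)
```

Keeping the stages behind plain callables makes it possible to exercise the loop without audio hardware or network access.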

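Prerequisites

Before using the built-in ASR and TTS components, activate the component services and purchase quota; see the guide on activating component services.
Install the pyaudio and webrtcvad packages with pip.
Grant the program access to the microphone.
Create your own Agent application.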
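
Some of these prerequisites can be checked before the chatbot is started. The following is a minimal sketch, assuming the import names `pyaudio`, `webrtcvad`, and `appbuilder` and the `APPBUILDER_TOKEN` environment variable used by the sample; the helper names `missing_packages` and `token_is_set` are hypothetical. It does not check microphone permission or service quota.

```python
import importlib.util
import os


def missing_packages(names):
    """Return the names in `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def token_is_set(env=os.environ):
    """Return True if APPBUILDER_TOKEN is present and non-empty."""
    return bool(env.get("APPBUILDER_TOKEN"))


if __name__ == "__main__":
    missing = missing_packages(["pyaudio", "webrtcvad", "appbuilder"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    if not token_is_set():
        print("APPBUILDER_TOKEN is not set")
```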

Sample code

# Copyright (c) 2024 Baidu, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import time
import wave
import sys
import pyaudio
import webrtcvad
import appbuilder
import re

# Create a key on the Qianfan AppBuilder website; for the steps see: https://cloud.baidu.com/doc/AppBuilder/s/Olq6grrt6#1%E3%80%81%E5%88%9B%E5%BB%BA%E5%AF%86%E9%92%A5
# Set the environment variable
os.environ["APPBUILDER_TOKEN"] = (
    "..."
)
# ID of a published AppBuilder application
app_id = "..."
appbuilder.logger.setLoglevel("ERROR")

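FORMAT = pyaudio.paInt16
# webrtcvad only accepts 16-bit mono PCM, so record a single channel on every platform.
CHANNELS = 1
RATE = 16000  # sample rate in Hz
DURATION = 30  # frame length in ms (webrtcvad accepts 10, 20, or 30 ms frames)
CHUNK = RATE // 1000 * DURATION  # samples per frame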


class Chatbot:
    def __init__(self):
        self.p = pyaudio.PyAudio()
        self.tts = appbuilder.TTS()
        self.asr = appbuilder.ASR()
        self.agent = appbuilder.AppBuilderClient(app_id)
        self.conversation_id = self.agent.create_conversation()

    def run(self):
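        # Spoken greeting, kept in Chinese for the TTS voice:
        # "I am your personal chatbot. If you have any questions, just ask me."
        self.run_tts_and_play_audio(
            "我是你的专属聊天机器人，如果你有什么问题，可以直接问我"
        )
        while True:
            # Record one utterance from the microphone
            audio_path = "output.wav"
            print("Recording audio...")
            if self.record_audio(audio_path) < 1000:
                # Less than 1 second of speech was captured; wait and listen again
                time.sleep(1)
                continue
            print("Recording finished")

            # ASR
            print("Running ASR...")
            query = self.run_asr(audio_path)
            print("ASR finished")

            # Agent
            print("query: ", query)
            if len(query) == 0:
                continue
            answer = self.run_agent(query)
            # Print any links separately and strip them so they are not read aloud
            results = re.findall(r"(https?://[^\s]+)", answer)
            for result in results:
                print("Link:", result)
                answer = answer.replace(result, "")
            print("answer:", answer)

            # TTS
            print("Running TTS and playing audio...")
            self.run_tts_and_play_audio(answer)
            print("TTS playback finished")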

    def record_audio(self, path):
        with wave.open(path, "wb") as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(self.p.get_sample_size(FORMAT))
            wf.setframerate(RATE)
            stream = self.p.open(
                format=FORMAT, channels=CHANNELS, rate=RATE, input=True
            )
            vad = webrtcvad.Vad(1)
            not_speech_times = 0
            speech_times = 0
            total_times = 0
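            start_up_times = 33 * 5  # initial window: about 5 s (33 frames of 30 ms is roughly 1 s)
            history_speech_times = 0
            while True:
                # Stop once roughly 10 seconds of speech have been captured
                if history_speech_times > 33 * 10:
                    break
                data = stream.read(CHUNK, exception_on_overflow=False)
                if vad.is_speech(data, RATE):
                    speech_times += 1
                    wf.writeframes(data)
                else:
                    not_speech_times += 1
                total_times += 1
                if total_times >= start_up_times:
                    history_speech_times += speech_times
                    # Stop if more than 70% of the frames in this window were silence
                    if float(not_speech_times) / float(total_times) > 0.7:
                        break
                    # Emulate a sliding window: reset the counters and start a new window
                    not_speech_times = 0
                    speech_times = 0
                    total_times = 0
                    # Halve the window each time, down to a minimum of about 1 s
                    start_up_times = start_up_times / 2
                    if start_up_times < 33:
                        start_up_times = 33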
            stream.close()
            return history_speech_times * DURATION

    def run_tts_and_play_audio(self, text: str):
        # Documentation for AppBuilder's built-in TTS component; adjust the parameters as described there: https://github.com/baidubce/app-builder/tree/master/python/core/components/tts
        msg = self.tts.run(
            appbuilder.Message(content={"text": text}),
            speed=5,
            pitch=5,
            volume=5,
            person=0,
            audio_type="pcm",
            model="paddlespeech-tts",
            stream=True,
        )
        stream = self.p.open(
            format=self.p.get_format_from_width(2),
            channels=1,
            rate=24000,
            output=True,
            frames_per_buffer=2048,
        )
        for pcm in msg.content:
            stream.write(pcm)
        stream.stop_stream()
        stream.close()

    # Documentation for AppBuilder's built-in ASR component; adjust the parameters as described there: https://github.com/baidubce/app-builder/blob/master/python/core/components/asr/README.md
    def run_asr(self, audio_path: str):
        with open(audio_path, "rb") as f:
            content_data = {"audio_format": "wav", "raw_audio": f.read(), "rate": 16000}
            msg = appbuilder.Message(content_data)
            out = self.asr.run(msg)
            text = out.content["result"][0]
            return text

    def run_agent(self, query):
        msg = self.agent.run(self.conversation_id, query, stream=True)
        answer = ""
        for content in msg.content:
            answer += content.answer
        return answer


if __name__ == "__main__":
    chatbot = Chatbot()
    chatbot.run()
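
The stop condition inside record_audio is easier to follow when it is replayed over a list of per-frame VAD results instead of a live microphone. The sketch below mirrors the sample's counters (a first window of about 5 seconds that halves down to about 1 second, a 70% silence threshold, and a cap of about 10 seconds of speech). The function name `speech_ms_before_stop` is hypothetical, and unlike the sample it also ends when the input list runs out.

```python
def speech_ms_before_stop(flags, frame_ms=30, frames_per_second=33):
    """Replay the recording loop's stop logic over per-frame VAD results.

    `flags` is an iterable of booleans (True means the frame contained speech).
    Returns (speech_ms, frames_consumed).
    """
    window = frames_per_second * 5  # first window: about 5 seconds
    speech = silence = total = 0
    history_speech = 0
    consumed = 0
    for is_speech in flags:
        if history_speech > frames_per_second * 10:
            break  # about 10 seconds of speech already captured
        consumed += 1
        if is_speech:
            speech += 1
        else:
            silence += 1
        total += 1
        if total >= window:
            history_speech += speech
            if silence / total > 0.7:
                break  # the window was mostly silence: stop recording
            # Start a new, shorter window (never shorter than about 1 second).
            speech = silence = total = 0
            window = max(window / 2, frames_per_second)
    return history_speech * frame_ms, consumed


if __name__ == "__main__":
    # About 5 seconds of speech followed by silence.
    print(speech_ms_before_stop([True] * 165 + [False] * 200))
```

Because speech is only added to the running total at the end of a window, the returned duration counts complete windows, which is also how the sample decides whether at least 1000 ms of speech was captured.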
            

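Usage

Run the program directly.

You can also replace any of the following modules with your own implementation or model:

record_audio: records audio
run_asr: speech recognition; see the AppBuilder ASR component documentation
run_agent: Agent conversation
run_tts_and_play_audio: generates speech for the reply and plays it; see the AppBuilder TTS component documentation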
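
If run_agent is replaced with another implementation, note that the sample's main loop removes URLs from the answer before passing it to TTS, so that links are printed but not read aloud. A minimal sketch of that step as a standalone helper follows; it uses the same regular expression as the sample, and the name `split_links` is hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s]+")


def split_links(answer):
    """Return (text_without_links, links) for an Agent answer."""
    links = URL_PATTERN.findall(answer)
    text = answer
    for link in links:
        # Remove each link from the text that will be spoken.
        text = text.replace(link, "")
    return text, links


if __name__ == "__main__":
    text, links = split_links("See https://example.com/a for details")
    print(repr(text), links)
```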

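AppBuilder TTS component parameters

Parameter	Type	Required	Description	Example
message	Message	Yes	Text to convert to speech	Message(content={"text": "text to synthesize"})
model	String	No	Defaults to baidu-tts. Allowed values: paddlespeech-tts, baidu-tts	paddlespeech-tts
speed	Integer	No	Speaking rate, from 0 to 15; the default is 5 (medium). Takes effect only with the baidu-tts model; ignored with paddlespeech-tts	5
pitch	Integer	No	Pitch, from 0 to 15; the default is 5 (medium). Takes effect only with the baidu-tts model; ignored with paddlespeech-tts	5
volume	Integer	No	Volume, from 0 to 15; the default is 5 (medium). Takes effect only with the baidu-tts model; ignored with paddlespeech-tts	5
person	Integer	No	Voice, default 0 (度小美). Standard voices: 0 (度小美), 1 (度小宇), 3 (度逍遥, basic), 4 (度丫丫). Premium voices: 5003 (度逍遥, premium), 5118 (度小鹿), 106 (度博文), 110 (度小童), 111 (度小萌), 103 (度米朵), 5 (度小娇). Top-tier voices: 4003 (度逍遥, emotional male voice), 4106 (度博文, professional male anchor), 4115 (度小贤, male radio host), 4119 (度小鹿, sweet female voice), 4105 (度灵儿, clear female voice), 4117 (度小乔, lively female voice), 4100 (度小雯, energetic female anchor), 4103 (度米朵, cute female voice), 4144 (度姗姗, entertainment female voice), 4278 (度小贝, knowledge female anchor), 4143 (度清风, male voice-over), 4140 (度小新, professional female anchor), 4129 (度小彦, knowledge male anchor), 4149 (度星河, male advertising voice), 4254 (度小清, female advertising voice), 4206 (度博文, male variety-show voice), 4226 (南方, female radio host). Takes effect only with the baidu-tts model; ignored with paddlespeech-tts	0
audio_type	String	No	Audio format. With baidu-tts: mp3 or wav. With paddlespeech-tts in non-streaming mode: wav only. With paddlespeech-tts in streaming mode: pcm only	wav
stream	Bool	No	Defaults to False. Currently paddlespeech-tts supports streaming output; baidu-tts does not	False
retry	Integer	No	Number of HTTP retries	3
timeout	Integer	No	HTTP timeout	5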
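
The constraints between model, audio_type, and stream in the table are easy to get wrong, so it can help to check a parameter combination before calling the component. The sketch below encodes only the rules stated in the table; `validate_tts_params` is a hypothetical helper and not part of AppBuilder-SDK, and the SDK may apply additional checks of its own.

```python
def validate_tts_params(audio_type, model="baidu-tts", stream=False,
                        speed=5, pitch=5, volume=5):
    """Raise ValueError if the combination contradicts the parameter table."""
    if model not in ("baidu-tts", "paddlespeech-tts"):
        raise ValueError(f"unknown model: {model}")
    if model == "baidu-tts":
        if stream:
            raise ValueError("baidu-tts does not support streaming output")
        if audio_type not in ("mp3", "wav"):
            raise ValueError("baidu-tts supports only mp3 or wav")
        # speed, pitch, and volume only take effect with baidu-tts.
        for name, value in (("speed", speed), ("pitch", pitch), ("volume", volume)):
            if not 0 <= value <= 15:
                raise ValueError(f"{name} must be between 0 and 15")
    else:
        # paddlespeech-tts: pcm when streaming, wav otherwise.
        expected = "pcm" if stream else "wav"
        if audio_type != expected:
            raise ValueError(
                f"paddlespeech-tts with stream={stream} requires audio_type={expected}"
            )


if __name__ == "__main__":
    # The combination used by the sample code: streaming PCM from paddlespeech-tts.
    validate_tts_params("pcm", model="paddlespeech-tts", stream=True)
    print("parameters are consistent with the table")
```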
