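API Documentation

Overview

This document describes the modules and API interfaces of the real-time speech-to-text system.

Core Modules

1. RealTimeVTT (main application class)

Location: src/realtime_vtt.py

Class definition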

class RealTimeVTT:
    def __init__(self)
    def initialize(self) -> bool
    def run_interactive(self)
    def list_audio_devices(self) -> List[Dict]
    def cleanup(self)

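Methods

`__init__()`

Creates the application instance, including its configuration object and components.

`initialize() -> bool`

Initializes all components of the application.

Returns:

bool: True if initialization succeeded, False otherwise

Performs the following:

Initializes the audio processor
Initializes the speech recognizer
Registers the callback functions
Creates the output directory

`run_interactive()`

Starts an interactive speech recognition session.

Performs the following:

Starts recording
Starts the recognition loop
Handles user input
Displays recognition results

`list_audio_devices() -> List[Dict]`

Returns the list of available audio devices.

Returns: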

[
    {
        'index': int,        # device index
        'name': str,         # device name
        'channels': int,     # number of channels
        'sample_rate': int   # sample rate
    }
]
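
As a small illustration of consuming this return value, the sketch below selects a device by a substring of its name. The helper `pick_device` and the sample entries are hypothetical; only the dictionary keys come from the structure documented above.

```python
from typing import Dict, List, Optional


def pick_device(devices: List[Dict], name_hint: str) -> Optional[Dict]:
    """Return the first device whose name contains name_hint (case-insensitive)."""
    hint = name_hint.lower()
    for device in devices:
        if hint in device["name"].lower():
            return device
    return None


# Sample data in the documented shape; real values would come from
# RealTimeVTT.list_audio_devices().
sample_devices = [
    {"index": 0, "name": "Built-in Microphone", "channels": 2, "sample_rate": 44100},
    {"index": 1, "name": "USB Headset", "channels": 1, "sample_rate": 16000},
]

chosen = pick_device(sample_devices, "usb")
print(chosen["index"] if chosen else "no match")
```

Returning None instead of raising keeps the decision about a fallback device with the caller.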

`cleanup()`

Releases resources and stops recording and recognition.

2. SpeechRecognizer (speech recognizer)

Location: src/speech_recognizer.py

Class definition

class SpeechRecognizer:
    def __init__(self, config: ModelConfig)
    def initialize(self) -> bool
    def create_stream(self)
    def process_audio(self, audio_data: np.ndarray)
    def set_result_callback(self, callback)
    def set_partial_result_callback(self, callback)
    def cleanup(self)

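Methods

`__init__(config: ModelConfig)`

Creates the speech recognizer.

Parameters:

config: ModelConfig instance containing the model configuration

`initialize() -> bool`

Initializes the recognizer and the model.

Returns:

bool: True if initialization succeeded, False otherwise

`create_stream()`

Creates a recognition stream.

Returns:

The recognition stream object

`process_audio(audio_data: np.ndarray)`

Processes audio data and runs recognition on it.

Parameters:

audio_data: numpy array of audio samples

`set_result_callback(callback)`

Sets the callback for final recognition results.

Parameters:

callback: callback function with the signature callback(result: RecognitionResult)

`set_partial_result_callback(callback)`

Sets the callback for partial recognition results.

Parameters:

callback: callback function with the signature callback(result: str)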
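
The two callbacks can share state by being methods of one object. The following is a minimal sketch under stated assumptions: `TranscriptCollector` is a hypothetical helper, and the `RecognitionResult` dataclass is a stand-in carrying only the documented fields so the example runs without the real module.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecognitionResult:
    """Stand-in with the documented fields; the real class is in src/speech_recognizer.py."""
    text: str
    timestamp: float
    is_final: bool = True


class TranscriptCollector:
    """Keeps every final result and the most recent partial text."""

    def __init__(self) -> None:
        self.final_results: List[RecognitionResult] = []
        self.latest_partial: str = ""

    def on_result(self, result: RecognitionResult) -> None:
        self.final_results.append(result)
        self.latest_partial = ""  # a final result supersedes the pending partial

    def on_partial(self, text: str) -> None:
        self.latest_partial = text

    def transcript(self) -> str:
        return " ".join(r.text for r in self.final_results)


collector = TranscriptCollector()
# With a real recognizer these would be registered via
# recognizer.set_result_callback(collector.on_result) and
# recognizer.set_partial_result_callback(collector.on_partial).
collector.on_partial("hello wor")
collector.on_result(RecognitionResult("hello world", 1.5))
collector.on_result(RecognitionResult("second sentence", 4.0))
print(collector.transcript())
```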

3. AudioProcessor (audio processor)

Location: src/audio_processor.py

Class definition

class AudioProcessor:
    def __init__(self, config: AudioConfig)
    def initialize(self) -> bool
    def start_recording(self, callback)
    def stop_recording(self)
    def get_device_list(self) -> List[Dict]
    def cleanup(self)

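Methods

`__init__(config: AudioConfig)`

Creates the audio processor.

Parameters:

config: AudioConfig instance containing the audio configuration

`initialize() -> bool`

Initializes the audio device.

Returns:

bool: True if initialization succeeded, False otherwise

`start_recording(callback)`

Starts recording.

Parameters:

callback: audio data callback with the signature callback(audio_data: np.ndarray)

`stop_recording()`

Stops recording.

`get_device_list() -> List[Dict]`

Returns the list of audio devices.

Returns: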

[
    {
        'index': int,
        'name': str,
        'max_input_channels': int,
        'default_sample_rate': float
    }
]
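
Note that these keys differ from those returned by RealTimeVTT.list_audio_devices(). The sketch below filters the list down to devices that can capture audio, assuming the list may also contain output-only devices (the source does not say whether it does). The helper `input_devices` and the sample entries are hypothetical.

```python
from typing import Dict, List


def input_devices(devices: List[Dict]) -> List[Dict]:
    """Keep only devices that report at least one input channel."""
    return [d for d in devices if d["max_input_channels"] > 0]


# Sample data in the documented shape; real values would come from
# AudioProcessor.get_device_list().
sample = [
    {"index": 0, "name": "Speakers", "max_input_channels": 0, "default_sample_rate": 48000.0},
    {"index": 1, "name": "USB Microphone", "max_input_channels": 1, "default_sample_rate": 16000.0},
]

for d in input_devices(sample):
    print(f"{d['index']}: {d['name']} ({int(d['default_sample_rate'])} Hz)")
```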

4. ModelDownloader (model downloader)

Location: src/model_downloader.py

Class definition

class ModelDownloader:
    def __init__(self, config: ModelConfig)
    def download_model(self, model_name: str, force: bool = False)
    def list_available_models(self) -> Dict
    def get_model_status(self) -> Dict
    def interactive_download(self)

Methods

`download_model(model_name: str, force: bool = False)`

Downloads the specified model.

Parameters:

model_name: name of the model
force: whether to force a re-download

`list_available_models() -> Dict`

Returns the available models.

Returns:

{
    'model_key': {
        'name': str,
        'description': str,
        'size': str,
        'url': str
    }
}
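
A dictionary of this shape is easy to turn into a readable listing. The sketch below is only an illustration: `format_models` is a hypothetical helper, and the sample entry (including its URL) is made up rather than taken from the project.

```python
from typing import Dict


def format_models(models: Dict[str, Dict[str, str]]) -> str:
    """Render one line per model, sorted by model key."""
    lines = []
    for key in sorted(models):
        info = models[key]
        lines.append(f"{key}: {info['name']} ({info['size']}) - {info['description']}")
    return "\n".join(lines)


# Hypothetical entry in the documented shape.
sample_models = {
    "example-small": {
        "name": "Example Small",
        "description": "hypothetical entry for illustration",
        "size": "50MB",
        "url": "https://example.com/example-small.tar.bz2",
    }
}

print(format_models(sample_models))
```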

`interactive_download()`

Downloads a model interactively.

Data Structures

RecognitionResult

Location: src/speech_recognizer.py

class RecognitionResult:
    def __init__(self, text: str, timestamp: float, is_final: bool = True)
    
    # Attributes
    text: str           # recognized text
    timestamp: float    # timestamp
    is_final: bool      # whether this is a final result
    confidence: float   # confidence score

Methods

`to_dict() -> Dict`

Converts the result to a dictionary.

Returns:

{
    'text': str,
    'timestamp': float,
    'is_final': bool,
    'confidence': float
}

`__str__() -> str`

Returns a formatted string representation.
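
To show how the dictionary form can be used, here is a minimal sketch that serializes a result as one JSON line. The dataclass is a stand-in for the real class, and the default of 0.0 for `confidence` is an assumption, since the source does not state how that attribute is set.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RecognitionResult:
    """Stand-in mirroring the documented attributes."""
    text: str
    timestamp: float
    is_final: bool = True
    confidence: float = 0.0  # assumed default; not stated in the source

    def to_dict(self) -> dict:
        # asdict() keeps the fields in declaration order.
        return asdict(self)


result = RecognitionResult("hello", 12.5)
# ensure_ascii=False keeps non-ASCII transcripts readable in the output.
line = json.dumps(result.to_dict(), ensure_ascii=False)
print(line)
```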

RecognitionSession

Location: src/speech_recognizer.py

class RecognitionSession:
    def __init__(self)
    
    # Attributes
    results: List[RecognitionResult]  # list of recognition results
    start_time: float                 # session start time
    is_active: bool                   # whether the session is active

Methods

`add_result(result: RecognitionResult)`

Adds a recognition result.

`get_duration() -> float`

Returns the duration of the session.

`to_dict() -> Dict`

Converts the session to a dictionary.
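
The interface above could be implemented roughly as follows. This is a sketch, not the project's code: results are plain dictionaries here, the duration is assumed to be measured in seconds from construction, and the key names produced by `to_dict()` are assumptions because the source does not list them.

```python
import time
from typing import Dict, List


class RecognitionSession:
    """Sketch of the documented interface; results are plain dicts here."""

    def __init__(self) -> None:
        self.results: List[Dict] = []
        self.start_time: float = time.time()
        self.is_active: bool = True

    def add_result(self, result: Dict) -> None:
        self.results.append(result)

    def get_duration(self) -> float:
        # Seconds elapsed since the session was created.
        return time.time() - self.start_time

    def to_dict(self) -> Dict:
        # Key names are assumptions; the source does not specify them.
        return {
            "start_time": self.start_time,
            "duration": self.get_duration(),
            "is_active": self.is_active,
            "results": list(self.results),
        }


session = RecognitionSession()
session.add_result({"text": "hello", "timestamp": 0.8, "is_final": True})
summary = session.to_dict()
print(len(summary["results"]), summary["is_active"])
```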

Configuration Classes

ModelConfig

Location: src/config.py

class ModelConfig:
    # Model file paths
    model_dir: Path
    tokens: str
    encoder: str
    decoder: str
    joiner: str
    
    # Speech recognition parameters
    sample_rate: int = 16000
    feature_dim: int = 80
    num_threads: int = 1
    
    # Endpoint detection parameters
    enable_endpoint: bool = True
    enable_endpoint_detection: bool = True
    rule1_min_trailing_silence: float = 2.4
    rule2_min_trailing_silence: float = 1.2
    rule3_min_utterance_length: int = 300
    
    # Decoding method
    decoding_method: str = "greedy_search"
    max_active_paths: int = 4
    provider: str = "cpu"

Methods

`validate_model_files() -> List[str]`

Checks whether the model files exist.

Returns:

List[str]: paths of the files that are missing
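
A check of this kind can be written with pathlib. The sketch below is a standalone function rather than the actual method; `find_missing_files` and the file names `tokens.txt` and `encoder.onnx` are hypothetical and used only for the demonstration.

```python
import tempfile
from pathlib import Path
from typing import List


def find_missing_files(model_dir: Path, filenames: List[str]) -> List[str]:
    """Return the full paths of the files that do not exist under model_dir."""
    return [
        str(model_dir / name)
        for name in filenames
        if not (model_dir / name).is_file()
    ]


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    # Create only one of the two files so the other is reported as missing.
    (model_dir / "tokens.txt").write_text("a 0\n")
    missing = find_missing_files(model_dir, ["tokens.txt", "encoder.onnx"])
    print(missing)
```

An empty list means all files are present, so callers can use the result directly in a condition.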

AudioConfig

Location: src/config.py

class AudioConfig:
    sample_rate: int = 16000      # sample rate
    chunk_size: int = 1024        # audio chunk size
    channels: int = 1             # number of channels
    format: Any = None            # audio format
    samples_per_read: int         # samples per read

AppConfig

Location: src/config.py

class AppConfig:
    show_partial_results: bool = True     # show partial results
    show_timestamps: bool = True          # show timestamps
    log_level: str = "INFO"               # log level
    log_file: Path                        # log file path
    output_file: Path                     # output file path
    save_to_file: bool = True             # save results to a file

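Callback Interfaces

Audio data callback

def audio_callback(audio_data: np.ndarray) -> None:
    """
    Audio data callback.
    
    Args:
        audio_data: audio data as a numpy array of shape (samples,)
    """
    pass

Recognition result callback

def result_callback(result: RecognitionResult) -> None:
    """
    Final recognition result callback.
    
    Args:
        result: recognition result object
    """
    pass

Partial recognition result callback

def partial_result_callback(text: str) -> None:
    """
    Partial recognition result callback.
    
    Args:
        text: partially recognized text
    """
    pass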
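
If the audio callback is invoked from a capture thread (the source does not say which thread calls it), a common pattern is to keep the callback short and hand chunks to a worker through a queue. The following is a minimal sketch of that pattern using plain lists in place of numpy arrays; the worker body is a stand-in for a call such as recognizer.process_audio(chunk).

```python
import queue
import threading
from typing import List, Optional, Sequence

audio_queue: queue.Queue = queue.Queue()
processed_sizes: List[int] = []


def audio_callback(audio_data: Sequence[float]) -> None:
    """Capture side: enqueue the chunk and return immediately."""
    audio_queue.put(audio_data)


def worker() -> None:
    """Consumer side: process chunks in arrival order until the sentinel arrives."""
    while True:
        chunk: Optional[Sequence[float]] = audio_queue.get()
        if chunk is None:  # sentinel value: stop the worker
            break
        processed_sizes.append(len(chunk))  # stand-in for the real processing


thread = threading.Thread(target=worker)
thread.start()
audio_callback([0.0] * 1024)
audio_callback([0.0] * 512)
audio_queue.put(None)
thread.join()
print(processed_sizes)
```

The queue preserves chunk order, and the sentinel gives the worker a clean way to exit when recording stops.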

Usage Examples

Basic usage

from src import RealTimeVTT

# Create the application instance
app = RealTimeVTT()

# Initialize
if app.initialize():
    # Run interactive recognition
    app.run_interactive()
else:
    print("Initialization failed")

# Release resources
app.cleanup()

Custom callbacks

from src import SpeechRecognizer, ModelConfig, RecognitionResult

def my_result_callback(result: RecognitionResult):
    print(f"Result: {result.text}")
    print(f"Timestamp: {result.timestamp}")
    print(f"Confidence: {result.confidence}")

def my_partial_callback(text: str):
    print(f"Partial result: {text}")

# Create the recognizer
config = ModelConfig()
recognizer = SpeechRecognizer(config)

# Register the callbacks
recognizer.set_result_callback(my_result_callback)
recognizer.set_partial_result_callback(my_partial_callback)

# Initialize and use
if recognizer.initialize():
    # Process audio data
    # recognizer.process_audio(audio_data)
    pass

Audio device management

import time

from src import AudioProcessor, AudioConfig

# Create the audio processor
config = AudioConfig()
processor = AudioProcessor(config)

# Initialize
if processor.initialize():
    # List the devices
    devices = processor.get_device_list()
    for device in devices:
        print(f"Device {device['index']}: {device['name']}")
    
    # Start recording
    def audio_callback(data):
        print(f"Received audio data: {len(data)} samples")
    
    processor.start_recording(audio_callback)
    
    # Keep recording briefly (assumes start_recording() returns immediately)
    time.sleep(5)
    
    # Stop recording
    processor.stop_recording()
    
    # Release resources
    processor.cleanup()

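Error Handling

Exception types

The system may raise the following exceptions:

FileNotFoundError: a model file does not exist
RuntimeError: audio device initialization failed
ValueError: invalid configuration parameter
ImportError: a required dependency is missing

Error handling example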

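from src import RealTimeVTT

app = None
try:
    app = RealTimeVTT()
    if not app.initialize():
        raise RuntimeError("Application initialization failed")
    app.run_interactive()
except FileNotFoundError as e:
    print(f"File not found: {e}")
except RuntimeError as e:
    print(f"Runtime error: {e}")
except KeyboardInterrupt:
    print("Interrupted by user")
finally:
    # app is still None if the constructor itself raised
    if app is not None:
        app.cleanup()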

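Performance Considerations

Memory usage

Loading the model takes roughly 200-500 MB of memory
Audio buffers take roughly 10-50 MB of memory
At least 4 GB of system memory is recommended

CPU usage

Recognition runs mainly on the CPU
A multi-core CPU is recommended
The thread count can be adjusted with the num_threads parameter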
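
One way to pick a value for num_threads is to derive it from the number of available cores. The sketch below is a heuristic under stated assumptions: `suggest_num_threads` is a hypothetical helper, and both the reserved core and the upper bound of 4 are arbitrary choices, not recommendations from the project.

```python
import os


def suggest_num_threads(reserve: int = 1, upper: int = 4) -> int:
    """Leave `reserve` cores for audio capture and the UI, capped at `upper`."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(upper, cores - reserve))


print(suggest_num_threads())
```

The result could then be assigned to ModelConfig.num_threads before the recognizer is initialized.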

Latency tuning

The chunk_size parameter affects latency
A smaller chunk_size lowers latency but raises CPU usage
Recommended values: 1024-4096
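
The trade-off can be quantified: each chunk covers chunk_size / sample_rate seconds of audio, so a chunk cannot be delivered sooner than that, and smaller chunks mean more callbacks per second. The sketch below computes both figures for the default 16000 Hz sample rate; the helper name `chunk_duration_ms` is hypothetical.

```python
def chunk_duration_ms(chunk_size: int, sample_rate: int = 16000) -> float:
    """Duration of one audio chunk in milliseconds."""
    return chunk_size * 1000.0 / sample_rate


for size in (1024, 2048, 4096):
    duration = chunk_duration_ms(size)
    callbacks_per_second = 16000 / size
    print(f"chunk_size={size}: {duration:.0f} ms per chunk, "
          f"{callbacks_per_second:.1f} callbacks/s")
```

At 16000 Hz, the recommended range of 1024 to 4096 samples corresponds to chunks of 64 ms to 256 ms.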

Extension Development

Adding a new recognition model

Add the model information to ModelDownloader.MODELS
Update the model file mapping
Test model compatibility
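
Before adding an entry, it can help to check that it has the same fields that list_available_models() returns. The sketch below assumes ModelDownloader.MODELS uses that same per-model shape, which the source does not confirm; `missing_keys` and the sample entry are hypothetical.

```python
from typing import Dict, List

# Fields taken from the documented return value of list_available_models().
REQUIRED_KEYS = {"name", "description", "size", "url"}


def missing_keys(entry: Dict[str, str]) -> List[str]:
    """Return the required fields that the entry lacks, in sorted order."""
    return sorted(REQUIRED_KEYS - entry.keys())


# Hypothetical new entry that still lacks its size and URL.
new_entry = {"name": "Example Model", "description": "illustration only"}
print(missing_keys(new_entry))
```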

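Adding support for a new audio format

Modify the AudioConfig class
Update the initialization logic of AudioProcessor
Add format conversion code

Adding a new output format

Create a new output handler class
Integrate it into RealTimeVTT
Add the corresponding configuration options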
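
As an illustration of the first step, an output handler could be a small class that receives each result and writes it in the target format. The sketch below writes JSON Lines to any text stream. `JsonLinesWriter` and its `write` method are hypothetical names, and how such a class would plug into RealTimeVTT is not described in the source.

```python
import io
import json
from typing import Dict, TextIO


class JsonLinesWriter:
    """Hypothetical output handler: one JSON object per recognized result."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def write(self, result: Dict) -> None:
        # ensure_ascii=False keeps non-ASCII transcripts readable.
        self._stream.write(json.dumps(result, ensure_ascii=False) + "\n")


buffer = io.StringIO()
writer = JsonLinesWriter(buffer)
writer.write({"text": "hello", "timestamp": 1.0, "is_final": True, "confidence": 0.9})
print(buffer.getvalue(), end="")
```

Accepting a stream rather than a file path lets the same class write to a file, to standard output, or to an in-memory buffer in tests.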

This API documentation covers the system's main interfaces and how to use them. For more detail, refer to the comments in the source code.
