# API Documentation

## Overview

This document describes the modules and API interfaces of the real-time speech-to-text system.

## Core Modules

### 1. RealTimeVTT (Main Application Class)

**Location**: `src/realtime_vtt.py`

#### Class Definition

```python
# Signature summary; method bodies are omitted.
class RealTimeVTT:
    def __init__(self): ...
    def initialize(self) -> bool: ...
    def run_interactive(self): ...
    def list_audio_devices(self) -> List[Dict]: ...
    def cleanup(self): ...
```

#### Method Reference

##### `__init__()`

Creates the application instance, along with its configuration objects and components.

##### `initialize() -> bool`

Initializes all application components.

**Returns**:

- `bool`: `True` if initialization succeeds, `False` otherwise

**Behavior**:

- Initializes the audio processor
- Initializes the speech recognizer
- Registers the callbacks
- Creates the output directory

##### `run_interactive()`

Starts an interactive speech recognition session.

**Behavior**:

- Starts recording
- Starts the recognition loop
- Handles user input
- Displays recognition results

##### `list_audio_devices() -> List[Dict]`

Returns the list of available audio devices.

**Returns**:

```python
[
    {
        'index': int,        # device index
        'name': str,         # device name
        'channels': int,     # number of channels
        'sample_rate': int   # sample rate
    }
]
```

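The returned list can be filtered with ordinary dictionary access. The following is a minimal sketch, assuming the dictionary keys documented above; `pick_input_device` and the sample device entries are hypothetical and not part of the project.

```python
from typing import Dict, List, Optional


def pick_input_device(devices: List[Dict], name_hint: str = "") -> Optional[Dict]:
    """Return the first device with at least one channel whose name contains name_hint."""
    hint = name_hint.lower()
    for device in devices:
        if device["channels"] >= 1 and hint in device["name"].lower():
            return device
    return None


# Sample data shaped like the documented return value (the values are made up).
sample_devices = [
    {"index": 0, "name": "HDMI Output", "channels": 0, "sample_rate": 48000},
    {"index": 1, "name": "USB Microphone", "channels": 1, "sample_rate": 16000},
]

chosen = pick_input_device(sample_devices, name_hint="usb")
print(chosen["index"] if chosen else "no matching device")
```

In a real session the list would come from `app.list_audio_devices()` rather than from hard-coded sample data.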
##### `cleanup()`

Releases resources and stops recording and recognition.

### 2. SpeechRecognizer (Speech Recognizer)

**Location**: `src/speech_recognizer.py`

#### Class Definition

```python
# Signature summary; method bodies are omitted.
class SpeechRecognizer:
    def __init__(self, config: ModelConfig): ...
    def initialize(self) -> bool: ...
    def create_stream(self): ...
    def process_audio(self, audio_data: np.ndarray): ...
    def set_result_callback(self, callback): ...
    def set_partial_result_callback(self, callback): ...
    def cleanup(self): ...
```

#### Method Reference

##### `__init__(config: ModelConfig)`

Creates the speech recognizer.

**Parameters**:

- `config`: a `ModelConfig` instance holding the model configuration

##### `initialize() -> bool`

Initializes the recognizer and loads the model.

**Returns**:

- `bool`: `True` if initialization succeeds, `False` otherwise

##### `create_stream()`

Creates a recognition stream.

**Returns**:

- The recognition stream object

##### `process_audio(audio_data: np.ndarray)`

Processes audio data and runs recognition on it.

**Parameters**:

- `audio_data`: a numpy array of audio samples

##### `set_result_callback(callback)`

Sets the callback for final recognition results.

**Parameters**:

- `callback`: callback function with the signature `callback(result: RecognitionResult)`

##### `set_partial_result_callback(callback)`

Sets the callback for partial recognition results.

**Parameters**:

- `callback`: callback function with the signature `callback(result: str)`

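To illustrate how the two setter methods relate to the callback signatures, here is a minimal sketch of a callback holder. `CallbackRegistry`, `emit_partial`, and `emit_final` are hypothetical names introduced for illustration; how `SpeechRecognizer` stores and fires its callbacks internally is not documented here and may differ.

```python
from typing import Any, Callable, List, Optional


class CallbackRegistry:
    """Hypothetical stand-in showing one way callbacks could be stored and fired."""

    def __init__(self) -> None:
        self._on_result: Optional[Callable[[Any], None]] = None
        self._on_partial: Optional[Callable[[str], None]] = None

    def set_result_callback(self, callback: Callable[[Any], None]) -> None:
        self._on_result = callback

    def set_partial_result_callback(self, callback: Callable[[str], None]) -> None:
        self._on_partial = callback

    def emit_partial(self, text: str) -> None:
        # Fire only if a partial-result callback was registered.
        if self._on_partial is not None:
            self._on_partial(text)

    def emit_final(self, result: Any) -> None:
        # Fire only if a final-result callback was registered.
        if self._on_result is not None:
            self._on_result(result)


partials: List[str] = []
finals: List[Any] = []

registry = CallbackRegistry()
registry.set_partial_result_callback(partials.append)
registry.set_result_callback(finals.append)

registry.emit_partial("hel")
registry.emit_partial("hello")
registry.emit_final({"text": "hello", "is_final": True})
print(partials, finals)
```

Leaving a callback unset is harmless in this sketch because each emit method checks for `None` first.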
### 3. AudioProcessor (Audio Processor)

**Location**: `src/audio_processor.py`

#### Class Definition

```python
# Signature summary; method bodies are omitted.
class AudioProcessor:
    def __init__(self, config: AudioConfig): ...
    def initialize(self) -> bool: ...
    def start_recording(self, callback): ...
    def stop_recording(self): ...
    def get_device_list(self) -> List[Dict]: ...
    def cleanup(self): ...
```

#### Method Reference

##### `__init__(config: AudioConfig)`

Creates the audio processor.

**Parameters**:

- `config`: an `AudioConfig` instance holding the audio configuration

##### `initialize() -> bool`

Initializes the audio device.

**Returns**:

- `bool`: `True` if initialization succeeds, `False` otherwise

##### `start_recording(callback)`

Starts recording.

**Parameters**:

- `callback`: audio data callback with the signature `callback(audio_data: np.ndarray)`

##### `stop_recording()`

Stops recording.

##### `get_device_list() -> List[Dict]`

Returns the list of audio devices.

**Returns**:

```python
[
    {
        'index': int,
        'name': str,
        'max_input_channels': int,
        'default_sample_rate': float
    }
]
```

### 4. ModelDownloader (Model Downloader)

**Location**: `src/model_downloader.py`

#### Class Definition

```python
# Signature summary; method bodies are omitted.
class ModelDownloader:
    def __init__(self, config: ModelConfig): ...
    def download_model(self, model_name: str, force: bool = False): ...
    def list_available_models(self) -> Dict: ...
    def get_model_status(self) -> Dict: ...
    def interactive_download(self): ...
```

#### Method Reference

##### `download_model(model_name: str, force: bool = False)`

Downloads the specified model.

**Parameters**:

- `model_name`: name of the model
- `force`: whether to force a fresh download

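The `force` flag typically distinguishes "download only if missing" from "always download". The sketch below shows that decision under the assumption that a model counts as present when its target path exists; `should_download` is a hypothetical helper, and the actual check inside `download_model` may be different.

```python
import tempfile
from pathlib import Path


def should_download(model_path: Path, force: bool = False) -> bool:
    """Download when forced, or when the target path does not exist yet."""
    return force or not model_path.exists()


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "model-a"  # placeholder directory names
    existing.mkdir()
    missing = Path(tmp) / "model-b"

    print(should_download(existing))              # False
    print(should_download(existing, force=True))  # True
    print(should_download(missing))               # True
```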
##### `list_available_models() -> Dict`

Returns the available models.

**Returns**:

```python
{
    'model_key': {
        'name': str,
        'description': str,
        'size': str,
        'url': str
    }
}
```

##### `interactive_download()`

Downloads a model interactively.

## Data Structures

### RecognitionResult

**Location**: `src/speech_recognizer.py`

```python
class RecognitionResult:
    def __init__(self, text: str, timestamp: float, is_final: bool = True): ...

    # Attributes
    text: str          # recognized text
    timestamp: float   # timestamp
    is_final: bool     # whether this is a final result
    confidence: float  # confidence score
```

#### Methods

##### `to_dict() -> Dict`

Converts the result to a dictionary.

**Returns**:

```python
{
    'text': str,
    'timestamp': float,
    'is_final': bool,
    'confidence': float
}
```

##### `__str__() -> str`

Returns a formatted string representation.

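For readers who want to experiment with this shape without loading the project, the sketch below mirrors the documented fields with a dataclass. `ResultSketch` is a hypothetical stand-in: the `confidence` default and the `__str__` format are assumptions, since the source does not specify either.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class ResultSketch:
    """Hypothetical stand-in mirroring the documented RecognitionResult fields."""

    text: str
    timestamp: float = field(default_factory=time.time)
    is_final: bool = True
    confidence: float = 0.0  # assumption: the source does not state a default

    def to_dict(self) -> dict:
        # asdict() yields the four documented keys in declaration order.
        return asdict(self)

    def __str__(self) -> str:
        # Assumed format; the real class may format results differently.
        marker = "final" if self.is_final else "partial"
        return f"[{self.timestamp:.2f}] ({marker}) {self.text}"


result = ResultSketch(text="hello world", timestamp=12.5)
print(result)
print(json.dumps(result.to_dict()))
```

Because `to_dict()` returns only built-in types, its output can be passed straight to `json.dumps`.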
### RecognitionSession

**Location**: `src/speech_recognizer.py`

```python
class RecognitionSession:
    def __init__(self): ...

    # Attributes
    results: List[RecognitionResult]  # list of recognition results
    start_time: float                 # session start time
    is_active: bool                   # whether the session is active
```

#### Methods

##### `add_result(result: RecognitionResult)`

Adds a recognition result to the session.

##### `get_duration() -> float`

Returns the duration of the session.

##### `to_dict() -> Dict`

Converts the session to a dictionary.

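The sketch below shows how these three methods could fit together. `SessionSketch` is a hypothetical stand-in; the use of `time.time()`, the duration being in seconds, and the keys returned by `to_dict()` are all assumptions, because the source does not document them.

```python
import time
from typing import Dict, List


class SessionSketch:
    """Hypothetical stand-in mirroring the documented RecognitionSession attributes."""

    def __init__(self) -> None:
        self.results: List[Dict] = []
        self.start_time: float = time.time()
        self.is_active: bool = True

    def add_result(self, result: Dict) -> None:
        self.results.append(result)

    def get_duration(self) -> float:
        # Assumed unit: seconds elapsed since the session was created.
        return time.time() - self.start_time

    def to_dict(self) -> Dict:
        # Assumed layout; the real method may use different keys.
        return {
            "start_time": self.start_time,
            "duration": self.get_duration(),
            "is_active": self.is_active,
            "results": list(self.results),
        }


session = SessionSketch()
session.add_result({"text": "hello", "is_final": True})
print(len(session.results), sorted(session.to_dict()))
```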
## Configuration Classes

### ModelConfig

**Location**: `src/config.py`

```python
class ModelConfig:
    # Model file paths
    model_dir: Path
    tokens: str
    encoder: str
    decoder: str
    joiner: str

    # Speech recognition parameters
    sample_rate: int = 16000
    feature_dim: int = 80
    num_threads: int = 1

    # Endpoint detection parameters
    enable_endpoint: bool = True
    enable_endpoint_detection: bool = True
    rule1_min_trailing_silence: float = 2.4
    rule2_min_trailing_silence: float = 1.2
    rule3_min_utterance_length: int = 300

    # Decoding method
    decoding_method: str = "greedy_search"
    max_active_paths: int = 4
    provider: str = "cpu"
```

#### Methods

##### `validate_model_files() -> List[str]`

Checks whether the model files exist.

**Returns**:

- `List[str]`: paths of the missing files

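A check of this kind can be written with `pathlib`. The following sketch returns the missing paths for a given directory; `find_missing_files` and the file names are hypothetical, and the project's own method may decide which files to check differently.

```python
import tempfile
from pathlib import Path
from typing import List


def find_missing_files(model_dir: Path, filenames: List[str]) -> List[str]:
    """Return the paths (as strings) of files that are not present in model_dir."""
    return [
        str(model_dir / name)
        for name in filenames
        if not (model_dir / name).is_file()
    ]


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "tokens.txt").write_text("a 0\n")
    # File names are placeholders, not the project's actual model files.
    missing = find_missing_files(model_dir, ["tokens.txt", "encoder.onnx"])
    print(missing)
```

An empty list means every expected file was found, so callers can treat the result as a simple truthiness check.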
### AudioConfig

**Location**: `src/config.py`

```python
class AudioConfig:
    sample_rate: int = 16000   # sample rate
    chunk_size: int = 1024     # audio chunk size
    channels: int = 1          # number of channels
    format: Any = None         # audio format
    samples_per_read: int      # samples per read
```

### AppConfig

**Location**: `src/config.py`

```python
class AppConfig:
    show_partial_results: bool = True  # show partial results
    show_timestamps: bool = True       # show timestamps
    log_level: str = "INFO"            # log level
    log_file: Path                     # log file path
    output_file: Path                  # output file path
    save_to_file: bool = True          # save results to a file
```

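Since `log_level` is stored as a string, it has to be mapped to a numeric level before it can be used with Python's `logging` module. The sketch below shows one way to do that with validation; `resolve_log_level` and the logger name `vtt_sketch` are hypothetical, and the project may configure logging differently.

```python
import logging


def resolve_log_level(name: str) -> int:
    """Map a level name such as "INFO" to its numeric logging level."""
    level = getattr(logging, name.upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {name!r}")
    return level


logger = logging.getLogger("vtt_sketch")
logger.setLevel(resolve_log_level("INFO"))
print(logger.level == logging.INFO)
```

Raising `ValueError` for an unknown name keeps a misconfigured level from being silently ignored.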
## Callback Interfaces

### Audio Data Callback

```python
def audio_callback(audio_data: np.ndarray) -> None:
    """
    Audio data callback.

    Args:
        audio_data: audio samples as a numpy array of shape (samples,)
    """
    pass
```

### Recognition Result Callback

```python
def result_callback(result: RecognitionResult) -> None:
    """
    Final recognition result callback.

    Args:
        result: the recognition result object
    """
    pass
```

### Partial Recognition Result Callback

```python
def partial_result_callback(text: str) -> None:
    """
    Partial recognition result callback.

    Args:
        text: the partial recognition text
    """
    pass
```

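Callbacks like these are often kept very short, with the heavy work handed to another thread. The sketch below shows that pattern with a bounded queue. It assumes the audio callback may be invoked from a capture thread, which the source does not state; the queue size, the drop-on-full policy, and the `None` sentinel are illustrative choices, and plain lists stand in for numpy arrays.

```python
import queue
import threading
from typing import List

audio_queue = queue.Queue(maxsize=100)
processed: List[int] = []


def audio_callback(audio_data) -> None:
    """Keep the callback cheap: enqueue and return, dropping the chunk if the queue is full."""
    try:
        audio_queue.put_nowait(audio_data)
    except queue.Full:
        pass  # assumption: dropping a chunk is preferable to blocking the capture thread


def worker() -> None:
    """Consume chunks until a None sentinel arrives."""
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        processed.append(len(chunk))  # stand-in for recognizer.process_audio(chunk)


thread = threading.Thread(target=worker, daemon=True)
thread.start()

for _ in range(3):
    audio_callback([0.0] * 1024)  # simulated chunks; real data would be np.ndarray
audio_queue.put(None)
thread.join(timeout=5)
print(processed)
```

The bounded queue caps memory use if recognition falls behind the audio input.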
## Usage Examples

### Basic Usage

```python
from src import RealTimeVTT

# Create the application instance
app = RealTimeVTT()

# Initialize
if app.initialize():
    # Run interactive recognition
    app.run_interactive()
else:
    print("Initialization failed")

# Release resources
app.cleanup()
```

### Custom Callbacks

```python
from src import SpeechRecognizer, ModelConfig, RecognitionResult

def my_result_callback(result: RecognitionResult):
    print(f"Result: {result.text}")
    print(f"Timestamp: {result.timestamp}")
    print(f"Confidence: {result.confidence}")

def my_partial_callback(text: str):
    print(f"Partial result: {text}")

# Create the recognizer
config = ModelConfig()
recognizer = SpeechRecognizer(config)

# Register the callbacks
recognizer.set_result_callback(my_result_callback)
recognizer.set_partial_result_callback(my_partial_callback)

# Initialize and use
if recognizer.initialize():
    # Process audio data
    # recognizer.process_audio(audio_data)
    pass
```

### Audio Device Management

```python
from src import AudioProcessor, AudioConfig

# Create the audio processor
config = AudioConfig()
processor = AudioProcessor(config)

# Initialize
if processor.initialize():
    # List the devices
    devices = processor.get_device_list()
    for device in devices:
        print(f"Device {device['index']}: {device['name']}")

    # Start recording
    def audio_callback(data):
        print(f"Received audio data: {len(data)} samples")

    processor.start_recording(audio_callback)

    # Stop recording
    processor.stop_recording()

# Clean up
processor.cleanup()
```

## Error Handling

### Exception Types

The system may raise the following exceptions:

- `FileNotFoundError`: a model file does not exist
- `RuntimeError`: audio device initialization failed
- `ValueError`: invalid configuration parameter
- `ImportError`: a required library is missing

### Error Handling Example

```python
from src import RealTimeVTT

app = None
try:
    app = RealTimeVTT()
    if not app.initialize():
        raise RuntimeError("Application initialization failed")
    app.run_interactive()
except FileNotFoundError as e:
    print(f"File not found: {e}")
except RuntimeError as e:
    print(f"Runtime error: {e}")
except KeyboardInterrupt:
    print("Interrupted by user")
finally:
    # app is still None if the constructor itself raised
    if app is not None:
        app.cleanup()
```

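The cleanup guarantee can also be packaged as a context manager so that callers cannot forget it. This is a sketch, not part of the project: `managed` and `FakeApp` are hypothetical, and `FakeApp` only stands in for `RealTimeVTT` so that the example runs on its own.

```python
from contextlib import contextmanager


@contextmanager
def managed(app):
    """Yield app and always call app.cleanup() afterwards, even on error."""
    try:
        yield app
    finally:
        app.cleanup()


class FakeApp:
    """Hypothetical stand-in for RealTimeVTT, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self.cleaned_up = False

    def cleanup(self) -> None:
        self.cleaned_up = True


fake = FakeApp()
try:
    with managed(fake):
        raise RuntimeError("simulated failure")
except RuntimeError as exc:
    print(f"Runtime error: {exc}")
print(fake.cleaned_up)
```

The exception still propagates to the caller; the context manager only ensures that `cleanup()` runs first.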
## Performance Considerations

### Memory Usage

- Loading a model takes roughly 200-500 MB of memory
- Audio buffers take roughly 10-50 MB of memory
- At least 4 GB of system memory is recommended

### CPU Usage

- Recognition runs mainly on the CPU
- A multi-core CPU is recommended
- The number of threads can be adjusted with the `num_threads` parameter

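One way to pick a value for `num_threads` is to derive it from the number of available cores. The sketch below does this with an upper bound; `suggest_num_threads` and the cap of 4 are assumptions for illustration, not a recommendation from the project.

```python
import os


def suggest_num_threads(cap: int = 4) -> int:
    """Use the available cores, but never fewer than 1 or more than cap."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(cap, cores))


print(suggest_num_threads())
```

The result could then be assigned to `ModelConfig.num_threads` before the recognizer is initialized.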
### Latency Tuning

- The `chunk_size` parameter affects latency
- A smaller `chunk_size` lowers latency but raises CPU usage
- Recommended range: 1024-4096

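The latency effect follows from simple arithmetic: a chunk cannot be delivered until all of its samples have been captured, so each chunk adds `chunk_size / sample_rate` seconds of buffering. The sketch below computes this for the recommended range, assuming the 16000 Hz default from `AudioConfig`; `chunk_duration_ms` is a hypothetical helper.

```python
def chunk_duration_ms(chunk_size: int, sample_rate: int = 16000) -> float:
    """Return the time in milliseconds needed to capture one chunk."""
    return 1000.0 * chunk_size / sample_rate


for size in (1024, 2048, 4096):
    print(f"chunk_size={size}: {chunk_duration_ms(size):.0f} ms per chunk")
```

At 16000 Hz this gives 64 ms for 1024 samples and 256 ms for 4096 samples. It covers buffering delay only and excludes recognition time.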
## Extending the System

### Adding a New Recognition Model

1. Add the model's information to `ModelDownloader.MODELS`
2. Update the model file mapping
3. Test the model's compatibility

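As a sketch of the first step, the entry below uses the four fields that `list_available_models()` is documented to return. The model key, every value, and the `example.com` URL are placeholders, and `invalid_entries` is a hypothetical helper; the real `MODELS` dictionary may carry additional fields.

```python
from typing import Dict, Set

REQUIRED_KEYS = {"name", "description", "size", "url"}

# Hypothetical registry entry; key and values are placeholders for illustration.
MODELS: Dict[str, Dict[str, str]] = {
    "example-model": {
        "name": "Example streaming model",
        "description": "Placeholder entry used only for illustration",
        "size": "100MB",
        "url": "https://example.com/models/example-model.tar.bz2",
    },
}


def invalid_entries(models: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each model key to the set of required fields it is missing."""
    problems: Dict[str, Set[str]] = {}
    for key, info in models.items():
        missing = REQUIRED_KEYS - set(info)
        if missing:
            problems[key] = missing
    return problems


print(invalid_entries(MODELS))
```

A check like this can run in a unit test so that an incomplete entry is caught before a download is attempted.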
### Adding Support for a New Audio Format

1. Modify the `AudioConfig` class
2. Update the initialization logic of `AudioProcessor`
3. Add the format conversion code

### Adding a New Output Format

1. Create a new output handler class
2. Integrate it into `RealTimeVTT`
3. Add the corresponding configuration options

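As an illustration of the first step, the sketch below defines an output handler that writes one JSON object per line. `JsonLinesWriter` and its `write` method are hypothetical; the interface that `RealTimeVTT` expects from an output handler is not documented here, so the integration step would need to be adapted to the actual code.

```python
import io
import json
from typing import Dict, TextIO


class JsonLinesWriter:
    """Hypothetical output handler: writes one JSON object per line."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def write(self, result: Dict) -> None:
        # ensure_ascii=False keeps non-ASCII recognition text readable in the output.
        self._stream.write(json.dumps(result, ensure_ascii=False) + "\n")


buffer = io.StringIO()
writer = JsonLinesWriter(buffer)
writer.write({"text": "hello", "timestamp": 1.0, "is_final": True, "confidence": 0.9})
print(buffer.getvalue(), end="")
```

Passing a file opened with `encoding="utf-8"` instead of `io.StringIO` would persist the results to disk.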
---

This API documentation covers the system's main interfaces and how to use them. For more detail, see the comments in the source code.