realTimeVTT/docs/API文档.md

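# API Documentation

## Overview

This document describes in detail the modules and API interfaces of the real-time speech-to-text system.

## Core Modules

### 1. RealTimeVTT (Main Application Class)

**Location**: `src/realtime_vtt.py`

#### Class Definition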
```python
class RealTimeVTT:
    def __init__(self)
    def initialize(self) -> bool
    def run_interactive(self)
    def list_audio_devices(self) -> List[Dict]
    def cleanup(self)
```

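#### Methods

##### `__init__()`
Creates the application instance, including its configuration objects and components.

##### `initialize() -> bool`
Initializes all components of the application.

**Returns**:
- `bool`: `True` if initialization succeeds, `False` otherwise

**Behavior**:
- Initializes the audio processor
- Initializes the speech recognizer
- Registers the callback functions
- Creates the output directory

##### `run_interactive()`
Starts an interactive speech recognition session.

**Behavior**:
- Starts recording
- Starts the recognition loop
- Handles user input
- Displays recognition results

##### `list_audio_devices() -> List[Dict]`
Returns the list of available audio devices.

**Returns**:
```python
[
    {
        'index': int,        # device index
        'name': str,         # device name
        'channels': int,     # number of channels
        'sample_rate': int   # sample rate
    }
]
```

##### `cleanup()`
Releases resources and stops recording and recognition.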
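
Because `initialize()` reports failure by returning `False` and `cleanup()` must run afterwards, the lifecycle can be wrapped in a context manager. The following is a minimal sketch, not part of the project: `managed` and `FakeApp` are hypothetical names, and `FakeApp` only mimics the `initialize()`/`cleanup()` shape documented above so the example runs without audio hardware.

```python
from contextlib import contextmanager


@contextmanager
def managed(component):
    """Initialize a component and guarantee that cleanup() runs afterwards."""
    try:
        if not component.initialize():
            # The documented API signals failure with False, not an exception.
            raise RuntimeError(f"{type(component).__name__} failed to initialize")
        yield component
    finally:
        # Runs on success, on failure inside the with-block, and on failed init.
        component.cleanup()


class FakeApp:
    """Hypothetical stand-in with the same initialize()/cleanup() shape."""

    def __init__(self):
        self.calls = []

    def initialize(self):
        self.calls.append("initialize")
        return True

    def cleanup(self):
        self.calls.append("cleanup")


fake = FakeApp()
with managed(fake) as app:
    app.calls.append("run")
print(fake.calls)  # ['initialize', 'run', 'cleanup']
```

With the real class, the same wrapper would presumably be used as `with managed(RealTimeVTT()) as app: app.run_interactive()`.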

### 2. SpeechRecognizer (Speech Recognizer)

**Location**: `src/speech_recognizer.py`

#### Class Definition
```python
class SpeechRecognizer:
    def __init__(self, config: ModelConfig)
    def initialize(self) -> bool
    def create_stream(self)
    def process_audio(self, audio_data: np.ndarray)
    def set_result_callback(self, callback)
    def set_partial_result_callback(self, callback)
    def cleanup(self)
```

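#### Methods

##### `__init__(config: ModelConfig)`
Creates the speech recognizer.

**Parameters**:
- `config`: a `ModelConfig` instance holding the model configuration

##### `initialize() -> bool`
Initializes the recognizer and the model.

**Returns**:
- `bool`: `True` if initialization succeeds, `False` otherwise

##### `create_stream()`
Creates a recognition stream.

**Returns**:
- The recognition stream object

##### `process_audio(audio_data: np.ndarray)`
Processes audio data and runs recognition on it.

**Parameters**:
- `audio_data`: a numpy array of audio samples

##### `set_result_callback(callback)`
Sets the callback for final recognition results.

**Parameters**:
- `callback`: callback function with the signature `callback(result: RecognitionResult)`

##### `set_partial_result_callback(callback)`
Sets the callback for partial recognition results.

**Parameters**:
- `callback`: callback function with the signature `callback(text: str)`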

### 3. AudioProcessor (Audio Processor)

**Location**: `src/audio_processor.py`

#### Class Definition
```python
class AudioProcessor:
    def __init__(self, config: AudioConfig)
    def initialize(self) -> bool
    def start_recording(self, callback)
    def stop_recording(self)
    def get_device_list(self) -> List[Dict]
    def cleanup(self)
```

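#### Methods

##### `__init__(config: AudioConfig)`
Creates the audio processor.

**Parameters**:
- `config`: an `AudioConfig` instance holding the audio configuration

##### `initialize() -> bool`
Initializes the audio device.

**Returns**:
- `bool`: `True` if initialization succeeds, `False` otherwise

##### `start_recording(callback)`
Starts recording.

**Parameters**:
- `callback`: audio data callback with the signature `callback(audio_data: np.ndarray)`

##### `stop_recording()`
Stops recording.

##### `get_device_list() -> List[Dict]`
Returns the list of audio devices.

**Returns**: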
```python
[
    {
        'index': int,
        'name': str,
        'max_input_channels': int,
        'default_sample_rate': float
    }
]
```
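
The returned list can be filtered before recording starts. The sketch below assumes only the dictionary keys shown above; `pick_input_device` and the sample device entries are hypothetical and exist purely for illustration.

```python
from typing import Dict, List, Optional


def pick_input_device(devices: List[Dict], name_hint: str = "") -> Optional[Dict]:
    """Return the first device that can record and whose name contains name_hint."""
    for device in devices:
        if device["max_input_channels"] < 1:
            continue  # output-only device, cannot be used for recording
        if name_hint.lower() in device["name"].lower():
            return device
    return None


# Illustrative data in the documented shape; real values come from get_device_list().
sample_devices = [
    {"index": 0, "name": "HDMI Output", "max_input_channels": 0, "default_sample_rate": 48000.0},
    {"index": 1, "name": "USB Microphone", "max_input_channels": 1, "default_sample_rate": 44100.0},
]

chosen = pick_input_device(sample_devices, "usb")
print(chosen["index"] if chosen else "no input device found")  # 1
```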

### 4. ModelDownloader (Model Downloader)

**Location**: `src/model_downloader.py`

#### Class Definition
```python
class ModelDownloader:
    def __init__(self, config: ModelConfig)
    def download_model(self, model_name: str, force: bool = False)
    def list_available_models(self) -> Dict
    def get_model_status(self) -> Dict
    def interactive_download(self)
```

#### Methods

##### `download_model(model_name: str, force: bool = False)`
Downloads the specified model.

**Parameters**:
- `model_name`: name of the model
- `force`: whether to force a re-download

##### `list_available_models() -> Dict`
Returns the available models.

**Returns**:
```python
{
    'model_key': {
        'name': str,
        'description': str,
        'size': str,
        'url': str
    }
}
```
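
As an illustration of consuming this structure, the sketch below renders it as a plain-text table. `format_model_table`, the model keys, and the `example.com` URLs are all made up for the example; only the four field names come from the structure above.

```python
def format_model_table(models: dict) -> str:
    """Render the documented model dictionary as one aligned line per model."""
    rows = []
    for key in sorted(models):
        info = models[key]
        rows.append(f"{key:<10} {info['size']:>7}  {info['name']}: {info['description']}")
    return "\n".join(rows)


# Hypothetical entries; real ones come from list_available_models().
sample_models = {
    "small-zh": {
        "name": "Small Chinese model",
        "description": "illustrative entry",
        "size": "40MB",
        "url": "https://example.com/small-zh.tar.bz2",
    },
    "large-en": {
        "name": "Large English model",
        "description": "illustrative entry",
        "size": "300MB",
        "url": "https://example.com/large-en.tar.bz2",
    },
}

table = format_model_table(sample_models)
print(table)
```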

##### `interactive_download()`
Downloads a model interactively.

## Data Structures

### RecognitionResult

**Location**: `src/speech_recognizer.py`

```python
class RecognitionResult:
    def __init__(self, text: str, timestamp: float, is_final: bool = True)

    # Attributes
    text: str           # recognized text
    timestamp: float    # timestamp
    is_final: bool      # whether this is a final result
    confidence: float   # confidence score
```

#### Methods

##### `to_dict() -> Dict`
Converts the result to a dictionary.

**Returns**:
```python
{
    'text': str,
    'timestamp': float,
    'is_final': bool,
    'confidence': float
}
```

##### `__str__() -> str`
Returns a formatted string representation.
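
To make the shape concrete, here is a minimal sketch of a class with the same attributes and methods, written as a dataclass. It is not the project's implementation: the name `ResultSketch`, the default `confidence` of `0.0`, the interpretation of `timestamp` as Unix epoch seconds, and the string format are all assumptions.

```python
import time
from dataclasses import asdict, dataclass


@dataclass
class ResultSketch:
    text: str
    timestamp: float          # assumed to be Unix epoch seconds
    is_final: bool = True
    confidence: float = 0.0   # assumed default; the source does not state one

    def to_dict(self) -> dict:
        # asdict() yields exactly the four documented keys.
        return asdict(self)

    def __str__(self) -> str:
        clock = time.strftime("%H:%M:%S", time.localtime(self.timestamp))
        kind = "final" if self.is_final else "partial"
        return f"[{clock}] ({kind}) {self.text}"


result = ResultSketch(text="hello world", timestamp=time.time())
print(result.to_dict())
print(result)
```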

### RecognitionSession

**Location**: `src/speech_recognizer.py`

```python
class RecognitionSession:
    def __init__(self)

    # Attributes
    results: List[RecognitionResult]  # list of recognition results
    start_time: float                 # session start time
    is_active: bool                   # whether the session is active
```

#### Methods

##### `add_result(result: RecognitionResult)`
Adds a recognition result.

##### `get_duration() -> float`
Returns the duration of the session.

##### `to_dict() -> Dict`
Converts the session to a dictionary.
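
The source does not list the keys that the session's `to_dict()` produces, so the following sketch picks a plausible layout purely for illustration. `SessionSketch`, `_Result`, and the key names `duration` and `results` are assumptions; the real class may differ.

```python
import json
import time


class SessionSketch:
    """Illustrative stand-in for RecognitionSession; not the project's code."""

    def __init__(self):
        self.results = []
        self.start_time = time.time()
        self.is_active = True

    def add_result(self, result) -> None:
        self.results.append(result)

    def get_duration(self) -> float:
        # Seconds elapsed since the session was created.
        return time.time() - self.start_time

    def to_dict(self) -> dict:
        return {
            "start_time": self.start_time,
            "duration": self.get_duration(),
            "is_active": self.is_active,
            # Each stored result is expected to expose to_dict().
            "results": [r.to_dict() for r in self.results],
        }


class _Result:
    """Tiny result stand-in so the sketch runs on its own."""

    def __init__(self, text):
        self.text = text

    def to_dict(self):
        return {"text": self.text}


session = SessionSketch()
session.add_result(_Result("hello"))
print(json.dumps(session.to_dict(), ensure_ascii=False, indent=2))
```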

## Configuration Classes

### ModelConfig

**Location**: `src/config.py`

```python
class ModelConfig:
    # Model file paths
    model_dir: Path
    tokens: str
    encoder: str
    decoder: str
    joiner: str

    # Speech recognition parameters
    sample_rate: int = 16000
    feature_dim: int = 80
    num_threads: int = 1

    # Endpoint detection parameters
    enable_endpoint: bool = True
    enable_endpoint_detection: bool = True
    rule1_min_trailing_silence: float = 2.4
    rule2_min_trailing_silence: float = 1.2
    rule3_min_utterance_length: int = 300

    # Decoding method
    decoding_method: str = "greedy_search"
    max_active_paths: int = 4
    provider: str = "cpu"
```

#### Methods

##### `validate_model_files() -> List[str]`
Checks whether the model files exist.

**Returns**:
- `List[str]`: paths of the files that are missing
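
A check of this kind can be sketched with `pathlib`. The helper `find_missing_files` and the file names `tokens.txt` and `encoder.onnx` are hypothetical; the actual method presumably reads the names from the `tokens`, `encoder`, `decoder`, and `joiner` fields.

```python
import tempfile
from pathlib import Path
from typing import List


def find_missing_files(model_dir: Path, filenames: List[str]) -> List[str]:
    """Return the full paths of the files that do not exist under model_dir."""
    return [str(model_dir / name) for name in filenames
            if not (model_dir / name).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "tokens.txt").write_text("a 0\n")  # pretend one file was downloaded
    missing = find_missing_files(model_dir, ["tokens.txt", "encoder.onnx"])
    print(missing)  # one entry, ending in encoder.onnx
```

An empty list then means every required file is present.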

### AudioConfig

**Location**: `src/config.py`

```python
class AudioConfig:
    sample_rate: int = 16000      # sample rate
    chunk_size: int = 1024        # audio chunk size
    channels: int = 1             # number of channels
    format: Any = None            # audio format
    samples_per_read: int         # samples per read
```

### AppConfig

**Location**: `src/config.py`

```python
class AppConfig:
    show_partial_results: bool = True     # show partial results
    show_timestamps: bool = True          # show timestamps
    log_level: str = "INFO"              # log level
    log_file: Path                       # log file path
    output_file: Path                    # output file path
    save_to_file: bool = True            # save results to a file
```

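## Callback Interfaces

### Audio Data Callback

```python
def audio_callback(audio_data: np.ndarray) -> None:
    """
    Audio data callback.

    Args:
        audio_data: audio data as a numpy array of shape (samples,)
    """
    pass
```

### Recognition Result Callback

```python
def result_callback(result: RecognitionResult) -> None:
    """
    Callback for final recognition results.

    Args:
        result: the recognition result object
    """
    pass
```

### Partial Result Callback

```python
def partial_result_callback(text: str) -> None:
    """
    Callback for partial recognition results.

    Args:
        text: the partial recognition text
    """
    pass
```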
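
The two recognition callbacks are often used together: partial text is overwritten as it arrives, and a final result closes the utterance. The sketch below shows one way to combine them. `TranscriptCollector` is a hypothetical helper, and the lock reflects an assumption that callbacks may fire on a worker thread, which the source does not state.

```python
import threading
from types import SimpleNamespace


class TranscriptCollector:
    """Collects final results and tracks the most recent partial text."""

    def __init__(self):
        self._lock = threading.Lock()
        self._final_lines = []
        self._pending = ""

    def on_partial(self, text: str) -> None:
        # A new partial replaces the previous one; it is never appended.
        with self._lock:
            self._pending = text

    def on_final(self, result) -> None:
        # A final result closes the utterance, so the pending partial is cleared.
        with self._lock:
            self._final_lines.append(result.text)
            self._pending = ""

    def snapshot(self) -> str:
        with self._lock:
            lines = list(self._final_lines)
            if self._pending:
                lines.append(self._pending + " ...")
            return "\n".join(lines)


collector = TranscriptCollector()
collector.on_partial("hello wor")
collector.on_final(SimpleNamespace(text="hello world"))  # stand-in for RecognitionResult
collector.on_partial("how are")
print(collector.snapshot())
```

With a real recognizer, `collector.on_final` and `collector.on_partial` would be passed to `set_result_callback` and `set_partial_result_callback` respectively.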

## Usage Examples

### Basic Usage

```python
from src import RealTimeVTT

# Create the application instance
app = RealTimeVTT()

# Initialize
if app.initialize():
    # Run interactive recognition
    app.run_interactive()
else:
    print("Initialization failed")

# Release resources
app.cleanup()
```

### Custom Callbacks

```python
from src import SpeechRecognizer, ModelConfig, RecognitionResult

def my_result_callback(result: RecognitionResult):
    print(f"Result: {result.text}")
    print(f"Timestamp: {result.timestamp}")
    print(f"Confidence: {result.confidence}")

def my_partial_callback(text: str):
    print(f"Partial result: {text}")

# Create the recognizer
config = ModelConfig()
recognizer = SpeechRecognizer(config)

# Register the callbacks
recognizer.set_result_callback(my_result_callback)
recognizer.set_partial_result_callback(my_partial_callback)

# Initialize and use
if recognizer.initialize():
    # Process audio data
    # recognizer.process_audio(audio_data)
    pass
```

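### Audio Device Management

```python
import time

from src import AudioProcessor, AudioConfig

# Create the audio processor
config = AudioConfig()
processor = AudioProcessor(config)

# Initialize
if processor.initialize():
    # List the devices
    devices = processor.get_device_list()
    for device in devices:
        print(f"Device {device['index']}: {device['name']}")

    # Start recording
    def audio_callback(data):
        print(f"Received audio data: {len(data)} samples")

    processor.start_recording(audio_callback)

    # Record for a few seconds; this assumes start_recording() returns
    # immediately and delivers audio through the callback
    time.sleep(5)

    # Stop recording
    processor.stop_recording()

    # Clean up
    processor.cleanup()
```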

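## Error Handling

### Exception Types

The system may raise the following exceptions:

- `FileNotFoundError`: a model file does not exist
- `RuntimeError`: audio device initialization failed
- `ValueError`: a configuration parameter is invalid
- `ImportError`: a required dependency is missing

### Error Handling Example

```python
from src import RealTimeVTT

app = None
try:
    app = RealTimeVTT()
    if not app.initialize():
        raise RuntimeError("application initialization failed")
    app.run_interactive()
except FileNotFoundError as e:
    print(f"File not found: {e}")
except RuntimeError as e:
    print(f"Runtime error: {e}")
except KeyboardInterrupt:
    print("Interrupted by user")
finally:
    # app is still None if the constructor itself raised
    if app is not None:
        app.cleanup()
```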

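## Performance Considerations

### Memory Usage
- Loading the model takes roughly 200-500 MB of memory
- Audio buffers take roughly 10-50 MB of memory
- At least 4 GB of system memory is recommended

### CPU Usage
- Recognition runs mainly on the CPU
- A multi-core CPU is recommended
- The number of threads can be adjusted with the `num_threads` parameter

### Latency Tuning
- The `chunk_size` parameter affects latency
- A smaller `chunk_size` lowers latency but raises CPU usage
- Recommended values: 1024-4096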
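
The trade-off can be quantified with simple arithmetic, assuming `chunk_size` is measured in samples per channel (the source does not state the unit). At the default 16000 Hz sample rate, a chunk of 1024 samples spans 64 ms and a chunk of 4096 samples spans 256 ms, which sets a lower bound on how long audio waits before it reaches the recognizer. `chunk_latency_ms` is a hypothetical helper.

```python
def chunk_latency_ms(chunk_size: int, sample_rate: int = 16000) -> float:
    """Duration of one audio chunk in milliseconds."""
    return 1000.0 * chunk_size / sample_rate


for size in (512, 1024, 2048, 4096):
    per_second = 16000 / size  # how often the audio callback fires
    print(f"{size:>5} samples: {chunk_latency_ms(size):6.1f} ms per chunk, "
          f"{per_second:5.2f} callbacks per second")
```

Halving the chunk size halves the buffering delay but doubles the number of callbacks per second, which is where the extra CPU cost comes from.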

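## Extending the System

### Adding a New Recognition Model

1. Add the model information to `ModelDownloader.MODELS`
2. Update the model file mapping
3. Test the model for compatibility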
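
Before adding an entry, it can help to check it against the fields that `list_available_models()` returns. The sketch assumes `ModelDownloader.MODELS` uses that same four-field shape, which the source does not confirm; `check_model_entry` and the example entries are hypothetical.

```python
from urllib.parse import urlparse

REQUIRED_KEYS = ("name", "description", "size", "url")


def check_model_entry(entry: dict) -> list:
    """Return a list of problems found in a model entry; empty means it looks fine."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if "url" in entry and urlparse(entry["url"]).scheme not in ("http", "https"):
        problems.append("url must use http or https")
    return problems


good_entry = {
    "name": "Example model",
    "description": "illustrative entry",
    "size": "120MB",
    "url": "https://example.com/example-model.tar.bz2",
}
bad_entry = {"name": "Broken", "description": "no size", "url": "ftp://example.com/m"}

print(check_model_entry(good_entry))  # []
print(check_model_entry(bad_entry))   # ['missing key: size', 'url must use http or https']
```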

### Adding Support for a New Audio Format

1. Modify the `AudioConfig` class
2. Update the initialization logic of `AudioProcessor`
3. Add the format conversion code
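
As an example of step 3, the sketch below converts 16-bit little-endian PCM bytes into floats in the range [-1.0, 1.0). It assumes the recognizer expects normalized float samples, which the source does not state, and it uses only the standard library; a real implementation would more likely use numpy. `pcm16_to_float` is a hypothetical helper.

```python
import struct
from typing import List


def pcm16_to_float(raw: bytes) -> List[float]:
    """Convert little-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    count = len(raw) // 2  # a trailing odd byte, if any, is ignored
    samples = struct.unpack(f"<{count}h", raw[:count * 2])
    return [sample / 32768.0 for sample in samples]


raw = struct.pack("<4h", 0, 16384, -32768, 32767)
converted = pcm16_to_float(raw)
print(converted[:3])  # [0.0, 0.5, -1.0]
```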

### Adding a New Output Format

1. Create a new output handler class
2. Integrate it into `RealTimeVTT`
3. Add the corresponding configuration options
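
Step 1 might look like the following sketch, which writes each result as one JSON object per line. `JsonLinesWriter` is a hypothetical class; it relies only on the dictionary produced by `RecognitionResult.to_dict()`, and how it would be wired into `RealTimeVTT` is left open.

```python
import io
import json


class JsonLinesWriter:
    """Writes recognition results to a text stream, one JSON object per line."""

    def __init__(self, stream):
        self.stream = stream

    def write_result(self, result_dict: dict) -> None:
        # ensure_ascii=False keeps non-ASCII transcripts readable in the file.
        self.stream.write(json.dumps(result_dict, ensure_ascii=False) + "\n")
        self.stream.flush()


buffer = io.StringIO()  # a real handler would open the configured output file
writer = JsonLinesWriter(buffer)
writer.write_result({"text": "你好", "timestamp": 1.5, "is_final": True, "confidence": 0.9})
writer.write_result({"text": "hello", "timestamp": 3.0, "is_final": True, "confidence": 0.8})
print(buffer.getvalue(), end="")
```

In practice the handler's `write_result` could be called from the final-result callback with `result.to_dict()`.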

---

This API documentation covers the system's main interfaces and how to use them. For more detail, refer to the comments in the source code.