
Summary

This PR adds comprehensive performance metrics tracking to MLX Whisper, enabling users to benchmark models, identify bottlenecks, and compare performance with other implementations.

Motivation

Currently, MLX Whisper provides transcription results but lacks visibility into performance characteristics. Users need to:

  • Benchmark different model sizes for their use case
  • Understand where time is spent (model loading, preprocessing, inference)
  • Compare performance with other Whisper implementations (e.g., whisper.cpp)
  • Calculate Real-Time Factor (RTF) to determine if models meet real-time requirements
  • Track inference throughput for optimization

Changes Made

1. Enhanced DecodingResult dataclass (mlx_whisper/decoding.py)

  • Added num_inference_steps: int = 0 field to track total decoder forward passes
  • This metric is comparable to whisper.cpp's "runs" metric
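
A minimal sketch of the dataclass change (only the num_inference_steps line reflects this PR; the other fields are abbreviated stand-ins for the real DecodingResult):

from dataclasses import dataclass, field
from typing import List

@dataclass
class DecodingResult:
    tokens: List[int] = field(default_factory=list)  # existing field (abbreviated)
    text: str = ""                                   # existing field (abbreviated)
    num_inference_steps: int = 0  # new: counts decoder forward passes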

2. Modified _main_loop() method (mlx_whisper/decoding.py)

  • Added inference step counter that increments on each decoder forward pass
  • Returns step count alongside other decoding results
  • Minimal performance overhead (simple integer increment)
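
A self-contained sketch of the counting pattern (a toy decode loop, not the actual _main_loop internals):

def greedy_decode(step_fn, initial_tokens, max_steps=448):
    tokens = list(initial_tokens)
    num_inference_steps = 0
    for _ in range(max_steps):
        next_token, completed = step_fn(tokens)  # stands in for one decoder forward pass
        num_inference_steps += 1                 # the only added cost: one integer increment
        tokens.append(next_token)
        if completed:
            break
    return tokens, num_inference_steps  # step count returned alongside the tokens

# Toy step function: emits increasing tokens and signals completion at length 5.
toks, steps = greedy_decode(lambda t: (len(t), len(t) >= 5), [0])
print(steps)  # 5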

3. Added timing metrics (mlx_whisper/transcribe.py)

  • Model load time: Time to load model from disk/cache
  • Mel spectrogram time: Audio preprocessing duration
  • Inference time: Pure model inference duration (per-segment and total)
  • Total time: End-to-end processing time
  • RTF (Real-Time Factor): Ratio of processing time to audio duration
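
A hedged sketch of the timing pattern (the timed helper and sleep-based stage placeholders are illustrative, not the actual transcribe.py code):

import time

def timed(fn, *args, **kwargs):
    # Run fn and return (result, elapsed seconds).
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

audio_duration = 20.03                # seconds, matching the sample output below
_, load_s = timed(time.sleep, 0.05)   # placeholder for model loading
_, mel_s = timed(time.sleep, 0.01)    # placeholder for mel spectrogram computation
_, infer_s = timed(time.sleep, 0.02)  # placeholder for decoding
total_s = load_s + mel_s + infer_s    # end-to-end processing time
print(f"RTF: {total_s / audio_duration:.3f}")  # RTF < 1.0 means faster than real-time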

4. Added comprehensive output display (mlx_whisper/transcribe.py)

When --verbose True is set, the following summary is displayed:

================================================================================
BENCHMARK METRICS
================================================================================
Model load time: 386.68 ms
Mel spectrogram time: 62.27 ms
Inference time: 242.75 ms
Total time: 691.70 ms
Audio duration: 20.03 s
RTF (Real-Time Factor): 0.035

Total output tokens: 75
Total inference steps (decoder forward passes): 77
Average output tokens/sec: 308.95
Average inference steps/sec: 317.19
Number of segments: 1

Per-segment details:
Seg#   Out    Steps   Time(s)    Out/s      Steps/s
--------------------------------------------------------------------------------
0      75     77      0.243      308.95     317.19

NOTE:
- RTF < 1.0 means faster than real-time
- 'Output tokens': Final text tokens (excluding special tokens)
- 'Inference steps': Total decoder forward passes (comparable to whisper.cpp 'runs')
================================================================================

Key Metrics Explained

  • Output Tokens: Final transcription tokens only (excludes special tokens like SOT, language, timestamps)
  • Inference Steps: Total decoder forward passes including all tokens (comparable to whisper.cpp "runs")
  • RTF: Real-Time Factor - values < 1.0 indicate faster-than-real-time processing
  • Tokens/sec: Generation rate of final text tokens (useful output)
  • Steps/sec: Total inference throughput
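
As a quick sanity check, the headline numbers can be recomputed from the raw values in the sample output above:

out_tokens, steps = 75, 77
infer_s, total_s, audio_s = 0.24275, 0.69170, 20.03  # from the sample output

print(f"RTF:        {total_s / audio_s:.3f}")     # 0.035
print(f"Tokens/sec: {out_tokens / infer_s:.0f}")  # 309
print(f"Steps/sec:  {steps / infer_s:.0f}")       # 317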

Benefits

  1. Performance Benchmarking: Easily compare different model sizes
  2. Bottleneck Identification: See where time is spent in the pipeline
  3. Cross-Implementation Comparison: Compare with whisper.cpp and other implementations
  4. Real-Time Capability Assessment: RTF metric shows if model meets real-time requirements
  5. Optimization Guidance: Identify which components to optimize

Testing

Tested on Apple Silicon (M3 Pro) across a range of model sizes:

Model      Load (ms)  Mel (ms)  Inference (ms)  Total (ms)  RTF    Tokens/sec
--------------------------------------------------------------------------------
tiny       446        89        167             702         0.035  420
base.en    387        62        243             692         0.035  309
medium     995        64        1177            2236        0.112  60
large-v3   1420       144       2282            3846        0.192  36

All models achieve real-time performance (RTF < 1.0) on Apple Silicon.

Usage Examples

Basic usage (unchanged):

mlx_whisper audio.wav

With detailed metrics:

mlx_whisper audio.wav --verbose True

Benchmark different models:

for model in tiny base.en medium large-v3; do
    echo "Testing $model..."
    mlx_whisper audio.wav --model mlx-community/whisper-$model-mlx --verbose True
done

Files Changed

  • mlx_whisper/decoding.py - Added inference step tracking
  • mlx_whisper/transcribe.py - Added timing metrics and benchmark output

Performance Impact

  • Minimal overhead: Only adds timestamp recording and integer increments
  • Measured overhead: < 0.1% in testing
  • Negligible impact when verbose=False: metrics are still calculated, but the display is skipped
