
Summary

This PR adds comprehensive performance metrics tracking to MLX Whisper, enabling users to benchmark models, identify bottlenecks, and compare performance with other implementations.

Motivation

Currently, MLX Whisper provides transcription results but lacks visibility into performance characteristics. Users need to:

  • Benchmark different model sizes for their use case
  • Understand where time is spent (model loading, preprocessing, inference)
  • Compare performance with other Whisper implementations (e.g., whisper.cpp)
  • Calculate Real-Time Factor (RTF) to determine if models meet real-time requirements
  • Track inference throughput for optimization

Changes Made

1. Enhanced DecodingResult dataclass (mlx_whisper/decoding.py)

  • Added num_inference_steps: int = 0 field to track total decoder forward passes
  • This metric is comparable to whisper.cpp's "runs" metric
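
A minimal sketch of the dataclass change (only the num_inference_steps line reflects this PR; the other fields are abbreviated stand-ins for the real DecodingResult):

from dataclasses import dataclass, field
from typing import List

@dataclass
class DecodingResult:
    tokens: List[int] = field(default_factory=list)  # existing field (abbreviated)
    text: str = ""                                   # existing field (abbreviated)
    num_inference_steps: int = 0  # new: counts decoder forward passes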

2. Modified _main_loop() method (mlx_whisper/decoding.py)

  • Added inference step counter that increments on each decoder forward pass
  • Returns step count alongside other decoding results
  • Minimal performance overhead (simple integer increment)
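
A self-contained sketch of the counting pattern (a toy decode loop, not the actual _main_loop internals):

def greedy_decode(step_fn, initial_tokens, max_steps=448):
    tokens = list(initial_tokens)
    num_inference_steps = 0
    for _ in range(max_steps):
        next_token, completed = step_fn(tokens)  # stands in for one decoder forward pass
        num_inference_steps += 1                 # the only added cost: one integer increment
        tokens.append(next_token)
        if completed:
            break
    return tokens, num_inference_steps  # step count returned alongside the tokens

# Toy step function: emits increasing tokens and signals completion at length 5.
toks, steps = greedy_decode(lambda t: (len(t), len(t) >= 5), [0])
print(steps)  # 5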

3. Added timing metrics (mlx_whisper/transcribe.py)

  • Model load time: Time to load model from disk/cache
  • Mel spectrogram time: Audio preprocessing duration
  • Inference time: Pure model inference duration (per-segment and total)
  • Total time: End-to-end processing time
  • RTF (Real-Time Factor): Ratio of processing time to audio duration
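
A hedged sketch of the timing pattern (the timed helper and sleep-based stage placeholders are illustrative, not the actual transcribe.py code):

import time

def timed(fn, *args, **kwargs):
    # Run fn and return (result, elapsed seconds).
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

audio_duration = 20.03                # seconds, matching the sample output below
_, load_s = timed(time.sleep, 0.05)   # placeholder for model loading
_, mel_s = timed(time.sleep, 0.01)    # placeholder for mel spectrogram computation
_, infer_s = timed(time.sleep, 0.02)  # placeholder for decoding
total_s = load_s + mel_s + infer_s    # end-to-end processing time
print(f"RTF: {total_s / audio_duration:.3f}")  # RTF < 1.0 means faster than real-time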

4. Added comprehensive output display (mlx_whisper/transcribe.py)

When --verbose True is set, the following summary is displayed:

================================================================================
BENCHMARK METRICS
================================================================================
Model load time: 386.68 ms
Mel spectrogram time: 62.27 ms
Inference time: 242.75 ms
Total time: 691.70 ms
Audio duration: 20.03 s
RTF (Real-Time Factor): 0.035

Total output tokens: 75
Total inference steps (decoder forward passes): 77
Average output tokens/sec: 308.95
Average inference steps/sec: 317.19
Number of segments: 1

Per-segment details:
Seg#   Out    Steps   Time(s)    Out/s      Steps/s
--------------------------------------------------------------------------------
0      75     77      0.243      308.95     317.19

NOTE:
- RTF < 1.0 means faster than real-time
- 'Output tokens': Final text tokens (excluding special tokens)
- 'Inference steps': Total decoder forward passes (comparable to whisper.cpp 'runs')
================================================================================

Key Metrics Explained

  • Output Tokens: Final transcription tokens only (excludes special tokens like SOT, language, timestamps)
  • Inference Steps: Total decoder forward passes including all tokens (comparable to whisper.cpp "runs")
  • RTF: Real-Time Factor - values < 1.0 indicate faster-than-real-time processing
  • Tokens/sec: Generation rate of final text tokens (useful output)
  • Steps/sec: Total inference throughput
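
As a quick sanity check, the headline numbers can be recomputed from the raw values in the sample output above:

out_tokens, steps = 75, 77
infer_s, total_s, audio_s = 0.24275, 0.69170, 20.03  # from the sample output

print(f"RTF:        {total_s / audio_s:.3f}")     # 0.035
print(f"Tokens/sec: {out_tokens / infer_s:.0f}")  # 309
print(f"Steps/sec:  {steps / infer_s:.0f}")       # 317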

Benefits

  1. Performance Benchmarking: Easily compare different model sizes
  2. Bottleneck Identification: See where time is spent in the pipeline
  3. Cross-Implementation Comparison: Compare with whisper.cpp and other implementations
  4. Real-Time Capability Assessment: RTF metric shows if model meets real-time requirements
  5. Optimization Guidance: Identify which components to optimize

Testing

Tested on Apple Silicon (M3 Pro) across a range of model sizes:

Model      Load (ms)  Mel (ms)  Inference (ms)  Total (ms)  RTF    Tokens/sec
--------------------------------------------------------------------------------
tiny       446        89        167             702         0.035  420
base.en    387        62        243             692         0.035  309
medium     995        64        1177            2236        0.112  60
large-v3   1420       144       2282            3846        0.192  36

All models achieve real-time performance (RTF < 1.0) on Apple Silicon.

Usage Examples

Basic usage (unchanged):

mlx_whisper audio.wav

With detailed metrics:

mlx_whisper audio.wav --verbose True

Benchmark different models:

for model in tiny base.en medium large-v3; do
    echo "Testing $model..."
    mlx_whisper audio.wav --model mlx-community/whisper-$model-mlx --verbose True
done

Files Changed

  • mlx_whisper/decoding.py - Added inference step tracking
  • mlx_whisper/transcribe.py - Added timing metrics and benchmark output

Performance Impact

  • Minimal overhead: Only adds timestamp recording and integer increments
  • Measured overhead: < 0.1% in testing
  • Negligible impact when verbose=False: metrics are still calculated, but the display is skipped
