@sjmonson
Collaborator

Summary

Sets continuous_usage_stats to get token usage on incomplete requests. If usage is still unavailable, falls back to the iteration count.

Details

In v0.3.0 and earlier, the number of iterations was used as a proxy for the output token count of incomplete requests that did not return usage metrics. In v0.4.0 this behavior was removed, which led to large discrepancies in output token count, depending on the percentage of the benchmark that consisted of incomplete requests.

This PR restores the original behavior of falling back to the number of iterations. Additionally, it sets the continuous_usage_stats flag to enable usage metrics on every iteration, when available.
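As a rough illustration of the behavior described above (not the code in this PR), the sketch below shows the two pieces together. The helper names `build_stream_options` and `resolve_output_tokens` are hypothetical, and the `completion_tokens` key is an assumption based on the OpenAI-style usage payload; servers that do not support continuous_usage_stats are assumed to simply omit per-chunk usage.

```python
from typing import Optional


def build_stream_options() -> dict:
    """Hypothetical helper: ask the server for usage stats on every streamed chunk."""
    return {
        "include_usage": True,
        # Assumption: only honored by servers that support it; others omit usage.
        "continuous_usage_stats": True,
    }


def resolve_output_tokens(last_usage: Optional[dict], iterations: int) -> int:
    """Prefer the last usage payload seen; otherwise fall back to iteration count."""
    if last_usage is not None:
        reported = last_usage.get("completion_tokens")
        if reported is not None:
            return reported
    # No usage was ever reported (e.g. request cancelled early): use the
    # number of streamed iterations as a proxy for output tokens.
    return iterations
```

With this shape, a request cancelled after 17 streamed chunks and no usage payload would be counted as 17 output tokens, while a request whose last chunk reported usage would use the reported value.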

Test Plan

  • Run a long-generation, high-concurrency benchmark using a max-seconds constraint. For incomplete requests, check that output_tokens is greater than 0 for at least some of them.
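
The check in the test plan could be scripted along these lines. This is a minimal sketch that assumes the benchmark results have already been loaded into a list of dicts; the `status` key and its `"incomplete"` value are hypothetical stand-ins for whatever the real report format uses.

```python
from typing import List


def incomplete_with_tokens(requests: List[dict]) -> List[dict]:
    """Return incomplete requests that still report a positive output_tokens."""
    return [
        r
        for r in requests
        # Hypothetical field names; adapt to the actual report schema.
        if r.get("status") == "incomplete" and (r.get("output_tokens") or 0) > 0
    ]


# Illustrative records only, not real benchmark output.
sample = [
    {"status": "completed", "output_tokens": 256},
    {"status": "incomplete", "output_tokens": 31},
    {"status": "incomplete", "output_tokens": None},
]
assert len(incomplete_with_tokens(sample)) > 0
```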

Related Issues


  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Signed-off-by: Samuel Monson <smonson@redhat.com>
Development

Successfully merging this pull request may close these issues.

Up to 9% decrease in output tokens per second between v0.3 and v0.4