Investigate/token-doubling-issue #19
Merged
This fixes the issue where token counts appeared to nearly double at each call in the REPL display. The root cause was that the Gemini API returns cumulative token counts for multi-turn conversations, but the display was showing these cumulative values as if they were per-request metrics.

Changes:
- Modified the SessionTokens struct to track previous totals for each component (prompt, response, cached, thoughts, tool_use)
- Updated RecordMetrics to calculate per-request deltas by subtracting previous totals from the current API response values
- Each TokenMetrics now represents only the cost of that single request (input + output + cache), not cumulative session totals
- Updated tests to expect correct per-request delta calculations

Now when making multiple requests in a session:
- The first request shows accurate token usage
- The second request shows ONLY the tokens consumed by that request
- Session totals accumulate correctly from per-request deltas
- No more doubling or misleading token metrics
Modified the event display system to use the per-request token metrics that have already been calculated with deltas, instead of creating new metrics from raw API values.

Changes:
- Added a GetLastMetric() method to SessionTokens to retrieve the most recently recorded metric with correct per-request delta calculations
- Updated PrintEventEnhanced() in event.go to use GetLastMetric() instead of reconstructing metrics from UsageMetadata

This ensures that when displaying token metrics in the spinner:
- Metrics show only the tokens consumed by the current request
- Cached tokens are properly accounted for
- No misleading cumulative values are displayed
- Each event shows accurate per-request token consumption
- Add TestFormatSessionSummary_WithThinkingTokens to verify thinking tokens in session summaries
- Add TestFormatGlobalSummary to test global summary formatting
- All 11 tracking tests passing
- Verifies thinking tokens are properly formatted and displayed in all metrics contexts
Display improvements:
- The inline session summary now shows 'Session: 28K actual | 26K cached (92%) | 2K response' instead of raw totals, making cache efficiency immediately visible

The session summary now displays:
- 💰 Cost Metrics: actual vs. cached tokens with a cost-savings estimate
- 🔧 Token Breakdown: detailed component breakdown (input, output, thinking, tool use)
- 📈 Session Efficiency: cache hit rate %, avg tokens/request, session duration

Key metrics that matter:
- 'Actual Tokens' = new prompt + response (what you pay for)
- 'Cached Tokens' = tokens reused from the prompt cache (90% cost savings)
- 'Cache Hit Rate %' = percentage of processed tokens saved via caching
- 'Cost Savings' = rough estimate of token savings from cache reuse

This helps users understand cache efficiency and actual API costs at a glance.
…ementation

- METRICS_DISPLAY_GUIDE.md: user-friendly reference for understanding token metrics
  - Explains actual vs. cached tokens
  - Cache hit rate interpretation
  - Cost-savings calculations
  - Optimization tips
- Updated metrics formatter (RenderTokenMetrics):
  - Shows the 'Session: 28K actual | 26K cached (92%) | 2K response' format
  - Calculates the cache-efficiency percentage
  - Uses compact notation for readability
  - New formatCompactNumber() helper for K/M abbreviations
- Updated session summary (FormatSessionSummary):
  - Three-section layout: Cost Metrics, Token Breakdown, Session Efficiency
  - Shows actual tokens (prompt + response) separately
  - Displays the cache hit rate percentage
  - Shows estimated cost savings from caching
  - Better labels and structure for understanding
…indicators

Inline session summary:
- Changed 'actual' → 'cost' (clearer what you're paying for)
- Changed 'response' → 'out' (shorter, clear output indicator)
- Added cache-efficiency indicators (excellent, ✅ good, ⚠️ modest, ❌ minimal)
- Format: 'Session: cost:21K | cached:20K (48% good) | out:442'

Session summary details:
- 'Actual Tokens' → 'New Tokens' (explicit about new vs. cached)
- 'Cached Tokens' → 'Cache Reuse' (describes what cached tokens represent)
- 'Saved Cost' → 'Cost Savings' (more action-oriented)
- 'Total Proc' → 'API Billing' (clarifies the billing metric)

Benefits:
- Users immediately see whether the cache is working (via the emoji indicator)
- Terminology clearly distinguishes cost vs. reuse vs. billing
- Visual feedback on cache effectiveness: 80%+ = excellent, 50%+ = good, 20%+ = modest, <20% = minimal
- Easier to scan and understand at a glance
- No manual percentage interpretation needed
Changed 'out:X' → 'response:X' in the inline session summary.

Why this improves clarity:
- 'out' was ambiguous and unclear
- 'response' explicitly means AI output tokens
- Matches the terminology in the session summary breakdown
- Clear unit: tokens (always abbreviated as numbers)
- Consistent with the 'cost', 'cached', 'response' terminology

Example:
- Before: Session: cost:21K | cached:20K (48% good) | out:186
- After: Session: cost:21K | cached:20K (48% good) | response:186

All values are in TOKENS (abbreviated as K for thousands).
Changed 'response:X' → 'token:X' in the inline session summary.

Why 'token' is better:
- Explicit unit: token (not ambiguous like 'response')
- All three metrics now clearly use the same unit: cost, cached, token
- No confusion with bytes/octets
- Simple and direct, a perfect match for the API concept

Display now: Session: cost:21K | cached:20K (48% ⚠️ modest) | token:512

All values are clearly in TOKENS:
- cost:X = new tokens you paid for
- cached:X = tokens reused from the cache
- token:X = output tokens generated
- K = thousands (21K = 21,000 tokens)