Is cached_content_token_count supposed to only count "full" cache hits? #1896

@sgdantas

Description

I was reading over implicit context caching here https://docs.cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview, and my understanding is that implicit caching works with partial hits as long as the prefix is fixed. However, I was only able to get a non-zero cached_content_token_count when the requests were exactly the same.
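My mental model of the prefix matching, as a plain-Python sketch (the helper and the FIXED_PREFIX constant are mine for illustration, not the SDK's types):

```python
# Illustrative only: implicit caching matches on an identical leading prefix
# of the request, so the stable content should come first and the varying
# text last. FIXED_PREFIX stands in for the image part in the snippet below.
FIXED_PREFIX = ["<image part: a-man-and-a-dog.png>"]

def build_contents(variable_text: str) -> list:
    # Same prefix every call; only the tail differs.
    return FIXED_PREFIX + [variable_text]

r1 = build_contents("Describe this image with three words.")
r2 = build_contents("What is this image about?")

# The shared prefix is everything before the first differing element.
shared = 0
for a, b in zip(r1, r2):
    if a != b:
        break
    shared += 1
```

Under this model, `shared` would cover the whole image part, which is why I expected a partial hit to show up in the usage metadata.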

Using slightly different code than the notebook in the docs gives me (at least judging by the usage metadata) no cache hits at all:

from google.genai import Client, types

GCP_PROJECT = "my-project"  # placeholder: your Vertex AI project ID

def main():
    client = Client(
        vertexai=True,
        project=GCP_PROJECT,
        location="us-central1",
    )
    MODEL_ID = "gemini-2.5-flash"
    NUM_ATTEMPTS = 3
    texts = [
        "Write a short and engaging blog post based on this image.",
        "Describe this image with three words.",
        "What is this image about?",
    ]
    for i in range(NUM_ATTEMPTS):
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=[
                types.Part.from_uri(
                    file_uri="https://storage.googleapis.com/cloud-samples-data/generative-ai/image/a-man-and-a-dog.png",
                    mime_type="image/png",
                ),
                texts[i],
            ],
        )

        # cached_content_token_count is None when nothing was served from cache
        cached_token_count = response.usage_metadata.cached_content_token_count or 0

        print(f"#{i + 1} Attempt")
        print(f"Input tokens: {response.usage_metadata.prompt_token_count}")
        print(f"Cached tokens: {cached_token_count}")
        print(f"Output tokens: {response.usage_metadata.candidates_token_count}")
        print(f"Total tokens: {response.usage_metadata.total_token_count}")
        print()

        if cached_token_count > 0:
            print(response.usage_metadata.cache_tokens_details)

if __name__ == "__main__":
    main()

This results in:

#1 Attempt
Input tokens: 2334
Cached tokens: 0
Output tokens: 316
Total tokens: 4012

#2 Attempt
Input tokens: 2329
Cached tokens: 0
Output tokens: 6
Total tokens: 3208

#3 Attempt
Input tokens: 2328
Cached tokens: 0
Output tokens: 259
Total tokens: 3527

I was expecting at least the image tokens to be cached.
Are partial cache hits reflected in cached_content_token_count?
Also, is there a difference between caching system instructions (passed via the config parameter) and caching contents?
The way we've been working is to define the fixed instructions in a types.GenerateContentConfig and pass the variable text as types.Part(text=text).
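To make the comparison concrete, here is roughly how the two request shapes differ, as plain dicts mirroring the REST generateContent payload (the field names are my assumption from the public API reference, not code we run):

```python
fixed_instructions = "You are a concise image-captioning assistant."
user_text = "What is this image about?"

# Variant A (our current setup): fixed instructions live outside contents,
# as the request-level systemInstruction (set via GenerateContentConfig).
request_a = {
    "systemInstruction": {"parts": [{"text": fixed_instructions}]},
    "contents": [{"role": "user", "parts": [{"text": user_text}]}],
}

# Variant B: the same fixed instructions are simply the first part of
# contents, followed by the variable text.
request_b = {
    "contents": [{"role": "user", "parts": [
        {"text": fixed_instructions},
        {"text": user_text},
    ]}],
}
```

The question is whether the prefix matching for implicit caching treats variant A's systemInstruction the same way it treats variant B's leading content part.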

Metadata

Labels

priority: p3 (Desirable enhancement or fix. May not be included in next release.)
type: question (Request for information or clarification. Not an issue.)
