Conversation

@chandra-siri
Collaborator

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

product-auto-label bot added labels on Dec 27, 2025: size: l (Pull request size is large.), api: storage (Issues related to the googleapis/python-storage API.)
@gemini-code-assist
Contributor

Summary of Changes

Hello @chandra-siri, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a collection of new Python scripts and a pytest benchmark designed to thoroughly evaluate and measure the performance of asynchronous operations with Google Cloud Storage. The primary focus is on benchmarking parallel uploads and downloads, leveraging Python's asyncio for concurrent I/O and multiprocessing for CPU-bound parallelism, thereby providing comprehensive tools to assess throughput and latency across various scenarios.

Highlights

  • Asynchronous Download Benchmarks: Introduced async_tasks_downloade_mp.py, async_tasks_downloader.py, and parallel_downloader.py to measure the performance of parallel asynchronous downloads from Google Cloud Storage, utilizing both asyncio and multiprocessing.
  • Asynchronous Upload Benchmarks: Added parallel_uploader.py to benchmark parallel asynchronous uploads of data to Google Cloud Storage.
  • Pytest Read Benchmark: Integrated test_reads.py as a formal pytest-benchmark test to assess the throughput of single-object asynchronous downloads with configurable chunk sizes.
  • Utility Scripts: Included random_number.py for generating secure 64-bit random integers and test_asyncio_gc.py for exploring asyncio task garbage collection behavior.
  • move_blob Test: A new script move_blob_test.py was added to demonstrate and test the move_blob functionality for Google Cloud Storage objects.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback on your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

product-auto-label bot added the size: xl (Pull request size is extra large.) label and removed the size: l (Pull request size is large.) label on Dec 27, 2025
Contributor

gemini-code-assist bot left a comment

Code Review

This pull request adds a collection of benchmarking and testing scripts. My review focuses on improving code quality, fixing bugs, and increasing efficiency. Key findings include a critical bug in a time calculation, several instances of inefficient client creation within loops, and various opportunities for code cleanup such as removing unused imports and commented-out code. I've also suggested improvements for better script structure and more informative benchmark reporting.

I am having trouble creating individual review comments, so my feedback is collected below.

parallel_downloader.py (60)

critical

There is a critical bug in the time calculation due to operator precedence. start_time / 10**9 is calculated before the subtraction. You need to wrap end_time - start_time in parentheses to get the correct duration.

        f"\nFinished all download attempts for {num_objects} objects: took - {(end_time - start_time) / 10**9}s"

parallel_downloader.py (16)

high

Creating a new AsyncGrpcClient for each object download is inefficient as it involves setting up a new connection each time. The client should be created once per worker process and reused for all downloads within that process. Consider creating the client in download_worker and passing it to this function.
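
A minimal sketch of the suggested restructuring, assuming asyncio is already imported and that AsyncGrpcClient and download_object_async are the client class and per-object coroutine the script already uses; constructor arguments and method signatures are placeholders for whatever parallel_downloader.py currently does:

    def download_worker(bucket_name, object_names):
        async def run():
            client = AsyncGrpcClient()  # created once per worker process
            await asyncio.gather(
                *(
                    download_object_async(client, bucket_name, name)
                    for name in object_names
                )
            )
        asyncio.run(run())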

threaded_downloader.py (14)

high

Creating a new AsyncGrpcClient for each download is inefficient. Since each download_worker runs in its own thread and creates its own event loop, the client should be created once per thread in download_worker and passed to download_object_async.
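
The same idea in the threaded layout, sketched under the assumption that each download_worker already drives its own event loop via asyncio.run (or equivalent):

    def download_worker(object_names):
        async def run():
            client = AsyncGrpcClient()  # one client per thread and event loop
            for name in object_names:
                await download_object_async(client, name)
        asyncio.run(run())  # each thread owns its own loop, so also its own client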

tests/perf/microbenchmarks/test_reads.py (66)

high

The output_buffer is initialized here and then immediately re-initialized on line 69. This first initialization is redundant and can be removed.

parallel_uploader.py (17)

high

Creating a new AsyncGrpcClient for each object upload is inefficient. This creates a new connection for every task. The client should be instantiated once per worker process in upload_worker and passed to this function to be reused.
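
This mirrors the download sketch above, hedged the same way, with upload_object_async standing in for the script's existing per-object upload coroutine:

    def upload_worker(bucket_name, payloads):
        async def run():
            client = AsyncGrpcClient()  # instantiated once per worker process
            await asyncio.gather(
                *(upload_object_async(client, bucket_name, p) for p in payloads)
            )
        asyncio.run(run())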

parallel_downloader.py (23)

medium

The object size is hardcoded here. It would be better to define it as a constant at the top of the file (e.g., OBJECT_SIZE = 100 * 1024 * 1024) for better readability and maintainability.

parallel_uploader.py (31)

medium

The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration.

async_tasks_downloader.py (65)

medium

There is a typo in FINSHED. It should be FINISHED.

        f"FINISHED: total bytes downloaded - {num_objects*OBJECT_SIZE} in time {(end_time - start_time) / (10**9)}s"

move_blob_test.py (3-9)

medium

The script's logic is executed at the module level. It's a best practice to encapsulate the main logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere and improves reusability and testability.

def main():
    gcs = storage.Client()
    bucket = gcs.bucket("chandrasiri-us-west1-hns-soft-del")
    # print(bucket.name)
    blob = bucket.blob("test/blob.csv")
    blob.upload_from_string("")
    print("Uploaded blob:", blob.name)
    bucket.move_blob(blob, new_name="test/blob2.csv")

if __name__ == "__main__":
    main()

move_blob_test.py (4)

medium

The bucket name is hardcoded. For better portability and reusability, consider defining it as a constant at the top of the file or passing it as a command-line argument.
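
One possible shape for the command-line variant, using only the standard library; the default below is the bucket name currently hardcoded in the script:

    import argparse

    parser = argparse.ArgumentParser(description="Exercise bucket.move_blob")
    parser.add_argument(
        "--bucket",
        default="chandrasiri-us-west1-hns-soft-del",
        help="GCS bucket to run the move_blob test against",
    )
    args = parser.parse_args()
    bucket = storage.Client().bucket(args.bucket)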

parallel_downloader.py (3)

medium

The os module is imported but never used. It should be removed.

async_tasks_downloader.py (4)

medium

The ThreadPoolExecutor is imported but not used in this file. It should be removed to keep the code clean.

async_tasks_downloade_mp.py (4)

medium

The ThreadPoolExecutor is imported but never used in this file. It's good practice to remove unused imports to keep the code clean.

parallel_downloader.py (34)

medium

The bucket name is hardcoded. Consider defining this as a constant at the top of the file to make it easier to change.

async_tasks_downloade_mp.py (83-84)

medium

There's a typo in 'Throuput'. It should be 'Throughput'.

Additionally, the throughput calculation uses 10**-6 to convert from bytes to megabytes, which assumes 1 MB = 1,000,000 bytes. However, OBJECT_SIZE is defined using binary prefixes (1024*1024). For consistency, you should use (1024*1024) for the conversion to MiB/s.

        f"Throughput: {num_object*OBJECT_SIZE /((end_time_proc - start_time_proc) / (10**9))/(1024*1024)} MiB/s"

async_tasks_downloade_mp.py (76)

medium

The results variable is assigned but its value is never used. It should be removed to avoid confusion.

        pool.starmap(async_runner, args)

async_tasks_downloader.py (36)

medium

This docstring is empty. Please add a description of what the function does.

    """Creates and runs asyncio tasks to download a range of objects."""

random_number.py (12-20)

medium

The script's logic is executed at the module level. It's a best practice to encapsulate this logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere.

def main():
    # Generate 1000 unique IDs
    # A set is the easiest way to guarantee uniqueness in the batch.
    request_ids = set()
    while len(request_ids) < 1000:
        request_ids.add(generate_random_64bit_int())

    # You can convert it to a list if needed
    id_list = list(request_ids)

    print(f"Generated {len(id_list)} unique 64-bit IDs.")
    print("First 5 IDs:", id_list[:5])

if __name__ == "__main__":
    main()
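
For context, a hypothetical generate_random_64bit_int consistent with the "secure 64-bit random integers" description in the PR summary; the actual implementation in random_number.py may differ:

    import secrets

    def generate_random_64bit_int() -> int:
        """Return a cryptographically secure random integer in [0, 2**64)."""
        return secrets.randbits(64)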

tests/perf/microbenchmarks/test_reads.py (16-21)

medium

These modules (math, google.api_core.exceptions, google.cloud.storage.blob.Blob) are imported but not used in the file. They should be removed to keep the code clean.

tests/perf/microbenchmarks/test_reads.py (42)

medium

The comment indicates the object size is 1 GiB, but 100 * (1024 ** 2) is 100 MiB. The comment should be corrected to avoid confusion.

OBJECT_SIZE = 100 * (1024 ** 2)  # 100 MiB

async_tasks_downloade_mp.py (71-73)

medium

These lines appear to be leftover debugging code. They should be removed before merging to keep the codebase clean.

tests/perf/microbenchmarks/test_reads.py (102-109)

medium

This docstring appears to be a list of implementation notes rather than a description of the function's purpose. It should be updated to be a proper docstring that explains what my_setup does, its parameters, and what it returns.

tests/perf/microbenchmarks/test_reads.py (210)

medium

The benchmark summary prints 'N/A' for standard deviation. pytest-benchmark calculates this value and it's available in benchmark.stats['stddev']. It would be more informative to include this statistic in the report. Note that the other metrics are in MiB/s, so you may need to convert the time-based standard deviation or adjust the table header.
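
A sketch of one way to surface this, assuming the standard pytest-benchmark fixture; depending on the installed version the value may be reachable as benchmark.stats["stddev"] or benchmark.stats.stats.stddev, and the conversion below is a first-order approximation so the column lines up with the MiB/s metrics:

    mean_s = benchmark.stats["mean"]      # seconds per round
    stddev_s = benchmark.stats["stddev"]  # seconds
    mean_mibps = OBJECT_SIZE / mean_s / (1024 * 1024)
    stddev_mibps = mean_mibps * (stddev_s / mean_s)  # approximate spread in MiB/s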

tests/perf/microbenchmarks/test_reads.py (219-234)

medium

This large block of commented-out code should be removed to improve readability and reduce clutter.

async_tasks_downloade_mp.py (38)

medium

This docstring is empty. Please add a meaningful description of what the download_objects_pool function does, its parameters, and what it returns, or remove the docstring if it's not needed.

    """Downloads a pool of objects asynchronously within a single process."""

threaded_downloader.py (32)

medium

The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration and maintenance.
