Mitigate deadlock on DLL unload #16416
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16416
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: If your PR is affected, please view it below.
❌ 2 New Failures, 1 Unrelated Failure as of commit 2d43467 with merge base daf93a1.
NEW FAILURES: The following jobs have failed:
UNSTABLE: The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@felixweilbach has exported this pull request. If you are a Meta employee, you can view the originating Diff in D89889628.
Summary: ThreadPool is stored in a static variable at extension/threadpool/threadpool.cpp:146, which means its destructor runs when the process exits or when a DLL containing this code is unloaded.

While working with ExecuTorch, I experienced a deadlock when unloading our DLL (which contained ExecuTorch) at runtime. It was caused by the pthreadpool_destroy function (pthreadpool/src/windows.c:366) waiting forever on the worker threads. Why exactly this happens is unclear to me; it is likely a race condition inside the Windows Parallel Loader (https://blogs.blackberry.com/en/2017/10/windows-10-parallel-loading-breakdown), since its functions appeared in the stack traces of the stuck worker threads after they had returned from their main function.

On my side, the issue was mitigated by calling `executorch::extension::threadpool::get_threadpool()->_unsafe_reset_threadpool(0);` before unloading the DLL. This is just a workaround; a proper fix would be to rework the ThreadPool singleton and allow for explicit termination of it.

Differential Revision: D89889628
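A minimal sketch of the workaround described above, assuming the host application loads ExecuTorch via LoadLibrary and that the DLL can export a small shutdown hook. The assumed header path, the exported symbol name `et_shutdown_threadpool`, and the host-side code are all illustrative; only the `_unsafe_reset_threadpool(0)` call is taken from this PR.

```cpp
// DLL side (inside the library that links ExecuTorch).
// Drains the pthreadpool worker threads while the loader lock is NOT held,
// instead of letting the static ThreadPool destructor try to join them
// during DLL_PROCESS_DETACH.
#include <executorch/extension/threadpool/threadpool.h> // assumed header path

extern "C" __declspec(dllexport) void et_shutdown_threadpool() {
  executorch::extension::threadpool::get_threadpool()
      ->_unsafe_reset_threadpool(0); // workaround call from this PR
}
```

On the host side, the hook has to run before FreeLibrary so that no worker threads remain by the time the DLL actually unloads:

```cpp
// Host side (hypothetical usage).
#include <windows.h>

void unload_executorch_dll(HMODULE module) {
  auto shutdown = reinterpret_cast<void (*)()>(
      GetProcAddress(module, "et_shutdown_threadpool"));
  if (shutdown != nullptr) {
    shutdown(); // join the thread pool outside the loader lock
  }
  FreeLibrary(module); // DLL_PROCESS_DETACH now finds an empty pool
}
```

The point of exporting a hook is that the join runs from normal application code; doing the same join from DllMain would itself execute under the loader lock and could reproduce the deadlock.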
Force-pushed from 29e330c to 2d43467
kimishpatel left a comment
Review automatically exported from Phabricator review in Meta.