Conversation

@rlplays commented Oct 18, 2025

To use this: in your `config.ini`, set `backend=Multithreading` instead of `Multiprocessing` under `[vec]`. For envs that are CPU deep (i.e. envs that do a lot of computation per `c_step`) but not GPU wide (i.e. not too many params), this offers a nice speedup, anywhere from 1.1x to 3x.

For envs that spend very few CPU cycles in `c_step`, this may not provide a good speedup and may in fact be slower (see the results below). You can also cap the number of threads, e.g. `max_num_threads=4` under `[vec]`, if your env steps are very short and you still want to take advantage of native multi-threading.
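
A minimal sketch of the resulting `[vec]` section, using the option names quoted above (exact spacing and comment syntax depend on how your `config.ini` is parsed):

```ini
[vec]
# Native multi-threading in a single Python process
backend = Multithreading
# Optional cap for envs with very short c_step; per the notes below,
# a small value (or 0, forcing Serial mode) can beat using all cores
max_num_threads = 4
```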

Performance Comparison: Multithreading vs. Multiprocessing

| Environment | SPS before | SPS after (MT only) | SPS after (TODO: torch pin) | Notes |
| --- | --- | --- | --- | --- |
| rlplays | 44K | 112K | - | my pixel platformer env |
| go | 520K | 690K | - | |
| pacman | 800K | 890K | - | |
| drone_swarm | 1.1M | 940K | - | (*) |
| enduro | 720K | 520K | - | (*) |
| terraform | 370K | 380K | - | (**) |

Notes:

- (*) For some envs, Multiprocessing or even Serial (a single worker stepping through the envs one by one) is better than sharding across too many threads. I verified this matches the perf when limiting `max_num_threads` to a small number (or even 0 to force Serial mode) rather than spreading across all cores.
- (**) GPU bound: most of the time is spent in copying/learning operations.



```c
// TODO(perumaal): Should this be multi-thread aware as well? (see vec_step below).
// Main issue is that srand is not thread-safe. But do we care?
```

@rlplays (Author) replied:

It feels like we should have a `puffer_rnd` or something like that so it's TLS-aware. But envs that use `rnd` a lot may be complex. I'm not sure whether it matters, though.
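
For concreteness, a minimal sketch of what a TLS-aware RNG could look like in C. The `puffer_rnd`/`puffer_srnd` names, the xorshift64* generator, and the seeding scheme are all assumptions for illustration, not what this PR implements:

```c
// Hypothetical TLS-aware RNG sketch: each thread owns its state, so there is
// no shared rand()/srand() state to race on. Names and algorithm are assumed.
#include <stdint.h>

static _Thread_local uint64_t puffer_rnd_state = 0x9E3779B97F4A7C15ULL;

// Seed the calling thread's generator, e.g. with base_seed + env index.
void puffer_srnd(uint64_t seed) {
    puffer_rnd_state = seed ? seed : 1;  // xorshift must not start at zero
}

// xorshift64*: one 64-bit state word per thread, no locks needed.
uint64_t puffer_rnd(void) {
    uint64_t x = puffer_rnd_state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    puffer_rnd_state = x;
    return x * 0x2545F4914F6CDD1DULL;
}
```

Because the state is `_Thread_local`, each worker thread advances an independent stream, which sidesteps the `srand` thread-safety issue raised above.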

@rlplays changed the title from "This introduces native multi-threading in a single Python process using a new backend Multithreading." to "Native multi-threading backend" on Oct 18, 2025