Skip to content

Conversation

@eugenevinitsky
Copy link

@eugenevinitsky eugenevinitsky commented Jan 3, 2026

Summary

  • Adds map_sources config option for training with maps from multiple directories
  • Format: "path1:weight1,path2:weight2,..."
  • Creates temp directory with symlinks to maps sampled according to weights

Example Usage

puffer train puffer_drive --env.map-sources "resources/drive/binaries/training:0.75,resources/drive/binaries/carla:0.25"

Output:

================================================================================
MIXED MAP SOURCES
  resources/drive/binaries/training: 75 maps (75.0%)
  resources/drive/binaries/carla: 25 maps (25.0%)
Total: 100 maps in /tmp/pufferdrive_maps_xxx
================================================================================

Test plan

  • Tested with 75% waymo + 25% carla maps
  • Training runs successfully with mixed sources
  • Training runs normally when we don't use mixed sources

🤖 Generated together with Claude Code

Allows training with maps from multiple sources with configurable weights.
Format: "path1:weight1,path2:weight2,..."

Example:
  --env.map-sources "resources/drive/binaries/training:0.75,resources/drive/binaries/carla:0.25"

This creates a temp directory with symlinks to maps from each source,
sampled according to the specified weights.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@greptile-apps
Copy link

greptile-apps bot commented Jan 3, 2026

Greptile Summary

This PR adds a map_sources configuration option that enables training with maps from multiple directories using weighted sampling. The implementation creates a temporary directory with symlinks to maps sampled according to specified weights (e.g., "path1:0.75,path2:0.25").

Key Changes:

  • New create_mixed_map_directory() function handles parsing, weight normalization, sampling (with or without replacement), and symlink creation
  • Updated Drive.__init__() to support the new map_sources parameter that overrides map_dir
  • Added comprehensive test suite covering various scenarios including edge cases
  • Updated configuration and documentation

Issues Identified:

  • Previous review comments about temp directory cleanup, self.map_dir reassignment bug during resampling, and global random state pollution remain valid concerns
  • The create_mixed_map_directory() function lacks a seed parameter, which could make map selection non-deterministic across runs despite the Drive class accepting a seed parameter

Confidence Score: 3/5

  • This PR has known issues from previous review that need resolution before merging
  • The feature implementation is solid with good test coverage, but three logic issues identified in previous reviews remain unaddressed: temp directory cleanup, map_dir reassignment bug during resampling, and global random state pollution. Additionally, the lack of seed parameter in map selection could affect reproducibility
  • Pay close attention to pufferlib/ocean/drive/drive.py - particularly the temp directory lifecycle, self.map_dir usage during resampling (lines 324-326), and random state management

Important Files Changed

Filename Overview
pufferlib/ocean/drive/drive.py Added map_sources feature to mix maps from multiple directories with weighted sampling; includes temp directory creation with symlinks
tests/test_map_sources.py Comprehensive test suite for map_sources feature covering symlink creation, sampling, and error cases
docs/src/data.md Updated error message in documentation from specific path to generic .bin not found message
pufferlib/config/ocean/drive.ini Added map_sources config option with documentation and example usage

Sequence Diagram

sequenceDiagram
    participant User
    participant Drive.__init__
    participant create_mixed_map_directory
    participant Filesystem
    participant binding.shared
    
    User->>Drive.__init__: Initialize with map_sources param
    alt map_sources provided
        Drive.__init__->>create_mixed_map_directory: Parse map_sources_str, num_maps
        create_mixed_map_directory->>create_mixed_map_directory: Parse and normalize weights
        create_mixed_map_directory->>create_mixed_map_directory: Calculate map counts per source
        create_mixed_map_directory->>Filesystem: Create temp directory
        loop For each source
            create_mixed_map_directory->>Filesystem: List .bin files
            create_mixed_map_directory->>create_mixed_map_directory: Sample maps (with/without replacement)
        end
        create_mixed_map_directory->>create_mixed_map_directory: Shuffle selected maps
        loop For each selected map
            create_mixed_map_directory->>Filesystem: Create symlink in temp dir
        end
        create_mixed_map_directory->>Filesystem: Register atexit cleanup handler
        create_mixed_map_directory-->>Drive.__init__: Return temp_dir path
        Drive.__init__->>Drive.__init__: Update self.map_dir to temp_dir
    else map_dir provided
        Drive.__init__->>Filesystem: Check map_dir exists
        Drive.__init__->>Filesystem: Count available .bin files
        Drive.__init__->>Drive.__init__: Validate num_maps vs available
    end
    Drive.__init__->>binding.shared: Initialize with map_dir, num_maps
    binding.shared-->>Drive.__init__: Return agent_offsets, map_ids, num_envs
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

map_counts.append(count)

# Create temp directory
temp_dir = tempfile.mkdtemp(prefix="pufferdrive_maps_")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Temp directory created but never cleaned up - will accumulate in /tmp over multiple runs. Consider using tempfile.TemporaryDirectory() context manager or add cleanup in the close() method

Prompt To Fix With AI
This is a comment left during a code review.
Path: pufferlib/ocean/drive/drive.py
Line: 54:54

Comment:
**logic:** Temp directory created but never cleaned up - will accumulate in `/tmp` over multiple runs. Consider using `tempfile.TemporaryDirectory()` context manager or add cleanup in the `close()` method

How can I resolve this? If you propose a fix, please make it concise.

Path to temp directory containing symlinked maps
"""
if seed is not None:
random.seed(seed)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Using module-level random.seed() affects global random state - may cause unexpected behavior if other code uses random module. Use random.Random(seed) instance instead

rng = random.Random(seed) if seed is not None else random.Random()
# Then use rng.choices() and rng.shuffle() instead
Prompt To Fix With AI
This is a comment left during a code review.
Path: pufferlib/ocean/drive/drive.py
Line: 27:27

Comment:
**logic:** Using module-level `random.seed()` affects global random state - may cause unexpected behavior if other code uses `random` module. Use `random.Random(seed)` instance instead

```python
rng = random.Random(seed) if seed is not None else random.Random()
# Then use rng.choices() and rng.shuffle() instead
```

How can I resolve this? If you propose a fix, please make it concise.

@eugenevinitsky eugenevinitsky changed the title Add map_sources option for mixing maps from multiple directories [WIP] Add map_sources option for mixing maps from multiple directories Jan 3, 2026
@eugenevinitsky
Copy link
Author

@greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. tests/test_map_sources.py, line 94-113 (link)

    style: Test is non-deterministic due to lack of seeding in create_mixed_map_directory(). The assertion on line 112-113 could fail if random sampling produces different distributions. Consider either:

    1. Adding seed support to the function and seeding this test
    2. Using a tolerance-based check instead of exact equality

4 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +64 to +71
sampled = random.sample(available, k=count)
else:
sampled = random.choices(available, k=count)
for map_file in sampled:
selected_maps.append(os.path.join(path, map_file))

# Shuffle to mix sources
random.shuffle(selected_maps)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Missing seed parameter for reproducibility - random.sample(), random.choices(), and random.shuffle() use global random state. Different runs with same config will produce different map selections.

Consider adding a seed parameter to create_mixed_map_directory() and using random.Random(seed) instance:

def create_mixed_map_directory(map_sources_str, num_maps, seed=None):
    rng = random.Random(seed) if seed is not None else random.Random()
    # Then use rng.sample(), rng.choices(), rng.shuffle()

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: pufferlib/ocean/drive/drive.py
Line: 64:71

Comment:
**logic:** Missing seed parameter for reproducibility - `random.sample()`, `random.choices()`, and `random.shuffle()` use global random state. Different runs with same config will produce different map selections.

Consider adding a `seed` parameter to `create_mixed_map_directory()` and using `random.Random(seed)` instance:

```python
def create_mixed_map_directory(map_sources_str, num_maps, seed=None):
    rng = random.Random(seed) if seed is not None else random.Random()
    # Then use rng.sample(), rng.choices(), rng.shuffle()
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +212 to +213
map_dir = create_mixed_map_directory(map_sources, num_maps)
self.map_dir = map_dir # Update instance variable to point to temp directory
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Need to pass seed to create_mixed_map_directory() for reproducible map selection:

Suggested change
map_dir = create_mixed_map_directory(map_sources, num_maps)
self.map_dir = map_dir # Update instance variable to point to temp directory
map_dir = create_mixed_map_directory(map_sources, num_maps, seed)
self.map_dir = map_dir # Update instance variable to point to temp directory
Prompt To Fix With AI
This is a comment left during a code review.
Path: pufferlib/ocean/drive/drive.py
Line: 212:213

Comment:
**logic:** Need to pass seed to `create_mixed_map_directory()` for reproducible map selection:

```suggestion
            map_dir = create_mixed_map_directory(map_sources, num_maps, seed)
            self.map_dir = map_dir  # Update instance variable to point to temp directory
```

How can I resolve this? If you propose a fix, please make it concise.

@eugenevinitsky eugenevinitsky changed the title [WIP] Add map_sources option for mixing maps from multiple directories Add map_sources option for mixing maps from multiple directories Jan 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants