Add stacked dataset builder and P(county|CD) distributions #457

baogorek · 2025-12-09T17:01:57Z

Summary

- Add stacked_dataset_builder.py for creating CD-stacked H5 datasets from calibrated weights
- Add population-weighted P(county|CD) distributions computed from Census block data
- Add county_assignment.py module for assigning counties to households based on congressional district
- Add script to generate county-CD distributions from 119th Congress BEFs and 2020 Census population
- Add uv.lock to pin dependency versions and prevent CI failures from stale cached packages
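As a rough illustration of what "population-weighted P(county|CD)" means: for each congressional district, sum the block populations falling in each county and divide by the district's total block population. The sketch below assumes block records arrive as (cd, county, population) tuples; the function name, record layout, and sample values are hypothetical and are not taken from make_county_cd_distributions.py.

```python
from collections import defaultdict


def county_given_cd(blocks):
    """Return {cd: {county: share}} from (cd, county, population) records.

    Hypothetical helper: each record stands for one Census block that lies
    in exactly one county and one congressional district.
    """
    pop = defaultdict(lambda: defaultdict(float))
    for cd, county, population in blocks:
        pop[cd][county] += population

    dist = {}
    for cd, by_county in pop.items():
        total = sum(by_county.values())
        if total > 0:  # skip districts with no recorded population
            dist[cd] = {c: p / total for c, p in by_county.items()}
    return dist


# Made-up block populations, for illustration only.
blocks = [
    ("CD-A", "County 1", 300),
    ("CD-A", "County 1", 100),
    ("CD-A", "County 2", 600),
    ("CD-B", "County 2", 50),
]
dist = county_given_cd(blocks)
```

Each district's shares sum to one, so the result can be used directly as sampling weights when assigning counties to households.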

Key Features

- Stacked Dataset Builder: Creates H5 datasets with households replicated across congressional districts, using calibrated weights
- County Assignment: Assigns realistic county distributions to households based on their CD using Census block-level population data
- 436 CDs covered: All 435 voting districts plus DC at-large
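The stacking idea can be sketched in a few lines: every household appears once per district in which it carries weight, and each copy gets a fresh ID. This is a minimal sketch only. The function name, the dict-of-lists weight layout, the district keys, and the choice to drop zero-weight copies are all assumptions for illustration, not a description of stacked_dataset_builder.py.

```python
def stack_households(household_ids, weights_by_cd):
    """Replicate households across districts, one row per (CD, household).

    weights_by_cd maps a district key to a list of calibrated weights,
    aligned with household_ids. Zero-weight copies are dropped here,
    which is an assumption made to keep the stacked output small.
    """
    stacked = []
    next_id = 0
    for cd, weights in weights_by_cd.items():
        if len(weights) != len(household_ids):
            raise ValueError(f"weight vector for {cd} has the wrong length")
        for source_id, weight in zip(household_ids, weights):
            if weight > 0:
                stacked.append(
                    {
                        "household_id": next_id,  # new ID, unique across the stack
                        "source_household_id": source_id,
                        "cd": cd,
                        "weight": weight,
                    }
                )
                next_id += 1
    return stacked


# Hypothetical district keys and weights.
rows = stack_households(
    [101, 102, 103],
    {"district_a": [1.5, 0.0, 2.0], "district_b": [0.0, 4.0, 0.5]},
)
```

Keeping the source household ID alongside the new one makes it possible to trace any stacked record back to the original survey household.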

CI/CD Improvements

The self-hosted runner was caching old versions of policyengine-us, causing tests to fail with "Variable spm_unit_tenure_type does not exist" errors. This PR adds:

- uv.lock: Pins all dependency versions (similar to policyengine-us)
- uv sync --dev: Installs exact locked versions into a virtual environment
- uv run: Executes all commands within the virtual environment
- Lock freshness check: PR workflow verifies uv.lock is up-to-date

This ensures reproducible builds regardless of runner cache state.

Test Plan

- Unit tests for county assignment pass (test_county_assignment.py)
- NY-10 distribution verified: 55.6% Kings County, 44.4% New York County (matches Census)
- Manual testing of stacked dataset generation
- CI passes with new uv.lock workflow
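A check along the lines of the NY-10 item above might look like the following: draw counties for many households in one district from its P(county|CD) distribution and confirm the empirical shares land near the stated probabilities. The 55.6% / 44.4% split comes from the test plan; the function name, the "NY-10" key, the seed, and the sample size are assumptions, and the real county_assignment.py may work differently.

```python
import random

# Shares quoted in the test plan for NY-10.
NY10 = {"Kings County": 0.556, "New York County": 0.444}


def assign_counties(cds, distributions, seed=0):
    """Draw one county per household from its district's distribution."""
    rng = random.Random(seed)  # seeded so repeated runs give the same draws
    assigned = []
    for cd in cds:
        dist = distributions[cd]
        counties = list(dist)
        probs = [dist[c] for c in counties]
        assigned.append(rng.choices(counties, weights=probs, k=1)[0])
    return assigned


assigned = assign_counties(["NY-10"] * 20000, {"NY-10": NY10})
kings_share = assigned.count("Kings County") / len(assigned)
```

With 20,000 draws the sampling error on the share is well under one percentage point, so a tolerance of about two points is a comfortable bound for a test.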

🤖 Generated with Claude Code

Core components:
- sparse_matrix_builder.py: Database-driven approach for building calibration matrices
- calibration_utils.py: Shared utilities (cache clearing, constraints, geo helpers)
- matrix_tracer.py: Debugging utility for tracing through sparse matrices
- create_stratified_cps.py: Create stratified sample preserving high-income households
- test_sparse_matrix_builder.py: 6 verification tests for matrix correctness

Data pipeline changes:
- Add GEO_STACKING env var to cps.py and puf.py for geo-stacking data generation
- Add GEO_STACKING_MODE env var to extended_cps.py
- Add CPS_2024_Full, PUF_2023, ExtendedCPS_2023 classes
- Add policy_data.db download to prerequisites
- Add 'make data-geo' target for geo-stacking data pipeline

CI/CD:
- Add geo-stacking dataset build step to workflow
- Add sparse matrix builder test step after geo data generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Move sparse matrix tests to tests/test_local_area_calibration/
- Split large test file into focused modules (column indexing, same-state, cross-state, geo masking)
- Fix small_enhanced_cps.py enum encoding (decode_to_str before astype)
- Fix create_stratified_cps.py to use local storage instead of HuggingFace
- Remove CPS_2024_Full to keep PR minimal
- Revert ExtendedCPS_2024 to use CPS_2024

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

…tionality

- Rename GEO_STACKING to LOCAL_AREA_CALIBRATION in cps.py, puf.py, extended_cps.py
- Rename data-geo to data-local-area in Makefile and workflow
- Add create_target_groups function to calibration_utils.py
- Enhance MatrixTracer with get_group_rows method and variable_desc in row catalog
- Add TARGET GROUPS section to print_matrix_structure output
- Add local_area_calibration_setup.ipynb documentation notebook

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

…t builder

- Add make_county_cd_distributions.py to compute P(county|CD) from Census block data
- Add county_cd_distributions.csv with distributions for all 436 CDs
- Add county_assignment.py module for assigning counties to households
- Add stacked_dataset_builder.py for creating CD-stacked H5 datasets
- Add tests for county assignment functionality
- Update calibration_utils.py with state/CD mapping utilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

baogorek · 2025-12-09T17:05:53Z

Closes #458

- New GitHub Actions workflow (local_area_publish.yaml) that:
  - Triggers on local_area_calibration/ changes, repository_dispatch, or manual
  - Downloads calibration inputs from HF calibration/ folder
  - Builds 51 state + 436 district H5 files with checkpointing
  - Uploads to GCP and HF states/ and districts/ subdirectories
- New publish_local_area.py script with:
  - Per-state and per-district checkpointing for spot instance resilience
  - Immediate upload after each file is built
  - Support for --states-only, --districts-only, --skip-download flags
- Added upload_local_area_file() to data_upload.py for subdirectory uploads
- Added download_calibration_inputs() to huggingface.py
- Added publish-local-area Makefile target

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
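The per-state and per-district checkpointing described in this commit could follow a pattern like the one below: record each finished item in a small file right after it is built, and skip recorded items on restart so that a preempted spot instance resumes where it left off. This is a sketch under assumptions. The JSON checkpoint format, the function names, and the build_one callback are hypothetical and do not reflect the actual publish_local_area.py.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of items already recorded as built."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def mark_done(path, done, item):
    """Record an item, writing via a temp file so a crash cannot truncate it."""
    done.add(item)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def build_all(items, checkpoint_path, build_one):
    """Build every item not yet checkpointed; return the ones built this run."""
    done = load_done(checkpoint_path)
    built = []
    for item in items:
        if item in done:
            continue  # finished in an earlier, possibly interrupted, run
        build_one(item)  # in a real pipeline: build, then upload immediately
        mark_done(checkpoint_path, done, item)
        built.append(item)
    return built


# Simulate two runs against the same checkpoint file.
with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "completed.json")
    first_run = build_all(["AL", "AK"], ckpt, lambda state: None)
    second_run = build_all(["AL", "AK", "AZ"], ckpt, lambda state: None)
```

Marking an item done only after its build (and upload) succeeds means an interruption costs at most one file's worth of work.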

- download_private_prerequisites.py: Download from calibration/policy_data.db
- calibration_utils.py: Look for db in storage/calibration/
- conftest.py: Update test fixture path
- huggingface.py: Fix download_calibration_inputs to return correct paths

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Create a minimal 50-household H5 fixture with known values for stable testing of the stacked dataset builder without relying on sampled stratified CPS data. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Cast np.arange output to int32 to match column dtype. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…ication

- Add spm_unit_tenure_type mapping from SPM_TENMORTSTATUS in add_spm_variables
- Fix create_stratified_cps.py to use source sim's input_variables instead of empty sim
- Fix stacked_dataset_builder.py to use base_sim's input_variables instead of sparse_sim

The input_variables fix ensures variables like spm_unit_tenure_type are preserved when creating stratified/stacked datasets, since input_variables is only populated from variables that have actual data in the loaded dataset.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

…lder

- Add spm-calculator integration for SPM threshold calculation
- Replace random placeholder geoadj with real values from Census ACS rent data
- Add load_cd_geoadj_values() to compute geoadj from median 2BR rents
- Add calculate_spm_thresholds_for_cd() to calculate SPM thresholds per CD
- Add CD rent data CSV and fetch script (requires CENSUS_API_KEY)
- Update .gitignore to track rent CSV

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Add upload_local_area_batch_to_hf() to batch multiple files per commit
- Add skip_hf parameter to upload_local_area_file() for GCP-only uploads
- Modify publish_local_area.py to batch HF uploads (10 files per commit)
- Fix at-large district geoadj lookup (XX01 -> XX00 mapping for AK, DE, etc.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
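Grouping uploads into batches of 10 files per commit amounts to chunking the list of built files. The sketch below shows only that chunking step; the helper name and file paths are made up, and the actual upload call in upload_local_area_batch_to_hf() is not reproduced here.

```python
def batched(paths, batch_size=10):
    """Split a list of file paths into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [paths[i : i + batch_size] for i in range(0, len(paths), batch_size)]


# Hypothetical file names; each batch would become one commit.
files = [f"districts/file_{i}.h5" for i in range(23)]
batches = batched(files)
batch_sizes = [len(b) for b in batches]
```

The last batch simply carries the remainder, so no file is left waiting for a full group of 10.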

… to gitignore

Pseudo-inputs are variables with adds/subtracts that aggregate formula-based components. Saving their stale pre-computed values corrupts calculations when the dataset is reloaded.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
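One way to read the pseudo-input rule: a variable should not be saved if it is defined by adds/subtracts and at least one of its components is computed by a formula, because the stored total would go stale while the component is recomputed on reload. The sketch below encodes that reading with a hypothetical VariableInfo record; it does not use the real policyengine-core variable classes, and the variable names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class VariableInfo:
    """Hypothetical stand-in for a variable's metadata."""

    name: str
    adds: list = field(default_factory=list)
    subtracts: list = field(default_factory=list)
    has_formula: bool = False


def is_pseudo_input(var, registry):
    """True if the variable aggregates at least one formula-based component."""
    components = list(var.adds) + list(var.subtracts)
    return any(registry[c].has_formula for c in components if c in registry)


def variables_to_save(candidates, registry):
    """Keep only variables whose stored values cannot go stale."""
    return [n for n in candidates if not is_pseudo_input(registry[n], registry)]


# Invented example: total_income adds a plain input and a computed credit.
registry = {
    "wages": VariableInfo("wages"),
    "computed_credit": VariableInfo("computed_credit", has_formula=True),
    "total_income": VariableInfo("total_income", adds=["wages", "computed_credit"]),
}
saved = variables_to_save(["wages", "total_income"], registry)
```

Here total_income is excluded from the saved set, so on reload it is rebuilt from its components rather than read back from a stale stored value.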

…e_type

- Accept main's SPM threshold calculation using calculate_spm_thresholds_with_geoadj()
- Preserve branch's spm_unit_tenure_type variable for local area calibration
- Refactor calibration_utils.py to import TENURE_CODE_MAP from utils/spm.py
- Remove duplicate SPM_TENURE_CODE_TO_CALC definition

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Update notebook to use correct db path: storage/calibration/policy_data.db
- Add download as dependency of data target in Makefile

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Use os.path.dirname(__file__) instead of relative path so tests work regardless of working directory. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
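The pattern this commit describes, resolving a fixture relative to the test module rather than the current working directory, can be sketched as follows. The helper name and the fixture file name are hypothetical; in a real test module the first argument would be `__file__`.

```python
import os


def fixture_path(module_file, filename):
    """Resolve a fixture that sits next to the given test module."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(module_dir, filename)


# In a test module this would be: fixture_path(__file__, "test_fixture.h5")
example = fixture_path("/repo/tests/test_builder.py", "test_fixture.h5")
```

Because the path is anchored to the module's own location, the test finds its fixture whether pytest is launched from the repository root or from a subdirectory.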

- Add uv.lock file with all pinned dependencies
- Update all workflows to use `uv sync --dev` instead of pip install
- Add lock freshness check to PR workflow
- Narrow Python version to >=3.12 (required by microimpute)

This prevents stale cached packages on the self-hosted runner from causing test failures (e.g., missing spm_unit_tenure_type variable).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

uv sync creates a virtual environment, but commands were running with system Python which still had stale cached packages. All make/python/pytest commands now use `uv run` to execute within the virtual environment where the locked dependencies are installed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

baogorek · 2025-12-29T19:16:10Z

Hi @MaxGhenis, I finally got all green here after adding a uv.lock file. I was having a lot of trouble with the runner caching old versions of policyengine-us.

This was meant to be a small PR, but it has ballooned. If you were to look at one module, it would be the .h5 creation program, policyengine_us_data/datasets/cps/local_area_calibration/stacked_dataset_builder.py, but I realize that is now pushing 1k lines.

If you want an idea of the functionality added in this module specifically, a demonstration has been added to the last few cells of the Jupyter notebook in this PR.

baogorek and others added 8 commits December 5, 2025 11:22

Add changelog entry and format code

0400066

Clear notebook outputs for Myst compatibility

95f21ca

Pin mystmd>=1.7.0 to fix notebook rendering in docs

07852a1

Merge origin/main into district-h5

df7a1f1

baogorek requested a review from MaxGhenis December 9, 2025 17:06

baogorek and others added 6 commits December 10, 2025 09:21

documentation updates

5d6913a

Fix dtype warning in stacked_dataset_builder person ID assignment

5b7fbb8

Format test files with black

18d635a

baogorek force-pushed the district-h5 branch from c9c6fd8 to 18d635a Compare December 10, 2025 17:51

baogorek and others added 12 commits December 12, 2025 11:32

NYC workflow

af4904e

Add spm-calculator as a dependency

911ad10

Fix test fixture path to use absolute path

ac69e4f

Trigger CI after runner restart

0fd2c12
