@baogorek baogorek commented Dec 9, 2025

Summary

  • Add stacked_dataset_builder.py for creating CD-stacked H5 datasets from calibrated weights
  • Add population-weighted P(county|CD) distributions computed from Census block data
  • Add county_assignment.py module for assigning counties to households based on congressional district
  • Add script to generate county-CD distributions from 119th Congress BEFs and 2020 Census population
  • Add uv.lock to pin dependency versions and prevent CI failures from stale cached packages
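
As a rough illustration of the P(county|CD) computation listed above, here is a minimal sketch, assuming block-level rows that carry a CD, a county, and a population count. The column names (`cd_geoid`, `county_fips`, `population`) and the function `county_given_cd` are hypothetical and are not taken from the PR's script; the populations are made up.

```python
import pandas as pd


def county_given_cd(blocks: pd.DataFrame) -> pd.DataFrame:
    """Return P(county | CD) from block-level population counts."""
    # Sum block populations within each (CD, county) pair.
    pair_pop = blocks.groupby(
        ["cd_geoid", "county_fips"], as_index=False
    )["population"].sum()
    # Divide by the CD total so probabilities sum to 1 within each CD.
    cd_total = pair_pop.groupby("cd_geoid")["population"].transform("sum")
    pair_pop["probability"] = pair_pop["population"] / cd_total
    return pair_pop[["cd_geoid", "county_fips", "probability"]]


# Toy blocks with made-up populations (not Census figures).
blocks = pd.DataFrame(
    {
        "cd_geoid": ["3610", "3610", "3610", "0200"],
        "county_fips": ["36047", "36047", "36061", "02020"],
        "population": [300, 200, 500, 700],
    }
)
dist = county_given_cd(blocks)
```

A CD whose blocks all have zero population would yield NaN probabilities in this sketch, so a real script would need to handle that case explicitly.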

Key Features

  • Stacked Dataset Builder: Creates H5 datasets with households replicated across congressional districts, using calibrated weights
  • County Assignment: Assigns each household a county consistent with its CD, using population-weighted P(county|CD) distributions built from Census block-level population data
  • 436 CDs covered: All 435 voting districts plus DC at-large
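
One way the county assignment feature could work is to draw a county for each household from its CD's P(county|CD) distribution. The sketch below assumes that approach; the function `assign_counties`, the CD labels, and the dictionary layout are hypothetical and may differ from county_assignment.py. The NY-10 shares are the ones quoted in the test plan.

```python
import numpy as np


def assign_counties(cd_per_household, distributions, seed=0):
    """Draw one county per household from P(county | CD)."""
    rng = np.random.default_rng(seed)
    cds = np.asarray(cd_per_household)
    counties = np.empty(len(cds), dtype=object)
    for cd in np.unique(cds):
        # Split the CD's distribution into parallel name/probability tuples.
        names, probs = zip(*distributions[cd].items())
        mask = cds == cd
        # Sample independently for every household located in this CD.
        counties[mask] = rng.choice(names, size=mask.sum(), p=probs)
    return counties


# Hypothetical CD labels; NY-10 shares come from the PR's test plan.
distributions = {
    "NY-10": {"Kings County": 0.556, "New York County": 0.444},
    "DC-AL": {"District of Columbia": 1.0},
}
household_cds = ["NY-10"] * 10_000 + ["DC-AL"] * 5
counties = assign_counties(household_cds, distributions)
```

Fixing the seed keeps the assignment reproducible between runs, which matters when the resulting files are compared across builds.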

CI/CD Improvements

The self-hosted runner was caching old versions of policyengine-us, causing tests to fail with "Variable spm_unit_tenure_type does not exist" errors. This PR adds:

  • uv.lock: Pins all dependency versions (similar to policyengine-us)
  • uv sync --dev: Installs exact locked versions into a virtual environment
  • uv run: Executes all commands within the virtual environment
  • Lock freshness check: PR workflow verifies uv.lock is up-to-date

This ensures reproducible builds regardless of runner cache state.
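
The lock freshness check can be approximated in a few lines. This sketch assumes the check is done with `uv lock --check`, which exits non-zero when uv.lock is out of date; the PR text does not show the actual workflow command, and the helper `lock_is_fresh` is hypothetical.

```python
import subprocess

# Assumed command; the workflow in this PR may invoke uv differently.
LOCK_CHECK_COMMAND = ["uv", "lock", "--check"]


def lock_is_fresh(run=subprocess.run) -> bool:
    """Return True when the lockfile matches the project's dependencies."""
    # check=False so a stale lockfile is reported as False, not an exception.
    result = run(LOCK_CHECK_COMMAND, check=False)
    return result.returncode == 0
```

The `run` parameter is injectable only so the helper can be exercised without uv installed.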

Test Plan

  • Unit tests for county assignment pass (test_county_assignment.py)
  • NY-10 distribution verified: 55.6% Kings County, 44.4% New York County (matches Census)
  • Manual testing of stacked dataset generation
  • CI passes with new uv.lock workflow

🤖 Generated with Claude Code

baogorek and others added 8 commits December 5, 2025 11:22
Core components:
- sparse_matrix_builder.py: Database-driven approach for building calibration matrices
- calibration_utils.py: Shared utilities (cache clearing, constraints, geo helpers)
- matrix_tracer.py: Debugging utility for tracing through sparse matrices
- create_stratified_cps.py: Create stratified sample preserving high-income households
- test_sparse_matrix_builder.py: 6 verification tests for matrix correctness

Data pipeline changes:
- Add GEO_STACKING env var to cps.py and puf.py for geo-stacking data generation
- Add GEO_STACKING_MODE env var to extended_cps.py
- Add CPS_2024_Full, PUF_2023, ExtendedCPS_2023 classes
- Add policy_data.db download to prerequisites
- Add 'make data-geo' target for geo-stacking data pipeline

CI/CD:
- Add geo-stacking dataset build step to workflow
- Add sparse matrix builder test step after geo data generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move sparse matrix tests to tests/test_local_area_calibration/
- Split large test file into focused modules (column indexing, same-state,
  cross-state, geo masking)
- Fix small_enhanced_cps.py enum encoding (decode_to_str before astype)
- Fix create_stratified_cps.py to use local storage instead of HuggingFace
- Remove CPS_2024_Full to keep PR minimal
- Revert ExtendedCPS_2024 to use CPS_2024

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…tionality

- Rename GEO_STACKING to LOCAL_AREA_CALIBRATION in cps.py, puf.py, extended_cps.py
- Rename data-geo to data-local-area in Makefile and workflow
- Add create_target_groups function to calibration_utils.py
- Enhance MatrixTracer with get_group_rows method and variable_desc in row catalog
- Add TARGET GROUPS section to print_matrix_structure output
- Add local_area_calibration_setup.ipynb documentation notebook

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…t builder

- Add make_county_cd_distributions.py to compute P(county|CD) from Census block data
- Add county_cd_distributions.csv with distributions for all 436 CDs
- Add county_assignment.py module for assigning counties to households
- Add stacked_dataset_builder.py for creating CD-stacked H5 datasets
- Add tests for county assignment functionality
- Update calibration_utils.py with state/CD mapping utilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
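
To make the idea of a CD-stacked dataset from the commit above concrete, here is a minimal sketch. It assumes the calibrated weights form a (CDs x households) array in which a zero means the household is not used in that CD; the function `stack_households` and all column names are hypothetical, and the real builder writes H5 files rather than a DataFrame.

```python
import numpy as np
import pandas as pd


def stack_households(weights: np.ndarray, household_ids, cd_ids) -> pd.DataFrame:
    """Replicate households across CDs: one row per nonzero weight."""
    # Row index = CD, column index = household, for every nonzero weight.
    cd_idx, hh_idx = np.nonzero(weights)
    stacked = pd.DataFrame(
        {
            "cd": np.asarray(cd_ids)[cd_idx],
            "source_household_id": np.asarray(household_ids)[hh_idx],
            "household_weight": weights[cd_idx, hh_idx],
        }
    )
    # Issue fresh IDs because one source household can appear in several CDs.
    stacked.insert(0, "household_id", np.arange(len(stacked), dtype=np.int32))
    return stacked


# Toy weights: 2 CDs x 3 households (values are made up).
weights = np.array([[1.5, 0.0, 2.0], [0.0, 3.0, 0.5]])
stacked = stack_households(weights, [101, 102, 103], ["0601", "0602"])
```

In this toy case household 103 appears once in each CD, so it gets two distinct `household_id` values while the total weight is unchanged.
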
baogorek commented Dec 9, 2025

Closes #458

@baogorek baogorek requested a review from MaxGhenis December 9, 2025 17:06
baogorek and others added 6 commits December 10, 2025 09:21
- New GitHub Actions workflow (local_area_publish.yaml) that:
  - Triggers on local_area_calibration/ changes, repository_dispatch, or manual
  - Downloads calibration inputs from HF calibration/ folder
  - Builds 51 state + 436 district H5 files with checkpointing
  - Uploads to GCP and HF states/ and districts/ subdirectories

- New publish_local_area.py script with:
  - Per-state and per-district checkpointing for spot instance resilience
  - Immediate upload after each file is built
  - Support for --states-only, --districts-only, --skip-download flags

- Added upload_local_area_file() to data_upload.py for subdirectory uploads
- Added download_calibration_inputs() to huggingface.py
- Added publish-local-area Makefile target

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
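
The per-file checkpointing described in the commit above might look roughly like the following. This is a sketch under stated assumptions: the checkpoint is a JSON list of completed names, and `build_with_checkpoint` and `build_one` are hypothetical names rather than functions from publish_local_area.py.

```python
import json
from pathlib import Path


def build_with_checkpoint(names, build_one, checkpoint_path):
    """Build each item once, recording progress so a restart can resume."""
    path = Path(checkpoint_path)
    # Load the names finished by an earlier (possibly interrupted) run.
    done = set(json.loads(path.read_text())) if path.exists() else set()
    built_now = []
    for name in names:
        if name in done:
            continue  # already built and uploaded before the interruption
        build_one(name)  # build the file and upload it immediately
        done.add(name)
        # Persist after every item so a preempted instance loses at most one.
        path.write_text(json.dumps(sorted(done)))
        built_now.append(name)
    return built_now
```

Writing the checkpoint only after `build_one` returns means an item interrupted mid-build is simply retried on the next run.
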
- download_private_prerequisites.py: Download from calibration/policy_data.db
- calibration_utils.py: Look for db in storage/calibration/
- conftest.py: Update test fixture path
- huggingface.py: Fix download_calibration_inputs to return correct paths

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Create a minimal 50-household H5 fixture with known values for stable testing
of the stacked dataset builder without relying on sampled stratified CPS data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Cast np.arange output to int32 to match column dtype.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
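
For context on the int32 cast in the commit above: `np.arange` returns the platform's default integer type unless told otherwise, which is int64 on most 64-bit systems. The array names below are hypothetical and only illustrate the dtype behaviour.

```python
import numpy as np

# An existing ID column stored as int32 (values are made up).
existing_ids = np.array([10, 11, 12], dtype=np.int32)

# Default integer dtype; combining it with int32 can promote the result.
default_ids = np.arange(3)

# Either form yields int32 and so matches the existing column.
fixed_ids = np.arange(3, dtype=np.int32)
also_fixed = np.arange(3).astype(np.int32)

combined = np.concatenate([existing_ids, fixed_ids])
```
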
baogorek and others added 12 commits December 12, 2025 11:32
…ication

- Add spm_unit_tenure_type mapping from SPM_TENMORTSTATUS in add_spm_variables
- Fix create_stratified_cps.py to use source sim's input_variables instead of empty sim
- Fix stacked_dataset_builder.py to use base_sim's input_variables instead of sparse_sim

The input_variables fix ensures variables like spm_unit_tenure_type are preserved
when creating stratified/stacked datasets, since input_variables is only populated
from variables that have actual data in the loaded dataset.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…lder

- Add spm-calculator integration for SPM threshold calculation
- Replace random placeholder geoadj with real values from Census ACS rent data
- Add load_cd_geoadj_values() to compute geoadj from median 2BR rents
- Add calculate_spm_thresholds_for_cd() to calculate SPM thresholds per CD
- Add CD rent data CSV and fetch script (requires CENSUS_API_KEY)
- Update .gitignore to track rent CSV

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
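
The commit above does not show how geoadj is derived from the rent data. A common form for the SPM geographic adjustment scales only the housing portion of the threshold by the local-to-national rent ratio, and the sketch below assumes that form. The function name `geoadj_from_rent`, the `housing_share` parameter, and all numbers are illustrative assumptions, not values from load_cd_geoadj_values().

```python
def geoadj_from_rent(local_rent: float, national_rent: float,
                     housing_share: float) -> float:
    """Geographic adjustment: scale only the housing share by relative rent."""
    if national_rent <= 0:
        raise ValueError("national_rent must be positive")
    if not 0.0 <= housing_share <= 1.0:
        raise ValueError("housing_share must be between 0 and 1")
    rent_index = local_rent / national_rent
    # Housing portion moves with local rents; the remainder is unadjusted.
    return housing_share * rent_index + (1.0 - housing_share)


# Illustrative inputs only: a CD whose median 2BR rent is 50% above national.
example = geoadj_from_rent(local_rent=1500, national_rent=1000, housing_share=0.5)
```

With these inputs a CD at exactly the national median rent gets a factor of 1.0, so its thresholds are unchanged.
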
- Add upload_local_area_batch_to_hf() to batch multiple files per commit
- Add skip_hf parameter to upload_local_area_file() for GCP-only uploads
- Modify publish_local_area.py to batch HF uploads (10 files per commit)
- Fix at-large district geoadj lookup (XX01 -> XX00 mapping for AK, DE, etc.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
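
The at-large fix in the commit above can be pictured as a fallback lookup: when a district code ending in 01 is missing from the geoadj table, retry with the 00 code for the same state. This is a sketch of that idea only; `lookup_geoadj`, the string keys, and the values are hypothetical.

```python
def lookup_geoadj(cd_geoid: str, geoadj_by_cd: dict) -> float:
    """Look up geoadj for a CD, falling back from XX01 to XX00 for at-large."""
    if cd_geoid in geoadj_by_cd:
        return geoadj_by_cd[cd_geoid]
    if cd_geoid.endswith("01"):
        # Single-district states may be coded XX00 in the source table.
        at_large = cd_geoid[:-2] + "00"
        if at_large in geoadj_by_cd:
            return geoadj_by_cd[at_large]
    raise KeyError(f"No geoadj value for district {cd_geoid}")


# Made-up values; 02 and 10 are the state FIPS codes for AK and DE.
geoadj_by_cd = {"0200": 1.10, "1000": 0.98, "3610": 1.40}
alaska = lookup_geoadj("0201", geoadj_by_cd)
```

Raising on a true miss, rather than returning a default, keeps an unmapped district from silently receiving a neutral adjustment.
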
… to gitignore

Pseudo-inputs are variables with adds/subtracts that aggregate formula-based
components. Saving their stale pre-computed values corrupts calculations when
the dataset is reloaded.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
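
To illustrate the pseudo-input rule from the commit above, the sketch below filters out variables whose adds/subtracts lists reference formula-based components. `ToyVariable` is a hypothetical stand-in for the real variable metadata, and the variable names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVariable:
    """Hypothetical stand-in for a model variable's metadata."""
    name: str
    adds: list = field(default_factory=list)
    subtracts: list = field(default_factory=list)
    has_formula: bool = False


def is_pseudo_input(variable, variables) -> bool:
    """True if the variable aggregates at least one formula-based component."""
    components = list(variable.adds) + list(variable.subtracts)
    return any(variables[c].has_formula for c in components)


def variables_to_save(input_names, variables):
    """Drop pseudo-inputs so stale aggregates are not written to the dataset."""
    return [n for n in input_names if not is_pseudo_input(variables[n], variables)]


variables = {
    "wages": ToyVariable("wages"),
    "computed_benefit": ToyVariable("computed_benefit", has_formula=True),
    "total_income": ToyVariable("total_income", adds=["wages", "computed_benefit"]),
}
to_save = variables_to_save(["wages", "total_income"], variables)
```

Here `total_income` is dropped because one of its components is recomputed by formula when the dataset is reloaded.
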
…e_type

- Accept main's SPM threshold calculation using calculate_spm_thresholds_with_geoadj()
- Preserve branch's spm_unit_tenure_type variable for local area calibration
- Refactor calibration_utils.py to import TENURE_CODE_MAP from utils/spm.py
- Remove duplicate SPM_TENURE_CODE_TO_CALC definition

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update notebook to use correct db path: storage/calibration/policy_data.db
- Add download as dependency of data target in Makefile

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Use os.path.dirname(__file__) instead of relative path so tests
work regardless of working directory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
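
The path fix in the commit above follows a common pattern: resolve data files relative to the module's own location instead of the current working directory. A minimal sketch follows; the helper `fixture_path` and the file names are hypothetical.

```python
import os


def fixture_path(filename: str, module_file: str) -> str:
    """Build a path to a file that sits next to the given module."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(module_dir, filename)


# In a test module this would be called as fixture_path("data.h5", __file__).
example = fixture_path("data.h5", "/repo/tests/test_example.py")
```
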
- Add uv.lock file with all pinned dependencies
- Update all workflows to use `uv sync --dev` instead of pip install
- Add lock freshness check to PR workflow
- Narrow Python version to >=3.12 (required by microimpute)

This prevents stale cached packages on the self-hosted runner from
causing test failures (e.g., missing spm_unit_tenure_type variable).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
uv sync creates a virtual environment, but commands were running
with system Python which still had stale cached packages.

All make/python/pytest commands now use `uv run` to execute within
the virtual environment where the locked dependencies are installed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
baogorek commented Dec 29, 2025

Hi @MaxGhenis, I finally got all checks green here after adding a uv.lock file. I was having a lot of trouble with the runner caching old versions of policyengine-us.

This was meant to be a small PR, but it has ballooned. If you were to look at one module, it would be the .h5 creation program, policyengine_us_data/datasets/cps/local_area_calibration/stacked_dataset_builder.py, but I realize that is now pushing 1k lines.

If you want an idea of the functionality added in this module specifically, a demonstration has been added to the last few cells of the Jupyter notebook in this PR.
