Add stacked dataset builder and P(county|CD) distributions #457
Conversation
Core components:
- sparse_matrix_builder.py: database-driven approach for building calibration matrices
- calibration_utils.py: shared utilities (cache clearing, constraints, geo helpers)
- matrix_tracer.py: debugging utility for tracing through sparse matrices
- create_stratified_cps.py: create a stratified sample preserving high-income households
- test_sparse_matrix_builder.py: 6 verification tests for matrix correctness

Data pipeline changes:
- Add GEO_STACKING env var to cps.py and puf.py for geo-stacking data generation
- Add GEO_STACKING_MODE env var to extended_cps.py
- Add CPS_2024_Full, PUF_2023, ExtendedCPS_2023 classes
- Add policy_data.db download to prerequisites
- Add 'make data-geo' target for geo-stacking data pipeline

CI/CD:
- Add geo-stacking dataset build step to workflow
- Add sparse matrix builder test step after geo data generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move sparse matrix tests to tests/test_local_area_calibration/
- Split large test file into focused modules (column indexing, same-state, cross-state, geo masking)
- Fix small_enhanced_cps.py enum encoding (decode_to_str before astype)
- Fix create_stratified_cps.py to use local storage instead of HuggingFace
- Remove CPS_2024_Full to keep PR minimal
- Revert ExtendedCPS_2024 to use CPS_2024
…tionality
- Rename GEO_STACKING to LOCAL_AREA_CALIBRATION in cps.py, puf.py, extended_cps.py
- Rename data-geo to data-local-area in Makefile and workflow
- Add create_target_groups function to calibration_utils.py
- Enhance MatrixTracer with get_group_rows method and variable_desc in row catalog
- Add TARGET GROUPS section to print_matrix_structure output
- Add local_area_calibration_setup.ipynb documentation notebook
…t builder
- Add make_county_cd_distributions.py to compute P(county|CD) from Census block data
- Add county_cd_distributions.csv with distributions for all 436 CDs
- Add county_assignment.py module for assigning counties to households
- Add stacked_dataset_builder.py for creating CD-stacked H5 datasets
- Add tests for county assignment functionality
- Update calibration_utils.py with state/CD mapping utilities
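For context, assigning a county to each household given its congressional district amounts to sampling from the P(county|CD) distribution. A minimal sketch of that idea — the data layout, CD code `"0601"`, and county FIPS values here are illustrative, not the module's actual API:

```python
import numpy as np

def assign_counties(cds, dist, seed=0):
    """Sample a county FIPS code for each household given its CD.

    dist maps each CD code to (county_fips_list, probability_list),
    where the probabilities for a CD sum to 1, i.e. P(county | CD).
    """
    rng = np.random.default_rng(seed)
    out = []
    for cd in cds:
        fips, probs = dist[cd]
        out.append(rng.choice(fips, p=probs))
    return out

# Toy distribution: a hypothetical CD "0601" split across two counties.
dist = {"0601": (["06001", "06013"], [0.7, 0.3])}
print(assign_counties(["0601", "0601", "0601"], dist))
```

Sampling per household (rather than assigning every household the modal county) preserves the county mix within each district in expectation.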
Closes #458
- New GitHub Actions workflow (local_area_publish.yaml) that:
  - Triggers on local_area_calibration/ changes, repository_dispatch, or manual dispatch
  - Downloads calibration inputs from the HF calibration/ folder
  - Builds 51 state + 436 district H5 files with checkpointing
  - Uploads to GCP and to HF states/ and districts/ subdirectories
- New publish_local_area.py script with:
  - Per-state and per-district checkpointing for spot instance resilience
  - Immediate upload after each file is built
  - Support for --states-only, --districts-only, --skip-download flags
- Added upload_local_area_file() to data_upload.py for subdirectory uploads
- Added download_calibration_inputs() to huggingface.py
- Added publish-local-area Makefile target
- download_private_prerequisites.py: download from calibration/policy_data.db
- calibration_utils.py: look for the db in storage/calibration/
- conftest.py: update test fixture path
- huggingface.py: fix download_calibration_inputs to return correct paths
Create a minimal 50-household H5 fixture with known values for stable testing of the stacked dataset builder without relying on sampled stratified CPS data.
Cast np.arange output to int32 to match column dtype.
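The mismatch arises because `np.arange` defaults to the platform integer (typically int64 on 64-bit Linux/macOS), so writing its output into an int32 column fails or silently upcasts. A small illustration:

```python
import numpy as np

idx = np.arange(5)                    # dtype defaults to the platform integer (often int64)
idx32 = np.arange(5, dtype=np.int32)  # explicit dtype matches an int32 column
# equivalently: idx.astype(np.int32)
```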
…ication
- Add spm_unit_tenure_type mapping from SPM_TENMORTSTATUS in add_spm_variables
- Fix create_stratified_cps.py to use the source sim's input_variables instead of an empty sim
- Fix stacked_dataset_builder.py to use base_sim's input_variables instead of sparse_sim

The input_variables fix ensures variables like spm_unit_tenure_type are preserved when creating stratified/stacked datasets, since input_variables is only populated from variables that have actual data in the loaded dataset.
…lder
- Add spm-calculator integration for SPM threshold calculation
- Replace random placeholder geoadj with real values from Census ACS rent data
- Add load_cd_geoadj_values() to compute geoadj from median 2BR rents
- Add calculate_spm_thresholds_for_cd() to calculate SPM thresholds per CD
- Add CD rent data CSV and fetch script (requires CENSUS_API_KEY)
- Update .gitignore to track the rent CSV
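A hedged sketch of the geoadj idea: the SPM geographic adjustment scales thresholds by relative local housing costs, and the simplest form of that is a ratio of the CD's median two-bedroom rent to the national median. The ratio form and function name here are assumptions; `load_cd_geoadj_values()` may apply a more elaborate formula.

```python
def geoadj_from_rents(cd_median_2br_rent: float, national_median_2br_rent: float) -> float:
    """Geographic adjustment factor: local median 2BR rent relative to the national median.

    This is an illustrative simplification of how a geoadj value could be
    derived from Census ACS rent data.
    """
    return cd_median_2br_rent / national_median_2br_rent

# A CD with $1,800 median 2BR rent vs. a $1,500 national median -> geoadj 1.2
print(geoadj_from_rents(1800.0, 1500.0))
```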
- Add upload_local_area_batch_to_hf() to batch multiple files per commit
- Add skip_hf parameter to upload_local_area_file() for GCP-only uploads
- Modify publish_local_area.py to batch HF uploads (10 files per commit)
- Fix at-large district geoadj lookup (XX01 -> XX00 mapping for AK, DE, etc.)
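The at-large fix presumably normalizes district codes before the lookup, so `AK01` resolves to the `AK00` key used for at-large states. A hypothetical helper — the function name and the exact membership of the at-large set are assumptions (the source names only AK and DE explicitly):

```python
# States with a single at-large congressional district (assumed set).
AT_LARGE_STATES = {"AK", "DE", "ND", "SD", "VT", "WY"}

def normalize_cd(cd_code: str) -> str:
    """Map XX01 -> XX00 for at-large states so geoadj lookups find the right key."""
    state, num = cd_code[:2], cd_code[2:]
    if state in AT_LARGE_STATES and num == "01":
        return state + "00"
    return cd_code

print(normalize_cd("AK01"))  # AK00
print(normalize_cd("CA12"))  # CA12
```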
… to gitignore

Pseudo-inputs are variables with adds/subtracts that aggregate formula-based components. Saving their stale pre-computed values corrupts calculations when the dataset is reloaded.
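The filtering idea can be sketched as follows — the `Var` stand-in class and the attribute check are illustrative, not the actual PolicyEngine variable API:

```python
def is_pseudo_input(variable) -> bool:
    """A variable defined via adds/subtracts aggregates other variables, so its
    value should be recomputed on load, never persisted in the dataset."""
    return bool(getattr(variable, "adds", None) or getattr(variable, "subtracts", None))

class Var:  # minimal stand-in for a variable object
    def __init__(self, adds=None, subtracts=None):
        self.adds, self.subtracts = adds, subtracts

variables = {
    "employment_income": Var(),                       # true input: safe to save
    "income": Var(adds=["employment_income"]),        # pseudo-input: skip when saving
}
to_save = [name for name, v in variables.items() if not is_pseudo_input(v)]
print(to_save)  # ['employment_income']
```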
…e_type
- Accept main's SPM threshold calculation using calculate_spm_thresholds_with_geoadj()
- Preserve the branch's spm_unit_tenure_type variable for local area calibration
- Refactor calibration_utils.py to import TENURE_CODE_MAP from utils/spm.py
- Remove duplicate SPM_TENURE_CODE_TO_CALC definition
- Update notebook to use the correct db path: storage/calibration/policy_data.db
- Add download as a dependency of the data target in the Makefile
Use os.path.dirname(__file__) instead of a relative path so tests work regardless of working directory.
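The pattern looks like this — the `fixtures/mini_cps.h5` path is a hypothetical example, not the repo's actual fixture name:

```python
import os

# Resolve the fixture relative to this file rather than the process working
# directory, so pytest can be invoked from anywhere in the repo.
FIXTURE_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures")
FIXTURE_PATH = os.path.join(FIXTURE_DIR, "mini_cps.h5")
```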
- Add uv.lock file with all pinned dependencies
- Update all workflows to use `uv sync --dev` instead of pip install
- Add lock freshness check to the PR workflow
- Narrow Python version to >=3.12 (required by microimpute)

This prevents stale cached packages on the self-hosted runner from causing test failures (e.g., missing spm_unit_tenure_type variable).
uv sync creates a virtual environment, but commands were still running with the system Python, which had stale cached packages. All make/python/pytest commands now use `uv run` so they execute within the virtual environment where the locked dependencies are installed.
Hi @MaxGhenis, I finally got all green here after adding a uv.lock file; I was having a lot of trouble with the runner caching old versions of policyengine-us. This was meant to be a small PR, but it has ballooned. If you look at one module, make it the .h5 creation program. For an idea of the functionality added there, a demonstration has been added to the last few cells of the Jupyter notebook in this PR.
Summary
- stacked_dataset_builder.py for creating CD-stacked H5 datasets from calibrated weights
- county_assignment.py module for assigning counties to households based on congressional district
- uv.lock to pin dependency versions and prevent CI failures from stale cached packages

Key Features
CI/CD Improvements
The self-hosted runner was caching old versions of policyengine-us, causing tests to fail with "Variable spm_unit_tenure_type does not exist" errors. This PR adds:

- uv.lock: pins all dependency versions (similar to policyengine-us)
- uv sync --dev: installs exact locked versions into a virtual environment
- uv run: executes all commands within the virtual environment
- A CI check that uv.lock is up to date

This ensures reproducible builds regardless of runner cache state.
Test Plan
- Unit tests for county assignment (test_county_assignment.py)