Problem Description
When mapping ID variables (like household_id, spm_unit_id) between non-hierarchical entities using calculate(..., map_to=...), PolicyEngine Core inappropriately averages these values, producing nonsensical fractional IDs.
Minimal Reproducible Example
from policyengine_us import Microsimulation
import pandas as pd
sim = Microsimulation(dataset='hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5')
# Map household_id to tax_unit level
household_ids_tax_unit = sim.calculate('household_id', map_to='tax_unit')
# Check for fractional values (IDs should always be integers)
fractional_ids = household_ids_tax_unit.values[household_ids_tax_unit.values % 1 != 0]
print(f"Found {len(fractional_ids)} fractional household IDs")
print(f"Examples: {fractional_ids[:5]}")
# Output: [218. 153.5 153.5 172.5 172.5]
Root Cause
In policyengine_core/simulations/simulation.py, the map_result method handles mapping between non-hierarchical group entities (e.g., household → tax_unit) by:
- First mapping from source to person using how="mean" (averaging)
- Then mapping from person to target using how="sum" (summing)
This is mathematically inappropriate for ID fields, which are categorical identifiers, not numeric quantities that should be averaged or summed.
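The effect can be reproduced without PolicyEngine in a toy sketch. The arrays and the map_group_to_group helper below are hypothetical and only mimic the two steps described above. They assume the "mean" step spreads each household value evenly across its members, which is consistent with the halved values in the output above. This is not the library's code.

```python
import numpy as np

# Hypothetical toy population: 4 people, 2 households, 3 tax units.
person_household = np.array([0, 0, 1, 1])  # household index of each person
person_tax_unit = np.array([0, 1, 2, 2])   # tax unit index of each person
household_id = np.array([307.0, 218.0])    # one ID per household


def map_group_to_group(values, source_index, target_index):
    """Mimic a mean-then-sum mapping between two group entities."""
    # Step 1 ("mean"): divide each source value by its member count,
    # then project the per-member share onto persons.
    members = np.bincount(source_index)
    per_person = (values / members)[source_index]
    # Step 2 ("sum"): add up the person-level shares within each target group.
    return np.bincount(target_index, weights=per_person)


mapped = map_group_to_group(household_id, person_household, person_tax_unit)
print(mapped)
```

Household 307 is split across two single-person tax units, so each tax unit receives 153.5. Household 218 sits entirely inside one tax unit, so its ID survives the round trip intact.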
Impact
- Produces invalid ID values that break referential integrity
- Can cause silent bugs in downstream analysis
- Affects any code that relies on ID mapping between non-hierarchical entities
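A membership check against the known IDs may be a more reliable way to detect the problem than testing for fractional parts: an even-numbered ID split in half across two tax units would still be an integer, just the wrong one. The invalid_mapped_ids helper and the sample arrays below are hypothetical, shown only as a sketch.

```python
import numpy as np


def invalid_mapped_ids(mapped_ids, valid_ids):
    """Return the mapped values that do not match any known ID."""
    mapped_ids = np.asarray(mapped_ids)
    return mapped_ids[~np.isin(mapped_ids, valid_ids)]


# Hypothetical data: two real household IDs, and three mapped values.
valid = np.array([218, 308])
mapped = np.array([154.0, 154.0, 218.0])  # 308 halved is integral but invalid

bad = invalid_mapped_ids(mapped, valid)
print(f"{len(bad)} mapped IDs do not exist at the household level: {bad}")
```

With real data, valid_ids would be the household-level result of sim.calculate('household_id') and mapped_ids the tax-unit-level result from the example above.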
Proposed Solutions
- Short-term: Add a warning when mapping variables with "_id" suffix between non-hierarchical entities
- Medium-term: Add a variable attribute to mark categorical/ID variables that should not be aggregated mathematically
- Long-term: Implement proper ID mapping logic that preserves the most common ID or uses a different strategy appropriate for categorical data
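One possible shape for the long-term option, as a sketch only: project the source ID onto persons unchanged, then take the most common ID within each target group. The map_id_by_mode function and its index-array arguments are hypothetical and not part of PolicyEngine Core. The sketch assumes target groups are indexed 0..n-1 so that the output lines up with the target entity.

```python
import numpy as np
import pandas as pd


def map_id_by_mode(ids, source_index, target_index):
    """Assign each target group the most common source ID among its members."""
    # Project IDs onto persons without any arithmetic.
    person_ids = np.asarray(ids)[np.asarray(source_index)]
    by_target = pd.Series(person_ids).groupby(np.asarray(target_index))
    # Series.mode() returns sorted values, so ties resolve to the smallest ID.
    return by_target.agg(lambda s: s.mode().iloc[0]).to_numpy()


# Hypothetical toy population: 4 people, 2 households, 3 tax units.
household_id = np.array([307, 218])
person_household = np.array([0, 0, 1, 1])
person_tax_unit = np.array([0, 1, 2, 2])

tax_unit_household_id = map_id_by_mode(household_id, person_household, person_tax_unit)
print(tax_unit_household_id)
```

Every value returned is an ID that actually exists at the source level. How to break ties when a target group spans several source groups is a design decision the sketch settles arbitrarily.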
Affected Variables
Testing shows at least these ID variables produce fractional values when mapped to tax_unit:
- household_id
- spm_unit_id
Environment
- policyengine-core version: 3.20.0
- policyengine-us version: 1.399.1
- Python version: 3.13