The random() function means that local area microsimulations will never match their calibration #412

policyengine-core's [random() function](https://github.com/PolicyEngine/policyengine-core/blob/master/policyengine_core/commons/formulas.py) (lines 308-348) is used for 3 variables in policyengine-us and 15 variables in policyengine-uk. Two inputs determine the seed for each person:

1. Entity ID (`f"{population.entity.key}_id"`)
2. Call count: how many times random() has been called in the simulation

The seed formula:
`seed = int(abs(id * 100 + population.simulation.count_random_calls))`

Example:
- Person 5, 1st call to random() → `seed = int(abs(5 * 100 + 1)) = 501`
- Person 5, 2nd call to random() → `seed = int(abs(5 * 100 + 2)) = 502`
- Person 7, 3rd call to random() in the simulation (the count is simulation-wide, not per person) → `seed = int(abs(7 * 100 + 3)) = 703`

For context, the multiplication by 100 [has caused integer overflow problems in the past](https://github.com/PolicyEngine/policyengine-core/issues/363).

In the **local area calibration case**, where donor households must have their `state_fips` swapped, re-keying with new person IDs is unavoidable. Because the seed used by random() depends on `person_id`, _the final Microsimulation from local area calibration will never match the matrix times the weights._
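
A minimal sketch of why re-keying changes the draws, using only the seed formula quoted above. The helper names (`seed_for`, `draw`) and the new ID `1005` are hypothetical, and Python's standard-library `random.Random` stands in for whatever generator core actually uses, so the specific values will differ from core's:

```
import random


def seed_for(entity_id: int, call_count: int) -> int:
    """Seed formula quoted above: id * 100 + call count."""
    return int(abs(entity_id * 100 + call_count))


def draw(entity_id: int, call_count: int) -> float:
    """One uniform draw in [0, 1) from a generator seeded for this entity.

    Assumption: random.Random is only a stand-in for core's generator.
    """
    return random.Random(seed_for(entity_id, call_count)).random()


# Same ID and same call count: the draw is reproducible.
original_draw = draw(5, 1)
repeated_draw = draw(5, 1)

# Same person after re-keying to a hypothetical new ID: the seed changes,
# so the draw (and any eligibility that depends on it) changes too.
rekeyed_draw = draw(1005, 1)

print(seed_for(5, 1), seed_for(1005, 1))
print(original_draw == repeated_draw, original_draw == rekeyed_draw)
```

Nothing about the person changed between the two draws except the ID, which is the only thing calibration has to change.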

For instance, here is how snap in policyengine-us relates to the random function:
```
snap
  └── snap_gross_income
        └── snap_unearned_income (uses `adds`)
              └── ssi (SSI benefit amount)
                    └── is_ssi_eligible
                          └── meets_ssi_resource_test
                                └── random()  ← stochastic eligibility
```

What this means is that a household with $2000 in snap that was assigned a weight of 150, partially due to this snap value, might end up in the final microsimulation with $1800 in snap but still a weight of 150. Then when we run `Microsimulation.calculate('snap').sum()`, we don't match the values from `X @ w` in the calibration. Whether that is because of a bug in the construction of the very complex X or because of random snap is very difficult to tell.
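
To make the size of the mismatch concrete, here is a small sketch with made-up numbers. Only the $2000, $1800 and weight of 150 come from the example above; the other two households and the helper `weighted_total` are hypothetical:

```
# Hypothetical three-household example; only the first household's
# values (2000 -> 1800, weight 150) come from the text above.
snap_at_calibration = [2000.0, 500.0, 0.0]   # the snap row of X
snap_after_rekeying = [1800.0, 500.0, 0.0]   # what the final simulation computes
weights = [150.0, 80.0, 200.0]               # w, fitted against the first row


def weighted_total(values, weights):
    """Plain-Python equivalent of a single-row X @ w."""
    return sum(v * w for v, w in zip(values, weights))


calibrated_total = weighted_total(snap_at_calibration, weights)
simulated_total = weighted_total(snap_after_rekeying, weights)
gap = calibrated_total - simulated_total

print(calibrated_total, simulated_total, gap)
```

The gap here is entirely due to the changed draw, but from the totals alone it is indistinguishable from an error in how X was built.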

***Recommendation*: Use seeds stored in the microdata like the SNAP take-up seed, and remove random from core.**

SNAP's "take-up seed" works quite differently. The seed is declared as a variable with no formula in `policyengine-us/policyengine_us/variables/gov/usda/snap/snap_take_up_seed.py` (lines 1-8):

```
class snap_take_up_seed(Variable):
    value_type = float
    entity = SPMUnit
    label = "Randomly assigned seed for SNAP take-up"
    definition_period = YEAR
    # No formula
```

The formula in `takes_up_snap_if_eligible.py` (lines 10-13) then compares that seed with the take-up rate:

```
def formula(spm_unit, period, parameters):
    seed = spm_unit("snap_take_up_seed", period)
    takeup_rate = parameters(period).gov.usda.snap.takeup_rate
    return seed < takeup_rate
```

Here, the snap_take_up_seed is defined in cps.py (line 230) in policyengine-us-data:
```
  data["snap_take_up_seed"] = generator.random(len(data["spm_unit_id"]))
```

This approach works better with local area calibration because the seed becomes linked to the household as a sort of property. We could define one seed value per person, household, etc. (really every unit), and anything random could depend on it. Reproducibility would also be much simpler.
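
A rough sketch of why a stored seed survives re-keying, assuming each unit simply carries its seed as a data column. The record layout, the generator seed, the ID offset and the take-up rate are all hypothetical; only the `seed < takeup_rate` comparison mirrors the formula above:

```
import random

generator = random.Random(0)  # arbitrary seed, for illustration only

# Each SPM unit stores its own seed, drawn once when the dataset is built.
spm_units = [
    {"spm_unit_id": unit_id, "snap_take_up_seed": generator.random()}
    for unit_id in range(1, 6)
]

TAKEUP_RATE = 0.8  # hypothetical value, not the real parameter


def takes_up_snap(unit: dict, takeup_rate: float) -> bool:
    """Same comparison as the take-up formula: seed < takeup_rate."""
    return unit["snap_take_up_seed"] < takeup_rate


before = [takes_up_snap(unit, TAKEUP_RATE) for unit in spm_units]

# Re-key every unit, as local area calibration must; the stored seed is
# copied along with the rest of the record.
rekeyed_units = [
    {**unit, "spm_unit_id": unit["spm_unit_id"] + 1000} for unit in spm_units
]
after = [takes_up_snap(unit, TAKEUP_RATE) for unit in rekeyed_units]

print(before == after)
```

Because the decision depends only on the stored seed, changing the ID leaves every take-up outcome unchanged.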
