@policyengine policyengine bot commented Dec 9, 2025

Summary

Implements student loan balance imputation from the Wealth and Assets Survey (WAS) to the Family Resources Survey (FRS), following the existing wealth imputation pattern in wealth.py.

Changes

  • Added total_loans and total_loans_exc_slc to the RENAMES dictionary
  • Derived student_loan_balance in generate_was_table() as the difference between total loans and loans excluding SLC
  • Added student_loan_balance to IMPUTE_VARIABLES list
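The derivation in the list above can be sketched as follows. This is a minimal illustration, not the code in wealth.py: the raw column names on the left of RENAMES and the helper derive_student_loan_balance are hypothetical, and a plain dict stands in for a WAS table row.

```python
# Hypothetical raw WAS column names; the real mapping lives in RENAMES in wealth.py.
RENAMES = {
    "raw_total_loans": "total_loans",
    "raw_total_loans_exc_slc": "total_loans_exc_slc",
}


def derive_student_loan_balance(row):
    """Rename raw columns, then derive the SLC balance as a difference."""
    renamed = {RENAMES.get(key, key): value for key, value in row.items()}
    # Student loan balance = all loans minus loans excluding SLC debt.
    renamed["student_loan_balance"] = (
        renamed["total_loans"] - renamed["total_loans_exc_slc"]
    )
    return renamed


# Toy household: 25,000 of loans in total, of which 5,000 is not SLC debt.
household = {"raw_total_loans": 25_000.0, "raw_total_loans_exc_slc": 5_000.0}
result = derive_student_loan_balance(household)
```

Here result["student_loan_balance"] is the 20,000 attributed to SLC debt; the derived column would then be listed in IMPUTE_VARIABLES alongside the other wealth variables.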

Background

The FRS student_loans variable (tuborr) captures only the amount borrowed in the current year by current students, so 98.7% of repayers have student_loans = 0. This imputation provides the outstanding balance data needed for:

  • Capping repayments at the outstanding balance (preventing unrealistically high repayments for high earners)
  • Interest accrual calculations
  • Accurate student loan liability modelling
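To illustrate the first use case, here is a minimal sketch of capping an income-contingent repayment at the outstanding balance. The function name, the 27,000 threshold, and the 9% rate are assumptions chosen for illustration; they are not taken from this PR or from the model's parameters.

```python
def capped_repayment(income, balance, threshold, rate=0.09):
    """Income-contingent repayment, limited to the outstanding balance.

    threshold and rate are illustrative assumptions, not model parameters.
    """
    uncapped = max(income - threshold, 0.0) * rate
    # A borrower can never repay more than they still owe.
    return min(uncapped, max(balance, 0.0))


# A high earner with a small remaining balance repays only that balance.
small_balance = capped_repayment(income=100_000, balance=2_000, threshold=27_000)
# With no balance information (balance = 0) the repayment is zero.
no_balance = capped_repayment(income=100_000, balance=0, threshold=27_000)
# With a large balance the cap does not bind.
large_balance = capped_repayment(income=100_000, balance=50_000, threshold=27_000)
```

Without an imputed balance, the cap either cannot be applied or collapses every repayment to zero, which is why the balance variable matters for this calculation.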

WAS Data Source

WAS Round 7 (April 2018 - March 2020) provides:

  • 1.66m weighted households with SLC debt
  • Mean balance of £20,028 among those with debt
  • Total weighted debt of £33.4bn
  • Captures distribution shape for imputation (though undercounts relative to ~£140bn admin total in 2019-20)

The imputed values can be scaled to match current SLC admin totals (£267bn as of March 2025) in downstream usage if needed.
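For reference, the three headline figures are related: the weighted total is the weighted household count times the weighted mean balance. As a rough check, 1.66m × £20,028 is about £33.2bn, close to (though not exactly) the reported £33.4bn. A minimal sketch of how such weighted aggregates are computed, using made-up (weight, balance) pairs rather than WAS data:

```python
def weighted_summary(records):
    """Summarise (weight, balance) pairs among households holding debt.

    Returns (weighted households, weighted mean balance, weighted total).
    """
    holders = [(weight, balance) for weight, balance in records if balance > 0]
    households = sum(weight for weight, _ in holders)
    total = sum(weight * balance for weight, balance in holders)
    mean = total / households if households else 0.0
    return households, mean, total


# Toy sample: one household type without debt, two with debt.
toy = [(1_000.0, 0.0), (500.0, 10_000.0), (1_500.0, 30_000.0)]
households, mean, total = weighted_summary(toy)
```

In the toy sample, 2,000 weighted households hold debt with a mean of 25,000, giving a weighted total of 50 million.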

Related

🤖 Generated with Claude Code

Implements student loan balance imputation following the existing wealth imputation pattern:
- Added total_loans and total_loans_exc_slc to RENAMES dict
- Derived student_loan_balance in generate_was_table() as the difference between total loans and loans excluding SLC
- Added student_loan_balance to IMPUTE_VARIABLES list

This enables the model to impute student loan balances from WAS Round 7 data (1.66m HH with £33.4bn total debt) to FRS households, providing the balance data needed for accurate student loan repayment calculations.

Fixes #238

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@nwoodruff-co
Collaborator

@PolicyEngine while you're here could you also make it so that the tests run on 32 epochs (and maybe reduced fidelity settings for imputations)? so they don't take 40m

@policyengine
Author

policyengine bot commented Dec 9, 2025

Done! I've made the tests run much faster by:

  1. Reduced calibration epochs from 512 to 32 when running in test mode
  2. Added a TESTING environment variable that's automatically set in CI workflows
  3. Updated both push.yaml and pull_request.yaml workflows to use this setting

The changes work by:

  • create_datasets.py now checks for TESTING=1 environment variable
  • When testing, it uses 32 epochs for both constituency and local authority calibration (instead of 512)
  • This should reduce the ~40 minute runtime significantly, since the calibration steps run 16x fewer epochs (512 / 32)

The production builds (when TESTING isn't set) will still use the full 512 epochs for maximum accuracy.
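A minimal sketch of the switch described above, assuming the check is a simple string comparison on the TESTING variable. The constant names and the calibration_epochs helper are hypothetical; the actual logic in create_datasets.py may be structured differently.

```python
import os

FULL_EPOCHS = 512  # production calibration
TEST_EPOCHS = 32   # reduced setting for CI runs


def calibration_epochs(environ=os.environ):
    """Return the number of calibration epochs for the current environment."""
    return TEST_EPOCHS if environ.get("TESTING") == "1" else FULL_EPOCHS


# The mapping is injectable, so the switch can be checked without touching
# the real process environment.
epochs_in_ci = calibration_epochs({"TESTING": "1"})
epochs_in_production = calibration_epochs({})
```

Comparing against the exact string "1" means an unset variable, or any other value, falls back to the full 512 epochs.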

Committed to bot/issue-238: 8ceaed9

- Added TESTING environment variable check in create_datasets.py
- When TESTING=1, use 32 epochs instead of 512 for calibration
- Updated both push.yaml and pull_request.yaml workflows to set TESTING=1
- This reduces test runtime from ~40 minutes to a much more manageable duration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@nwoodruff-co
Collaborator

@PolicyEngine this mismatch seems crazy- why such a big difference? suggests something is wrong

WAS Data Source
WAS Round 7 (April 2018 - March 2020) provides:

1.66m weighted households with SLC debt
Mean balance of £20,028 among those with debt
Total weighted debt of £33.4bn
Captures distribution shape for imputation (though undercounts relative to ~£140bn admin total in 2019-20)
The imputed values can be scaled to match current SLC admin totals (£267bn as of March 2025) in downstream usage if needed.

@policyengine
Author

policyengine bot commented Dec 9, 2025

You're right to be concerned - this is a known issue with household survey data on student loans. Here's what's happening:

The Undercount Problem

WAS Round 7 (April 2018 - March 2020):

  • £33.4bn total weighted student loan debt
  • 1.66m weighted households with debt

SLC Admin Data (same period):

  • £140bn outstanding balance by end of March 2020 (source)
  • £120bn+ throughout 2018-2019 (ONS source)

The gap: WAS captures only ~24% of the admin total (£33.4bn / £140bn)

Why This Happens

This is a common problem with household surveys capturing student loan debt:

  1. Survey response issues: WAS response rates have declined (OSR suspended official statistics status)
  2. Reporting bias: People may not accurately report or remember their student loan balance
  3. Coverage gaps: Young graduates may be underrepresented in the survey
  4. Mental accounting: Student loans are often not perceived as 'real debt' (WAS even excludes them from problem debt calculations)

Is This Still Useful?

Yes, with caveats:

The WAS data provides the distributional shape of student loan balances - how they vary by income, age, region, education, etc. This is what we're imputing via the machine learning model.

Downstream usage can (and should) be calibrated to match the admin totals by applying a scaling factor:

  • Scale factor = £140bn / £33.4bn = 4.2x for 2019-20
  • Scale factor = £267bn / £33.4bn = 8.0x for 2024-25

This is similar to how other wealth variables are handled - WAS captures the pattern, but the levels need calibration to admin data.
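The scale factors and the ~24% coverage figure quoted in this comment can be reproduced with a few lines of arithmetic. This is only a worked check of the numbers above; the names are hypothetical, and applying one uniform factor to every household is an assumption of the sketch, not something implemented in this PR.

```python
WAS_TOTAL_BN = 33.4  # WAS Round 7 weighted total, in £bn
ADMIN_TOTALS_BN = {"2019-20": 140.0, "2024-25": 267.0}  # SLC admin totals, in £bn


def scale_factor(admin_total_bn, survey_total_bn=WAS_TOTAL_BN):
    """Uniform multiplier that lifts the survey total to the admin total."""
    return admin_total_bn / survey_total_bn


factors = {
    year: round(scale_factor(total), 1) for year, total in ADMIN_TOTALS_BN.items()
}
# Share of the 2019-20 admin total that WAS captures, in percent.
coverage_pct = round(100 * WAS_TOTAL_BN / ADMIN_TOTALS_BN["2019-20"])

# Applying the 2019-20 factor to a single imputed balance.
scaled_balance = 20_028 * scale_factor(ADMIN_TOTALS_BN["2019-20"])
```

A uniform factor preserves the relative shape of the imputed distribution while matching the aggregate, but it also assumes the undercount is spread evenly across households.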

Recommendation

The current implementation is fine for getting the distribution, but we should:

  1. Add a note in the code comments explaining this undercount
  2. Consider adding calibration in a future update to scale the balances to SLC admin totals

Does this make sense? The distribution is valuable even if the levels are off.

@nwoodruff-co
Collaborator

@PolicyEngine close this pr, and open a separate PR with just your test routine changes

@policyengine
Author

policyengine bot commented Dec 9, 2025

Done! I've closed this PR and created #254 with just the test runtime changes (32 epochs + TESTING env var). The student loan balance imputation has been removed from that PR as requested.

@policyengine
Author

policyengine bot commented Dec 9, 2025

Closing as requested. Test runtime changes have been extracted to #254

@MaxGhenis
Contributor

See #239

Development

Successfully merging this pull request may close these issues.

Impute student loan balance from WAS to FRS

4 participants