Add batch QA runner and caching support #70
base: demo-qa
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    passed = answer.strip() == case.expected.strip()
    detail = None if passed else f"expected={case.expected!r}, got={answer!r}"
    return ExpectedCheck(mode="exact", expected=case.expected, passed=passed, detail=detail)
if case.expected_regex is not None:
    pattern = re.compile(case.expected_regex)
Batch cases crash on malformed expected fields
_match_expected unconditionally calls string methods/regex compilation on the user-supplied expectations. If a JSONL case provides a non-string expected (or expected_contains) or an invalid expected_regex, these lines raise (e.g., AttributeError or re.error), aborting the entire batch run instead of returning a mismatch/error result that respects --fail-on/--max-fails. The batch CLI therefore cannot tolerate a single malformed case entry.
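A sketch of how the exact/regex branches could validate the user-supplied fields and report a failed check instead of raising; ExpectedCheck and Case come from the PR's snippet above, and the detail wording is an assumption:

```python
import re

def _match_expected(case: "Case", answer: str | None) -> "ExpectedCheck | None":
    if case.expected is not None:
        if not isinstance(case.expected, str):
            # Malformed JSONL entry: report a failed check instead of raising.
            return ExpectedCheck(mode="exact", expected=case.expected, passed=False,
                                 detail=f"expected must be a string, got {type(case.expected).__name__}")
        passed = answer is not None and answer.strip() == case.expected.strip()
        detail = None if passed else f"expected={case.expected!r}, got={answer!r}"
        return ExpectedCheck(mode="exact", expected=case.expected, passed=passed, detail=detail)
    if case.expected_regex is not None:
        try:
            pattern = re.compile(case.expected_regex)
        except (re.error, TypeError) as exc:
            # Invalid or non-string pattern: surface the problem as a per-case failure.
            return ExpectedCheck(mode="regex", expected=case.expected_regex, passed=False,
                                 detail=f"invalid expected_regex: {exc}")
        ...  # existing regex matching continues here
```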
@codex, please review
💡 Codex Review
examples/demo_qa/cli.py (outdated)
case_p = sub.add_subparsers(dest="case_command")
case_root = sub.add_parser("case", help="Single-case utilities")
case_sub = case_root.add_subparsers(dest="case_command", required=True)
CLI parser crashes building subcommands
Any invocation of the demo CLI fails before reaching the handlers because sub is an argparse _SubParsersAction (from parser.add_subparsers), which does not provide add_subparsers; calling it raises AttributeError during parser construction. As a result python -m examples.demo_qa.cli ... aborts immediately instead of running chat/batch/case commands. This line should be removed or the nested subparsers should be added to the case parser instead.
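A sketch of the intended structure, assuming the surrounding parser setup from the PR: nested subcommands hang off the parser created with sub.add_parser, not off sub itself (the "show" subcommand is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(prog="demo_qa")
sub = parser.add_subparsers(dest="command", required=True)

# `sub` is an argparse._SubParsersAction and has no add_subparsers(); only parsers do.
case_root = sub.add_parser("case", help="Single-case utilities")
case_sub = case_root.add_subparsers(dest="case_command", required=True)
case_sub.add_parser("show", help="Show a single case")  # hypothetical subcommand
```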
@codex, please review
💡 Codex Review
def _match_expected(case: Case, answer: str | None) -> ExpectedCheck | None:
    if not case.has_asserts:
        return None
    expected_value = case.expected or case.expected_regex or case.expected_contains or ""
    if answer is None:
Plan-only runs fail all cases with expectations
In _match_expected a missing answer is treated as a failed assertion (ExpectedCheck(... passed=False, detail="no answer")). When --plan-only is used, run_question never populates artifacts.answer, so any case with expected/expected_* is marked mismatch and counted as a failure even though the user explicitly skipped fetch/synthesis. This makes plan-only mode unusable on asserted datasets (the batch exit code will be non-zero even when nothing actually ran). The expectation check should be skipped or downgraded when plan_only is set.
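One possible downgrade, sketched with an assumed plan_only parameter threaded through from the batch runner (that parameter does not exist in the reviewed code):

```python
def _match_expected(case: "Case", answer: str | None, *, plan_only: bool = False) -> "ExpectedCheck | None":
    if not case.has_asserts:
        return None
    if plan_only:
        # Fetch/synthesis was skipped on purpose, so there is nothing to assert
        # against; let the case surface as plan_only/unchecked instead of mismatch.
        return None
    ...  # existing exact/regex/contains checks continue here
```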
examples/demo_qa/batch.py (outdated)
baseline_filter_path = args.only_failed_from
if args.only_failed and not baseline_filter_path:
    latest = _load_latest_run(artifacts_dir)
    if latest:
        baseline_filter_path = latest / "results.jsonl"
Latest-run pointer ignores custom --out results location
--only-failed derives its baseline from runs/latest.txt by appending results.jsonl, but latest.txt always points to the artifacts run folder (written at the end of each batch) and never reflects a custom --out. If a prior run wrote its results elsewhere via --out, a subsequent --only-failed run looks for <run_folder>/results.jsonl, fails inside load_results, and exits with an error even though the baseline exists at the user-provided path. The latest pointer should reference the actual results path, or --only-failed should honor --out when locating the baseline.
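A sketch of one option: record the resolved results path in latest.txt so --only-failed finds the baseline regardless of --out (the helper names are illustrative, not from the PR):

```python
from pathlib import Path

def write_latest_pointer(runs_dir: Path, results_path: Path) -> None:
    # Persist the file the results were actually written to, not just the run folder.
    (runs_dir / "latest.txt").write_text(str(results_path.resolve()), encoding="utf-8")

def resolve_baseline(only_failed_from: Path | None, runs_dir: Path) -> Path | None:
    if only_failed_from is not None:
        return only_failed_from
    pointer = runs_dir / "latest.txt"
    if not pointer.exists():
        return None
    recorded = Path(pointer.read_text(encoding="utf-8").strip())
    # Tolerate older pointers that stored a run folder rather than a results file.
    return recorded if recorded.suffix == ".jsonl" else recorded / "results.jsonl"
```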
@codex, please review
💡 Codex Review
examples/demo_qa/batch.py (outdated)
if fail_on == "error":
    bad = {"error"}
elif fail_on == "mismatch":
    bad = {"mismatch"}
--fail-on mismatch treats errors as success
is_failure resets the failure set to only {"mismatch"} when fail_on is "mismatch", dropping "error" and "failed". With that flag, a batch run where cases hit provider/LLM errors will report exit_code = 0 and a clean summary even though nothing executed successfully, which defeats CI/automation that expects errors to fail the run. The mismatch mode should still consider errors failing instead of silently treating them as success.
@codex, please review
💡 Codex Review
examples/demo_qa/batch.py (outdated)
def is_failure(status: str, fail_on: str, require_assert: bool) -> bool:
    bad = {"error", "failed", "mismatch"}
    unchecked = {"unchecked", "plan_only"}
    if require_assert:
        bad |= unchecked
    if fail_on == "error":
        bad = {"error"}
    elif fail_on == "mismatch":
        bad = {"error", "failed", "mismatch"}
    elif fail_on == "unchecked":
        bad |= unchecked
    elif fail_on == "bad":
        bad = {"error", "failed", "mismatch"}
        if require_assert:
            bad |= unchecked
Honor --require-assert when fail_on=mismatch/error
is_failure adds unchecked/plan_only statuses when require_assert is set, but for fail_on == "mismatch" (and "error") it immediately overwrites the set and drops those statuses, so --require-assert is silently ignored unless the user also passes --fail-on bad/unchecked/any. A batch like demo_qa.cli batch --fail-on mismatch --require-assert ... will exit 0 even if every case is unchecked/plan_only, defeating the flag’s intent to fail missing assertions.
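A sketch of one restructuring that keeps --require-assert effective in every mode (the mode and status names follow the snippet above; the handling of "any" is an assumption):

```python
def is_failure(status: str, fail_on: str, require_assert: bool) -> bool:
    unchecked = {"unchecked", "plan_only"}
    # Choose the per-mode base set first...
    if fail_on == "error":
        bad = {"error"}
    elif fail_on in {"mismatch", "bad"}:
        bad = {"error", "failed", "mismatch"}
    else:  # "unchecked", "any", ...
        bad = {"error", "failed", "mismatch"} | unchecked
    # ...then apply --require-assert last so no mode can silently drop it.
    if require_assert:
        bad |= unchecked
    return status in bad
```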
@codex, please review
💡 Codex Review
    if not case.has_asserts:
        return None
    expected_value = case.expected or case.expected_regex or case.expected_contains or ""
    if answer is None:
        return ExpectedCheck(mode="none", expected=expected_value, passed=False, detail="no answer")
Plan-only runs mark asserted cases as mismatches
Plan-only batch/REPL runs still flow through _match_expected, which returns a failed ExpectedCheck whenever answer is None (lines 185‑189). When run_one(..., plan_only=True) executes cases with expected/expected_regex/expected_contains, those cases are recorded as mismatch rather than plan_only, so is_failure(..., fail_on=bad) triggers a non-zero exit even though fetch/synthesis was intentionally skipped. A plan-only dry run should stay plan_only/unchecked or bypass expectation checks to avoid spurious failures when using --plan-only.
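A sketch of the caller side, assuming run_one decides the per-case status; the status strings follow the batch snippets, while the control flow and the passing status name are assumptions:

```python
# Inside run_one(case, ..., plan_only=False): decide the status before (or
# instead of) consulting the expectation check on a deliberate dry run.
if plan_only:
    status, check = "plan_only", None   # never treat a skipped synthesis as mismatch
else:
    check = _match_expected(case, answer)
    if check is None:
        status = "unchecked"
    else:
        status = "ok" if check.passed else "mismatch"  # "ok" stands in for the passing status
```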
@codex, please review
💡 Codex Review
examples/demo_qa/batch.py (outdated)
bad_statuses = {"mismatch", "failed", "error"}
if require_assert or fail_on in {"unchecked", "any"}:
    bad_statuses |= {"unchecked", "plan_only"}
target_ids = {case_id for case_id, res in baseline_for_filter.items() if res.status in bad_statuses}
Honor fail_on when filtering --only-failed cases
The rerun filter in _select_cases_for_rerun hardcodes bad_statuses to mismatch/failed/error and only extends it for unchecked/plan_only, so it ignores other --fail-on modes such as skipped. With demo_qa batch --only-failed --fail-on skipped, is_failure will treat skipped cases as failures, but they are never selected for rerun here, meaning skipped cases cannot be retried or surfaced even though the caller requested them to fail the build.
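A sketch of driving the rerun filter off the same predicate that decides the exit code, assuming is_failure is importable here:

```python
# Re-run exactly the cases that would fail the build under the current flags,
# so --fail-on skipped (or any future mode) is honored automatically.
target_ids = {
    case_id
    for case_id, res in baseline_for_filter.items()
    if is_failure(res.status, fail_on, require_assert)
}
```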
@codex, please review
💡 Codex Review
examples/demo_qa/batch.py (outdated)
bad_count = counts.get("mismatch", 0) + counts.get("failed", 0) + counts.get("error", 0)
unchecked = counts.get("unchecked", 0)
plan_only = counts.get("plan_only", 0)
if args.require_assert or args.fail_on in {"unchecked", "any"}:
    bad_count += unchecked + plan_only
Include skipped failures in BAD summary when requested
When running demo_qa batch --fail-on skipped, is_failure treats skipped cases as failures and the command exits non‑zero, but the BAD count printed in the summary is computed only from mismatches/errors (lines 317‑321) and never adds skipped cases. That leads to summaries like BAD: 0 even though the exit code indicates failure due solely to skipped cases, which is misleading for users trying to triage runs. Add skipped statuses to the BAD tally when fail_on includes skipped so the summary aligns with the exit criteria.
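A sketch of tallying BAD through the same predicate, assuming counts maps status names to occurrence counts and is_failure is in scope:

```python
# Keep the printed BAD number aligned with the exit criteria, including
# modes such as --fail-on skipped.
bad_count = sum(
    n for status, n in counts.items()
    if is_failure(status, args.fail_on, args.require_assert)
)
```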
💡 Codex Review
examples/demo_qa/settings.py (outdated)
from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator
from pydantic_settings.sources.providers.toml import TomlConfigSettingsSource
Demo extra pins incompatible pydantic-settings
The new import path for TomlConfigSettingsSource (pydantic_settings.sources.providers.toml) requires a newer pydantic-settings release, but the demo/dev extras in pyproject.toml still declare pydantic-settings>=2.2. An environment installed as recommended (pip install -e .[demo]) can therefore satisfy the constraint with 2.2.x, and the CLI then crashes at module import with ModuleNotFoundError: No module named 'pydantic_settings.sources.providers' before it can show the friendly dependency message. Please raise the extra's minimum version to the level implied by this import (or revert to the old import path) so a default install can run the new batch commands.
@codex, please review
💡 Codex Review
from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator
from pydantic_settings.sources.providers.toml import TomlConfigSettingsSource

try:
    from pydantic_settings import BaseSettings, SettingsConfigDict, TomlConfigSettingsSource
    from pydantic_settings import BaseSettings, SettingsConfigDict
Fix TomlConfigSettingsSource import path
The settings module now imports TomlConfigSettingsSource from pydantic_settings.sources.providers.toml, but pydantic-settings 2.x exposes this class from the top-level pydantic_settings package (and from pydantic_settings.sources) and does not guarantee a sources.providers package. The module raises ModuleNotFoundError as soon as it is imported (before the guarded try/except), so any CLI command or test that loads examples.demo_qa.settings fails even when the declared dependency (pydantic-settings>=2.12) is installed. Please restore the import to the public location so configuration loading works.
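A sketch of importing from the public locations with a guarded fallback, assuming the module is meant to print a friendly message when pydantic-settings is absent (the message wording is an assumption):

```python
try:
    # Public top-level export in recent pydantic-settings releases.
    from pydantic_settings import BaseSettings, SettingsConfigDict, TomlConfigSettingsSource
except ImportError:
    try:
        # Older layouts expose the class from the sources module instead.
        from pydantic_settings import BaseSettings, SettingsConfigDict
        from pydantic_settings.sources import TomlConfigSettingsSource
    except ImportError as exc:
        raise SystemExit(
            "examples.demo_qa requires pydantic-settings; install it with "
            "`pip install -e .[demo]`."
        ) from exc
```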
@codex, please review
💡 Codex Review
base_by_id = {res.id: res for res in base_results}
new_by_id = {res.id: res for res in new_results}
all_ids = sorted(new_by_id.keys())
Diff ignores baseline-only cases
The comparison logic only iterates over new_results IDs (all_ids = sorted(new_by_id.keys())), so any case present in the baseline but missing from the new results is silently dropped from new_fail/still_fail and from the JUnit tests count. When a batch run skips or truncates cases (e.g., a crash or running a filtered subset), the compare report will show no regression for those missing cases and the test count will shrink instead of flagging them, masking real failures. Consider diffing over the union of baseline and new IDs so omitted cases are reported explicitly.
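A sketch of diffing over the union of IDs so baseline-only cases are reported explicitly (the dict names follow the snippet above; the missing bucket is an assumption):

```python
# Walk the union so a case that disappeared from the new run is surfaced
# instead of silently shrinking the comparison and the JUnit test count.
all_ids = sorted(set(base_by_id) | set(new_by_id))
missing_in_new: list[str] = []
for case_id in all_ids:
    base_res = base_by_id.get(case_id)
    new_res = new_by_id.get(case_id)
    if new_res is None:
        missing_in_new.append(case_id)  # present in baseline, absent now
        continue
    # ... existing new_fail / still_fail classification ...
```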
…logic-in-runner.py: Handle missing cases when comparing demo QA runs
…lexanderOnischenko/fetchgraph into codex/add-batch-command-to-demo_qa-cli
@codex, please review
💡 Codex Review
error_artifacts = artifacts or RunArtifacts(run_id=run_id, run_dir=runs_root, question=line)
error_artifacts.error = error_artifacts.error or str(exc)
last_artifacts = error_artifacts
_save_artifacts(error_artifacts)
save_artifacts(error_artifacts)
Save REPL error artifacts to unique run folders
In the REPL fallback path, if the agent run raises before returning a RunResult, the code builds the placeholder RunArtifacts with run_dir=runs_root (the shared parent directory) and then writes error artifacts there. This mixes error files into the root of the runs tree and can overwrite or obscure artifacts from other runs, and the /run command will no longer point to a per-run folder. The error path should use the same per-run directory layout as the success path so failures are isolated and do not clobber existing artifacts.
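A sketch of isolating the error path in its own per-run folder, assuming the success path already writes to runs_root/<run_id> (that layout is an assumption):

```python
# Mirror the success path's per-run layout so error artifacts never land in
# the shared runs root and cannot clobber artifacts from other runs.
error_dir = runs_root / run_id
error_dir.mkdir(parents=True, exist_ok=True)
error_artifacts = artifacts or RunArtifacts(run_id=run_id, run_dir=error_dir, question=line)
error_artifacts.error = error_artifacts.error or str(exc)
last_artifacts = error_artifacts
save_artifacts(error_artifacts)
```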
…py-into-runs-and-commands:
- move run layout, scope, effective, coverage, and case-history helpers into a new examples/demo_qa/runs package
- extract history and report handlers into examples/demo_qa/commands and simplify batch.py orchestration and CLI imports
- move run result writing into a neutral runs/io.py, relocate the effective diff reader, and streamline report helper dependencies
- carry the effective fail policy into tag reports, clean helper imports, and avoid private reason helpers
…-resilient-to-interruptions: Handle partial batch runs and missed case selection
Add run tagging
Summary
Testing
python -m examples.demo_qa.cli batch --help (fails in this environment because pydantic-settings is not installed)
Codex Task