Add batch QA runner and caching support #70
base: demo-qa
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    passed = answer.strip() == case.expected.strip()
    detail = None if passed else f"expected={case.expected!r}, got={answer!r}"
    return ExpectedCheck(mode="exact", expected=case.expected, passed=passed, detail=detail)
if case.expected_regex is not None:
    pattern = re.compile(case.expected_regex)
Batch cases crash on malformed expected fields
_match_expected unconditionally calls string methods/regex compilation on the user-supplied expectations. If a JSONL case provides a non-string expected (or expected_contains) or an invalid expected_regex, these lines raise (e.g., AttributeError or re.error), aborting the entire batch run instead of returning a mismatch/error result that respects --fail-on/--max-fails. The batch CLI therefore cannot tolerate a single malformed case entry.
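A sketch of how the exact/regex branches could validate the user-supplied fields and report a failed check instead of raising; ExpectedCheck and Case come from the PR's snippet above, and the detail wording is an assumption:

```python
import re

def _match_expected(case: "Case", answer: str | None) -> "ExpectedCheck | None":
    if case.expected is not None:
        if not isinstance(case.expected, str):
            # Malformed JSONL entry: report a failed check instead of raising.
            return ExpectedCheck(mode="exact", expected=case.expected, passed=False,
                                 detail=f"expected must be a string, got {type(case.expected).__name__}")
        passed = answer is not None and answer.strip() == case.expected.strip()
        detail = None if passed else f"expected={case.expected!r}, got={answer!r}"
        return ExpectedCheck(mode="exact", expected=case.expected, passed=passed, detail=detail)
    if case.expected_regex is not None:
        try:
            pattern = re.compile(case.expected_regex)
        except (re.error, TypeError) as exc:
            # Invalid or non-string pattern: surface the problem as a per-case failure.
            return ExpectedCheck(mode="regex", expected=case.expected_regex, passed=False,
                                 detail=f"invalid expected_regex: {exc}")
        ...  # existing regex matching continues here
```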
@codex, please review
💡 Codex Review
examples/demo_qa/cli.py (outdated)
case_p = sub.add_subparsers(dest="case_command")
case_root = sub.add_parser("case", help="Single-case utilities")
case_sub = case_root.add_subparsers(dest="case_command", required=True)
CLI parser crashes building subcommands
Any invocation of the demo CLI fails before reaching the handlers because sub is an argparse _SubParsersAction (from parser.add_subparsers), which does not provide add_subparsers; calling it raises AttributeError during parser construction. As a result python -m examples.demo_qa.cli ... aborts immediately instead of running chat/batch/case commands. This line should be removed or the nested subparsers should be added to the case parser instead.
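A sketch of the intended structure, assuming the surrounding parser setup from the PR: nested subcommands hang off the parser created with sub.add_parser, not off sub itself (the "show" subcommand is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(prog="demo_qa")
sub = parser.add_subparsers(dest="command", required=True)

# `sub` is an argparse._SubParsersAction and has no add_subparsers(); only parsers do.
case_root = sub.add_parser("case", help="Single-case utilities")
case_sub = case_root.add_subparsers(dest="case_command", required=True)
case_sub.add_parser("show", help="Show a single case")  # hypothetical subcommand
```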
@codex, please review
💡 Codex Review
def _match_expected(case: Case, answer: str | None) -> ExpectedCheck | None:
    if not case.has_asserts:
        return None
    expected_value = case.expected or case.expected_regex or case.expected_contains or ""
    if answer is None:
Plan-only runs fail all cases with expectations
In _match_expected a missing answer is treated as a failed assertion (ExpectedCheck(... passed=False, detail="no answer")). When --plan-only is used, run_question never populates artifacts.answer, so any case with expected/expected_* is marked mismatch and counted as a failure even though the user explicitly skipped fetch/synthesis. This makes plan-only mode unusable on asserted datasets (the batch exit code will be non-zero even when nothing actually ran). The expectation check should be skipped or downgraded when plan_only is set.
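One possible downgrade, sketched with an assumed plan_only parameter threaded through from the batch runner (that parameter does not exist in the reviewed code):

```python
def _match_expected(case: "Case", answer: str | None, *, plan_only: bool = False) -> "ExpectedCheck | None":
    if not case.has_asserts:
        return None
    if plan_only:
        # Fetch/synthesis was skipped on purpose, so there is nothing to assert
        # against; let the case surface as plan_only/unchecked instead of mismatch.
        return None
    ...  # existing exact/regex/contains checks continue here
```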
examples/demo_qa/batch.py (outdated)
baseline_filter_path = args.only_failed_from
if args.only_failed and not baseline_filter_path:
    latest = _load_latest_run(artifacts_dir)
    if latest:
        baseline_filter_path = latest / "results.jsonl"
Latest-run pointer ignores custom --out results location
--only-failed derives its baseline from runs/latest.txt by appending results.jsonl, but latest.txt always points to the artifacts run folder (written at the end of each batch) and never reflects a custom --out. If a prior run wrote its results elsewhere via --out, a subsequent --only-failed run looks for <run_folder>/results.jsonl, fails inside load_results, and exits with an error even though the baseline exists at the user-provided path. The latest pointer should reference the actual results path, or --only-failed should honor --out when locating the baseline.
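A sketch of one option: record the resolved results path in latest.txt so --only-failed finds the baseline regardless of --out (the helper names are illustrative, not from the PR):

```python
from pathlib import Path

def write_latest_pointer(runs_dir: Path, results_path: Path) -> None:
    # Persist the file the results were actually written to, not just the run folder.
    (runs_dir / "latest.txt").write_text(str(results_path.resolve()), encoding="utf-8")

def resolve_baseline(only_failed_from: Path | None, runs_dir: Path) -> Path | None:
    if only_failed_from is not None:
        return only_failed_from
    pointer = runs_dir / "latest.txt"
    if not pointer.exists():
        return None
    recorded = Path(pointer.read_text(encoding="utf-8").strip())
    # Tolerate older pointers that stored a run folder rather than a results file.
    return recorded if recorded.suffix == ".jsonl" else recorded / "results.jsonl"
```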
@codex, please review
💡 Codex Review
examples/demo_qa/batch.py (outdated)
if fail_on == "error":
    bad = {"error"}
elif fail_on == "mismatch":
    bad = {"mismatch"}
--fail-on mismatch treats errors as success
is_failure resets the failure set to only {"mismatch"} when fail_on is "mismatch", dropping "error" and "failed". With that flag, a batch run where cases hit provider/LLM errors will report exit_code = 0 and a clean summary even though nothing executed successfully, which defeats CI/automation that expects errors to fail the run. The mismatch mode should still consider errors failing instead of silently treating them as success.
@codex, please review
💡 Codex Review
examples/demo_qa/batch.py (outdated)
def is_failure(status: str, fail_on: str, require_assert: bool) -> bool:
    bad = {"error", "failed", "mismatch"}
    unchecked = {"unchecked", "plan_only"}
    if require_assert:
        bad |= unchecked
    if fail_on == "error":
        bad = {"error"}
    elif fail_on == "mismatch":
        bad = {"error", "failed", "mismatch"}
    elif fail_on == "unchecked":
        bad |= unchecked
    elif fail_on == "bad":
        bad = {"error", "failed", "mismatch"}
        if require_assert:
            bad |= unchecked
Honor --require-assert when fail_on=mismatch/error
is_failure adds unchecked/plan_only statuses when require_assert is set, but for fail_on == "mismatch" (and "error") it immediately overwrites the set and drops those statuses, so --require-assert is silently ignored unless the user also passes --fail-on bad/unchecked/any. A batch like demo_qa.cli batch --fail-on mismatch --require-assert ... will exit 0 even if every case is unchecked/plan_only, defeating the flag’s intent to fail missing assertions.
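A sketch of one restructuring that keeps --require-assert effective in every mode (the mode and status names follow the snippet above; the handling of "any" is an assumption):

```python
def is_failure(status: str, fail_on: str, require_assert: bool) -> bool:
    unchecked = {"unchecked", "plan_only"}
    # Choose the per-mode base set first...
    if fail_on == "error":
        bad = {"error"}
    elif fail_on in {"mismatch", "bad"}:
        bad = {"error", "failed", "mismatch"}
    else:  # "unchecked", "any", ...
        bad = {"error", "failed", "mismatch"} | unchecked
    # ...then apply --require-assert last so no mode can silently drop it.
    if require_assert:
        bad |= unchecked
    return status in bad
```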
@codex, please review
💡 Codex Review
    if not case.has_asserts:
        return None
    expected_value = case.expected or case.expected_regex or case.expected_contains or ""
    if answer is None:
        return ExpectedCheck(mode="none", expected=expected_value, passed=False, detail="no answer")
Plan-only runs mark asserted cases as mismatches
Plan-only batch/REPL runs still flow through _match_expected, which returns a failed ExpectedCheck whenever answer is None (lines 185‑189). When run_one(..., plan_only=True) executes cases with expected/expected_regex/expected_contains, those cases are recorded as mismatch rather than plan_only, so is_failure(..., fail_on=bad) triggers a non-zero exit even though fetch/synthesis was intentionally skipped. A plan-only dry run should stay plan_only/unchecked or bypass expectation checks to avoid spurious failures when using --plan-only.
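A sketch of the caller side, assuming run_one decides the per-case status; the status strings follow the batch snippets, while the control flow and the passing status name are assumptions:

```python
# Inside run_one(case, ..., plan_only=False): decide the status before (or
# instead of) consulting the expectation check on a deliberate dry run.
if plan_only:
    status, check = "plan_only", None   # never treat a skipped synthesis as mismatch
else:
    check = _match_expected(case, answer)
    if check is None:
        status = "unchecked"
    else:
        status = "ok" if check.passed else "mismatch"  # "ok" stands in for the passing status
```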
@codex, please review
💡 Codex Review
examples/demo_qa/batch.py (outdated)
bad_statuses = {"mismatch", "failed", "error"}
if require_assert or fail_on in {"unchecked", "any"}:
    bad_statuses |= {"unchecked", "plan_only"}
target_ids = {case_id for case_id, res in baseline_for_filter.items() if res.status in bad_statuses}
Honor fail_on when filtering --only-failed cases
The rerun filter in _select_cases_for_rerun hardcodes bad_statuses to mismatch/failed/error and only extends it for unchecked/plan_only, so it ignores other --fail-on modes such as skipped. With demo_qa batch --only-failed --fail-on skipped, is_failure will treat skipped cases as failures, but they are never selected for rerun here, meaning skipped cases cannot be retried or surfaced even though the caller requested them to fail the build.
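A sketch of driving the rerun filter off the same predicate that decides the exit code, assuming is_failure is importable here:

```python
# Re-run exactly the cases that would fail the build under the current flags,
# so --fail-on skipped (or any future mode) is honored automatically.
target_ids = {
    case_id
    for case_id, res in baseline_for_filter.items()
    if is_failure(res.status, fail_on, require_assert)
}
```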
@codex, please review
💡 Codex Review
examples/demo_qa/batch.py (outdated)
bad_count = counts.get("mismatch", 0) + counts.get("failed", 0) + counts.get("error", 0)
unchecked = counts.get("unchecked", 0)
plan_only = counts.get("plan_only", 0)
if args.require_assert or args.fail_on in {"unchecked", "any"}:
    bad_count += unchecked + plan_only
Include skipped failures in BAD summary when requested
When running demo_qa batch --fail-on skipped, is_failure treats skipped cases as failures and the command exits non‑zero, but the BAD count printed in the summary is computed only from mismatches/errors (lines 317‑321) and never adds skipped cases. That leads to summaries like BAD: 0 even though the exit code indicates failure due solely to skipped cases, which is misleading for users trying to triage runs. Add skipped statuses to the BAD tally when fail_on includes skipped so the summary aligns with the exit criteria.
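A sketch of tallying BAD through the same predicate, assuming counts maps status names to occurrence counts and is_failure is in scope:

```python
# Keep the printed BAD number aligned with the exit criteria, including
# modes such as --fail-on skipped.
bad_count = sum(
    n for status, n in counts.items()
    if is_failure(status, args.fail_on, args.require_assert)
)
```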
💡 Codex Review
examples/demo_qa/settings.py (outdated)
from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator
from pydantic_settings.sources.providers.toml import TomlConfigSettingsSource
Demo extra pins incompatible pydantic-settings
The new import path for TomlConfigSettingsSource (pydantic_settings.sources.providers.toml) requires a newer pydantic-settings release, but the demo/dev extras in pyproject.toml still declare pydantic-settings>=2.2. An environment installed as recommended (pip install -e .[demo]) can therefore satisfy the constraint with 2.2.x, and the CLI then crashes at module import with ModuleNotFoundError: No module named 'pydantic_settings.sources.providers' before it can show the friendly dependency message. Please raise the extra's minimum version to the level implied by this import (or revert to the old import path) so a default install can run the new batch commands.
@codex, please review
💡 Codex Review
from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator
from pydantic_settings.sources.providers.toml import TomlConfigSettingsSource

try:
    from pydantic_settings import BaseSettings, SettingsConfigDict, TomlConfigSettingsSource
    from pydantic_settings import BaseSettings, SettingsConfigDict
Fix TomlConfigSettingsSource import path
The settings module now imports TomlConfigSettingsSource from pydantic_settings.sources.providers.toml, but pydantic-settings 2.x exposes this class from the top-level pydantic_settings package (and from pydantic_settings.sources) and does not guarantee a sources.providers package. The module raises ModuleNotFoundError as soon as it is imported (before the guarded try/except), so any CLI command or test that loads examples.demo_qa.settings fails even when the declared dependency (pydantic-settings>=2.12) is installed. Please restore the import to the public location so configuration loading works.
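A sketch of importing from the public locations with a guarded fallback, assuming the module is meant to print a friendly message when pydantic-settings is absent (the message wording is an assumption):

```python
try:
    # Public top-level export in recent pydantic-settings releases.
    from pydantic_settings import BaseSettings, SettingsConfigDict, TomlConfigSettingsSource
except ImportError:
    try:
        # Older layouts expose the class from the sources module instead.
        from pydantic_settings import BaseSettings, SettingsConfigDict
        from pydantic_settings.sources import TomlConfigSettingsSource
    except ImportError as exc:
        raise SystemExit(
            "examples.demo_qa requires pydantic-settings; install it with "
            "`pip install -e .[demo]`."
        ) from exc
```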
@codex, please review
💡 Codex Review
base_by_id = {res.id: res for res in base_results}
new_by_id = {res.id: res for res in new_results}
all_ids = sorted(new_by_id.keys())
Diff ignores baseline-only cases
The comparison logic only iterates over new_results IDs (all_ids = sorted(new_by_id.keys())), so any case present in the baseline but missing from the new results is silently dropped from new_fail/still_fail and from the JUnit tests count. When a batch run skips or truncates cases (e.g., a crash or running a filtered subset), the compare report will show no regression for those missing cases and the test count will shrink instead of flagging them, masking real failures. Consider diffing over the union of baseline and new IDs so omitted cases are reported explicitly.
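A sketch of diffing over the union of IDs so baseline-only cases are reported explicitly (the dict names follow the snippet above; the missing bucket is an assumption):

```python
# Walk the union so a case that disappeared from the new run is surfaced
# instead of silently shrinking the comparison and the JUnit test count.
all_ids = sorted(set(base_by_id) | set(new_by_id))
missing_in_new: list[str] = []
for case_id in all_ids:
    base_res = base_by_id.get(case_id)
    new_res = new_by_id.get(case_id)
    if new_res is None:
        missing_in_new.append(case_id)  # present in baseline, absent now
        continue
    # ... existing new_fail / still_fail classification ...
```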
…logic-in-runner.py: Handle missing cases when comparing demo QA runs
…lexanderOnischenko/fetchgraph into codex/add-batch-command-to-demo_qa-cli
@codex, please review
💡 Codex Review
error_artifacts = artifacts or RunArtifacts(run_id=run_id, run_dir=runs_root, question=line)
error_artifacts.error = error_artifacts.error or str(exc)
last_artifacts = error_artifacts
_save_artifacts(error_artifacts)
save_artifacts(error_artifacts)
Save REPL error artifacts to unique run folders
In the REPL fallback path, if the agent run raises before returning a RunResult, the code builds the placeholder RunArtifacts with run_dir=runs_root (the shared parent directory) and then writes error artifacts there. This mixes error files into the root of the runs tree and can overwrite or obscure artifacts from other runs, and the /run command will no longer point to a per-run folder. The error path should use the same per-run directory layout as the success path so failures are isolated and do not clobber existing artifacts.
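A sketch of isolating the error path in its own per-run folder, assuming the success path already writes to runs_root/<run_id> (that layout is an assumption):

```python
# Mirror the success path's per-run layout so error artifacts never land in
# the shared runs root and cannot clobber artifacts from other runs.
error_dir = runs_root / run_id
error_dir.mkdir(parents=True, exist_ok=True)
error_artifacts = artifacts or RunArtifacts(run_id=run_id, run_dir=error_dir, question=line)
error_artifacts.error = error_artifacts.error or str(exc)
last_artifacts = error_artifacts
save_artifacts(error_artifacts)
```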
…py-into-runs-and-commands:
- move run layout, scope, effective, coverage, and case-history helpers into a new examples/demo_qa/runs package
- extract history and report handlers into examples/demo_qa/commands and simplify batch.py orchestration and CLI imports
- move run result writing into a neutral runs/io.py, relocate the effective diff reader, and streamline report helper dependencies
- carry the effective fail policy into tag reports, clean helper imports, and avoid private reason helpers
…-resilient-to-interruptions: Handle partial batch runs and missed case selection
Add run tagging
Summary
Testing
python -m examples.demo_qa.cli batch --help (fails in this environment because pydantic-settings is not installed)
Codex Task