
Conversation

@AlexanderOnischenko
Owner

Summary

  • refactor the demo QA REPL to use a shared agent runner with persisted artifacts
  • add a batch CLI command that executes JSONL cases, writes per-case outputs, summarizes results, and controls failure handling
  • remove the experimental local LLM cache hooks since caching is handled by the provider

Testing

  • ⚠️ python -m examples.demo_qa.cli batch --help (fails in this environment because pydantic-settings is not installed)

Codex Task


@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +168 to +172
    passed = answer.strip() == case.expected.strip()
    detail = None if passed else f"expected={case.expected!r}, got={answer!r}"
    return ExpectedCheck(mode="exact", expected=case.expected, passed=passed, detail=detail)
if case.expected_regex is not None:
    pattern = re.compile(case.expected_regex)

P1: Batch cases crash on malformed expected fields

_match_expected unconditionally calls string methods/regex compilation on the user-supplied expectations. If a JSONL case provides a non-string expected (or expected_contains) or an invalid expected_regex, these lines raise (e.g., AttributeError or re.error), aborting the entire batch run instead of returning a mismatch/error result that respects --fail-on/--max-fails. The batch CLI therefore cannot tolerate a single malformed case entry.
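A minimal defensive sketch (using the ExpectedCheck/Case shapes visible in this diff; the helper names _check_exact/_check_regex are hypothetical) that coerces non-string expectations and catches re.error so a malformed case yields a failed check instead of aborting the batch:

import re

# Hedged sketch: guard exact/regex matching against malformed case data.
def _check_exact(case, answer):
    expected = case.expected
    if not isinstance(expected, str):
        return ExpectedCheck(mode="exact", expected=repr(expected), passed=False,
                             detail=f"expected must be a string, got {type(expected).__name__}")
    passed = answer.strip() == expected.strip()
    detail = None if passed else f"expected={expected!r}, got={answer!r}"
    return ExpectedCheck(mode="exact", expected=expected, passed=passed, detail=detail)

def _check_regex(case, answer):
    try:
        pattern = re.compile(case.expected_regex)
    except (re.error, TypeError) as exc:
        return ExpectedCheck(mode="regex", expected=str(case.expected_regex), passed=False,
                             detail=f"invalid expected_regex: {exc}")
    passed = pattern.search(answer) is not None
    detail = None if passed else f"regex={case.expected_regex!r} did not match {answer!r}"
    return ExpectedCheck(mode="regex", expected=case.expected_regex, passed=passed, detail=detail)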

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines 77 to 79
case_p = sub.add_subparsers(dest="case_command")
case_root = sub.add_parser("case", help="Single-case utilities")
case_sub = case_root.add_subparsers(dest="case_command", required=True)

P1: CLI parser crashes building subcommands

Any invocation of the demo CLI fails before reaching the handlers because sub is an argparse _SubParsersAction (from parser.add_subparsers), which does not provide add_subparsers; calling it raises AttributeError during parser construction. As a result python -m examples.demo_qa.cli ... aborts immediately instead of running chat/batch/case commands. This line should be removed or the nested subparsers should be added to the case parser instead.
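A small self-contained sketch of the intended argparse layout, with the nested subparsers attached to the "case" parser rather than to the _SubParsersAction (names mirror the excerpt; the surrounding parser and the "show" subcommand are assumptions):

import argparse

parser = argparse.ArgumentParser(prog="demo_qa")
sub = parser.add_subparsers(dest="command", required=True)

# Nested subcommands hang off the "case" parser, not off `sub` itself.
case_root = sub.add_parser("case", help="Single-case utilities")
case_sub = case_root.add_subparsers(dest="case_command", required=True)
case_sub.add_parser("show", help="Show a single case")  # hypothetical subcommand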

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines +184 to +188
def _match_expected(case: Case, answer: str | None) -> ExpectedCheck | None:
    if not case.has_asserts:
        return None
    expected_value = case.expected or case.expected_regex or case.expected_contains or ""
    if answer is None:

P1: Plan-only runs fail all cases with expectations

In _match_expected a missing answer is treated as a failed assertion (ExpectedCheck(... passed=False, detail="no answer")). When --plan-only is used, run_question never populates artifacts.answer, so any case with expected/expected_* is marked mismatch and counted as a failure even though the user explicitly skipped fetch/synthesis. This makes plan-only mode unusable on asserted datasets (the batch exit code will be non-zero even when nothing actually ran). The expectation check should be skipped or downgraded when plan_only is set.
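One possible shape of the fix, as a sketch only: thread the plan_only flag into _match_expected and skip (or downgrade) the check when synthesis was intentionally not attempted.

# Hedged sketch: treat a deliberately skipped answer as "unchecked", not a mismatch.
def _match_expected(case, answer, *, plan_only=False):
    if not case.has_asserts:
        return None
    if plan_only:
        # Fetch/synthesis was intentionally skipped; leave the case unchecked.
        return None
    if answer is None:
        expected_value = case.expected or case.expected_regex or case.expected_contains or ""
        return ExpectedCheck(mode="none", expected=expected_value, passed=False, detail="no answer")
    ...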

Comment on lines 186 to 190
baseline_filter_path = args.only_failed_from
if args.only_failed and not baseline_filter_path:
    latest = _load_latest_run(artifacts_dir)
    if latest:
        baseline_filter_path = latest / "results.jsonl"

P2: Latest-run pointer ignores custom --out results location

--only-failed derives its baseline from runs/latest.txt by appending results.jsonl, but latest.txt always points to the artifacts run folder (written at the end of each batch) and does not take a custom --out into account. If a prior run wrote results elsewhere via --out, a subsequent --only-failed run will look for <run_folder>/results.jsonl, hit load_results, and exit with an error even though the baseline exists at the user-provided path. The latest pointer should reference the actual results path or --only-failed should honor --out when locating the baseline.
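A sketch of one way to make the pointer self-describing: record the actual results path in latest.txt when the batch finishes, so --only-failed resolves whatever --out produced. The runs/latest.txt layout and both helper names here are assumptions, not the repo's current code.

from pathlib import Path

# Hedged sketch: store the results file path, not just the run folder.
def _write_latest_run(artifacts_dir: Path, results_path: Path) -> None:
    (artifacts_dir / "runs" / "latest.txt").write_text(str(results_path.resolve()))

def _load_latest_results(artifacts_dir: Path) -> Path | None:
    pointer = artifacts_dir / "runs" / "latest.txt"
    if not pointer.exists():
        return None
    recorded = Path(pointer.read_text().strip())
    return recorded if recorded.exists() else None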

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines 46 to 49
if fail_on == "error":
    bad = {"error"}
elif fail_on == "mismatch":
    bad = {"mismatch"}

P1: --fail-on mismatch treats errors as success

is_failure resets the failure set to only {"mismatch"} when fail_on is "mismatch", dropping "error" and "failed". With that flag, a batch run where cases hit provider/LLM errors will report exit_code = 0 and a clean summary even though nothing executed successfully, which defeats CI/automation that expects errors to fail the run. The mismatch mode should still consider errors failing instead of silently treating them as success.
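The smallest change that keeps errors failing in mismatch mode is to widen that branch (a sketch; this mirrors what a later revision of the diff appears to do):

# Hedged sketch: mismatch mode still treats hard errors/failures as failing.
if fail_on == "error":
    bad = {"error"}
elif fail_on == "mismatch":
    bad = {"error", "failed", "mismatch"}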

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines 41 to 55
def is_failure(status: str, fail_on: str, require_assert: bool) -> bool:
    bad = {"error", "failed", "mismatch"}
    unchecked = {"unchecked", "plan_only"}
    if require_assert:
        bad |= unchecked
    if fail_on == "error":
        bad = {"error"}
    elif fail_on == "mismatch":
        bad = {"error", "failed", "mismatch"}
    elif fail_on == "unchecked":
        bad |= unchecked
    elif fail_on == "bad":
        bad = {"error", "failed", "mismatch"}
        if require_assert:
            bad |= unchecked

P1: Honor --require-assert when fail_on=mismatch/error

is_failure adds unchecked/plan_only statuses when require_assert is set, but for fail_on == "mismatch" (and "error") it immediately overwrites the set and drops those statuses, so --require-assert is silently ignored unless the user also passes --fail-on bad/unchecked/any. A batch like demo_qa.cli batch --fail-on mismatch --require-assert ... will exit 0 even if every case is unchecked/plan_only, defeating the flag’s intent to fail missing assertions.
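A sketch of is_failure reordered so --require-assert is applied after the --fail-on selection and therefore survives every mode. Status names come from the excerpt above; other modes (e.g. "skipped") would extend the branches the same way.

def is_failure(status: str, fail_on: str, require_assert: bool) -> bool:
    unchecked = {"unchecked", "plan_only"}
    bad = {"error", "failed", "mismatch"}
    if fail_on == "error":
        bad = {"error"}
    elif fail_on in {"unchecked", "any"}:
        bad |= unchecked
    # "mismatch" and "bad" keep the default hard-failure set.
    if require_assert:
        # Applied after the fail_on selection so the flag is honored in every mode.
        bad |= unchecked
    return status in bad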

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines +185 to +189
if not case.has_asserts:
    return None
expected_value = case.expected or case.expected_regex or case.expected_contains or ""
if answer is None:
    return ExpectedCheck(mode="none", expected=expected_value, passed=False, detail="no answer")

P1: Plan-only runs mark asserted cases as mismatches

Plan-only batch/REPL runs still flow through _match_expected, which returns a failed ExpectedCheck whenever answer is None (lines 185‑189). When run_one(..., plan_only=True) executes cases with expected/expected_regex/expected_contains, those cases are recorded as mismatch rather than plan_only, so is_failure(..., fail_on=bad) triggers a non-zero exit even though fetch/synthesis was intentionally skipped. A plan-only dry run should stay plan_only/unchecked or bypass expectation checks to avoid spurious failures when using --plan-only.
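On the runner side, a sketch of how run_one could classify plan-only cases before the expectation check ever runs. The branch structure, the "passed" status name, and the artifacts field access are assumptions based on this thread, not the actual implementation.

# Hedged sketch inside run_one(): classify plan-only runs before matching expectations.
if plan_only:
    status = "plan_only"
    check = None  # never call _match_expected when synthesis was skipped on purpose
elif artifacts.error is not None:
    status = "error"
    check = None
else:
    check = _match_expected(case, artifacts.answer)
    if check is None:
        status = "unchecked"
    else:
        status = "passed" if check.passed else "mismatch"  # success status name is assumed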

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines 166 to 169
bad_statuses = {"mismatch", "failed", "error"}
if require_assert or fail_on in {"unchecked", "any"}:
    bad_statuses |= {"unchecked", "plan_only"}
target_ids = {case_id for case_id, res in baseline_for_filter.items() if res.status in bad_statuses}

P2: Honor fail_on when filtering --only-failed cases

The rerun filter in _select_cases_for_rerun hardcodes bad_statuses to mismatch/failed/error and only extends it for unchecked/plan_only, so it ignores other --fail-on modes such as skipped. With demo_qa batch --only-failed --fail-on skipped, is_failure will treat skipped cases as failures, but they are never selected for rerun here, meaning skipped cases cannot be retried or surfaced even though the caller requested them to fail the build.
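One way to keep the rerun filter aligned with the exit criteria is to reuse is_failure directly instead of a hardcoded status set (sketch; baseline_for_filter as in the excerpt above):

# Hedged sketch: rerun exactly the cases the active policy would count as failures.
target_ids = {
    case_id
    for case_id, res in baseline_for_filter.items()
    if is_failure(res.status, fail_on, require_assert)
}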

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines 317 to 321
bad_count = counts.get("mismatch", 0) + counts.get("failed", 0) + counts.get("error", 0)
unchecked = counts.get("unchecked", 0)
plan_only = counts.get("plan_only", 0)
if args.require_assert or args.fail_on in {"unchecked", "any"}:
    bad_count += unchecked + plan_only

P2: Include skipped failures in BAD summary when requested

When running demo_qa batch --fail-on skipped, is_failure treats skipped cases as failures and the command exits non‑zero, but the BAD count printed in the summary is computed only from mismatches/errors (lines 317‑321) and never adds skipped cases. That leads to summaries like BAD: 0 even though the exit code indicates failure due solely to skipped cases, which is misleading for users trying to triage runs. Add skipped statuses to the BAD tally when fail_on includes skipped so the summary aligns with the exit criteria.
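The summary could be computed from the same predicate so BAD always matches the exit criteria (sketch; counts is assumed to be the per-status tally from the excerpt):

# Hedged sketch: tally every status the active fail policy treats as failing.
bad_count = sum(
    count for status, count in counts.items()
    if is_failure(status, args.fail_on, args.require_assert)
)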

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines 8 to 9
from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator
from pydantic_settings.sources.providers.toml import TomlConfigSettingsSource

P1: Demo extra pins incompatible pydantic-settings

The new import path for TomlConfigSettingsSource (pydantic_settings.sources.providers.toml) requires the newer pydantic-settings releases, but the demo/dev extras in pyproject.toml still declare pydantic-settings>=2.2. Installing the demo as recommended (pip install -e .[demo]) will therefore pull 2.2.x, and the CLI will now crash at module import with ModuleNotFoundError: No module named 'pydantic_settings.sources.providers' before it can show the friendly dependency message. Please raise the extra’s minimum version to the level implied by this import (or revert to the old import path) so a default install can run the new batch commands.

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines 8 to 12
from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator
from pydantic_settings.sources.providers.toml import TomlConfigSettingsSource

try:
    from pydantic_settings import BaseSettings, SettingsConfigDict, TomlConfigSettingsSource
    from pydantic_settings import BaseSettings, SettingsConfigDict

P1: Fix TomlConfigSettingsSource import path

The settings module now imports TomlConfigSettingsSource from pydantic_settings.sources.providers.toml, but pydantic-settings 2.x exposes this class from the top-level pydantic_settings package (and pydantic_settings.sources) and does not ship a sources.providers package. The module raises ModuleNotFoundError as soon as it is imported (before the guarded try/except), so any CLI command or test that loads examples.demo_qa.settings fails even when the declared dependency (pydantic-settings>=2.12) is installed. Please restore the import to the public location so configuration loading works.
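A sketch of a version-tolerant import that prefers the public location and keeps a friendly dependency message; the fallback path is the internal module this PR currently uses, kept only as a last resort, and the SystemExit text is illustrative.

try:
    # Public location in pydantic-settings 2.x.
    from pydantic_settings import BaseSettings, SettingsConfigDict, TomlConfigSettingsSource
except ImportError:  # pydantic-settings missing or too old
    try:
        # Newer internal layout; treated as a fallback, not the primary path.
        from pydantic_settings import BaseSettings, SettingsConfigDict
        from pydantic_settings.sources.providers.toml import TomlConfigSettingsSource
    except ImportError as exc:
        raise SystemExit(
            "examples.demo_qa requires pydantic-settings; install with: pip install -e .[demo]"
        ) from exc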

@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines 597 to 600
base_by_id = {res.id: res for res in base_results}
new_by_id = {res.id: res for res in new_results}
all_ids = sorted(new_by_id.keys())

P1: Diff ignores baseline-only cases

The comparison logic only iterates over new_results IDs (all_ids = sorted(new_by_id.keys())), so any case present in the baseline but missing from the new results is silently dropped from new_fail/still_fail and from the JUnit tests count. When a batch run skips or truncates cases (e.g., a crash or running a filtered subset), the compare report will show no regression for those missing cases and the test count will shrink instead of flagging them, masking real failures. Consider diffing over the union of baseline and new IDs so omitted cases are reported explicitly.
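A sketch of diffing over the union of IDs so baseline-only cases surface explicitly (the "missing" bucket and the print formatting are assumptions; field names follow the excerpt):

# Hedged sketch: diff over the union so baseline-only cases are not dropped silently.
base_by_id = {res.id: res for res in base_results}
new_by_id = {res.id: res for res in new_results}
all_ids = sorted(set(base_by_id) | set(new_by_id))

for case_id in all_ids:
    if case_id not in new_by_id:
        # Surface cases the new run never produced instead of shrinking the test count.
        print(f"MISSING in new run: {case_id} (baseline status: {base_by_id[case_id].status})")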

AlexanderOnischenko and others added 4 commits December 21, 2025 10:54
…logic-in-runner.py

Handle missing cases when comparing demo QA runs
…lexanderOnischenko/fetchgraph into codex/add-batch-command-to-demo_qa-cli
@AlexanderOnischenko
Owner Author

@codex, please review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Comment on lines +127 to +130
error_artifacts = artifacts or RunArtifacts(run_id=run_id, run_dir=runs_root, question=line)
error_artifacts.error = error_artifacts.error or str(exc)
last_artifacts = error_artifacts
_save_artifacts(error_artifacts)
save_artifacts(error_artifacts)

P2: Save REPL error artifacts to unique run folders

In the REPL fallback path, if the agent run raises before returning a RunResult, the code builds the placeholder RunArtifacts with run_dir=runs_root (the shared parent directory) and then writes error artifacts there. This mixes error files into the root of the runs tree and can overwrite or obscure artifacts from other runs, and the /run command will no longer point to a per-run folder. The error path should use the same per-run directory layout as the success path so failures are isolated and do not clobber existing artifacts.
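A sketch of the error path building its own per-run folder instead of writing into the shared runs root (the runs_root/run_id layout is an assumption, not taken from the code):

from pathlib import Path

# Hedged sketch: error artifacts get their own per-run folder under runs_root.
error_dir = Path(runs_root) / run_id
error_dir.mkdir(parents=True, exist_ok=True)
error_artifacts = artifacts or RunArtifacts(run_id=run_id, run_dir=error_dir, question=line)
error_artifacts.error = error_artifacts.error or str(exc)
last_artifacts = error_artifacts
save_artifacts(error_artifacts)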

AlexanderOnischenko and others added 17 commits December 21, 2025 22:25
…py-into-runs-and-commands

- move run layout, scope, effective, coverage, and case-history helpers into a new examples/demo_qa/runs package
- extract history and report handlers into examples/demo_qa/commands and simplify batch.py orchestration and CLI imports
- move run result writing into a neutral runs/io.py, relocate effective diff reader, and streamline report helper dependencies
- carry effective fail policy into tag reports, clean helper imports, and avoid private reason helpers
…-resilient-to-interruptions

Handle partial batch runs and missed case selection