Benchmark Results

We benchmarked all 10 supported models to see which ones find real issues and which mostly generate noise. This data directly informs our recommended model tiers.

Methodology

  • Repos: python-dotenv (34 files, ~52K tokens) and noxaudit itself (88 files, ~126K tokens)
  • Focus: All 7 areas (security, docs, patterns, testing, hygiene, dependencies, performance)
  • Method: Batch API on all providers (50% discount), 1 run per model per repo
  • Quality validation: Cross-model consensus — issues found by 4+ models (out of 10) are considered "real"
  • Total spend: $2.13
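The consensus rule above can be sketched in a few lines. This is a minimal illustration, not noxaudit's actual implementation: it assumes each model's findings have already been normalized to a shared issue key (how findings are matched across models is not described here), and the function name `consensus_issues` is hypothetical.

```python
from collections import defaultdict

CONSENSUS_THRESHOLD = 4  # an issue counts as "real" when 4+ models report it


def consensus_issues(findings_by_model, threshold=CONSENSUS_THRESHOLD):
    """Return {issue_key: model_count} for issues reported by >= threshold models.

    findings_by_model maps a model name to an iterable of issue keys.
    Assumes the keys are already normalized so that the same issue
    reported by two models has the same key (hypothetical preprocessing).
    """
    models_per_issue = defaultdict(set)
    for model, issue_keys in findings_by_model.items():
        for key in issue_keys:
            # A set ensures a model that reports an issue twice counts once.
            models_per_issue[key].add(model)
    return {
        key: len(models)
        for key, models in models_per_issue.items()
        if len(models) >= threshold
    }
```

Using a set per issue means duplicate reports from a single model cannot push an issue over the threshold on their own.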

Scorecard

| Model | dotenv | noxaudit | Total Findings | Cost | $/finding |
|---|---|---|---|---|---|
| gpt-5-nano | 4 | 6 | 10 | $0.01 | $0.001 |
| gpt-5-mini | 15 | 24 | 39 | $0.03 | $0.001 |
| gemini-2.5-flash | 18 | 16 | 34 | $0.07 | $0.002 |
| gemini-3-flash-preview | 8 | 10 | 18 | $0.10 | $0.005 |
| claude-haiku-4-5 | 24 | 15 | 39 | $0.11 | $0.003 |
| o4-mini | 8 | 6 | 14 | $0.20 | $0.014 |
| gpt-5.4 | 32 | 52 | 84 | $0.26 | $0.003 |
| gemini-2.5-pro | 17 | 21 | 38 | $0.33 | $0.009 |
| claude-sonnet-4-6 | 30 | 48 | 78 | $0.38 | $0.005 |
| claude-opus-4-6 | 40 | 51 | 91 | $0.65 | $0.007 |
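The $/finding column is the run cost divided by the total findings across both repos. The sketch below recomputes it from the scorecard figures; the names `SCORECARD` and `cost_per_finding` are illustrative. Because the published costs are rounded to the cent, a recomputed value can differ from the table in the last digit.

```python
# model: (dotenv findings, noxaudit findings, cost in USD), from the scorecard
SCORECARD = {
    "gpt-5-nano": (4, 6, 0.01),
    "gpt-5-mini": (15, 24, 0.03),
    "gemini-2.5-flash": (18, 16, 0.07),
    "gemini-3-flash-preview": (8, 10, 0.10),
    "claude-haiku-4-5": (24, 15, 0.11),
    "o4-mini": (8, 6, 0.20),
    "gpt-5.4": (32, 52, 0.26),
    "gemini-2.5-pro": (17, 21, 0.33),
    "claude-sonnet-4-6": (30, 48, 0.38),
    "claude-opus-4-6": (40, 51, 0.65),
}


def cost_per_finding(dotenv, noxaudit, cost):
    """Cost in USD per finding; infinite when a model reported nothing."""
    total = dotenv + noxaudit
    return cost / total if total else float("inf")


if __name__ == "__main__":
    # Print models from cheapest to most expensive per finding.
    ranked = sorted(SCORECARD.items(), key=lambda item: cost_per_finding(*item[1]))
    for model, row in ranked:
        print(f"{model:<24} ${cost_per_finding(*row):.4f}")
```

Note that cost per finding says nothing about whether the findings are real, which is why the quality analysis is needed as a second step.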

Quality Analysis

python-dotenv served as a "canary" — it's a small, well-maintained package, so we can manually verify whether findings are real. We identified 6 confirmed real issues via cross-model consensus (found by 4+ models):

| Issue | Models (of 10) | Verdict |
|---|---|---|
| get_cli_string shell injection risk | 8 | Real: genuine security concern |
| test_list uses builtin format instead of output_format | 6 | Real: actual code bug |
| Duplicate files (README/CHANGELOG/CONTRIBUTING in docs/) | 6 | Real: maintenance burden |
| Broken mkdocs link (empty href) | 5 | Real: broken documentation |
| Unpinned dev dependencies | 5 | Real: reproducibility issue |
| Incorrect pre-commit command (precommit vs pre-commit) | 4 | Real: wrong package name |

Per-Model Quality

| Model | Consensus (of 6) | Noise Level | Cost | Verdict |
|---|---|---|---|---|
| claude-sonnet-4-6 | 6/6 | Low | $0.38 | Best precision |
| gpt-5.4 | 5/6 | Low | $0.26 | Best mid-tier |
| gpt-5-mini | 5/6 | Low | $0.03 | Best daily value |
| claude-opus-4-6 | 6/6 | Moderate | $0.65 | Most findings overall |
| claude-haiku-4-5 | 4/6 | Moderate | $0.11 | Decent but pads with nits |
| gemini-2.5-pro | 3/6 | Low | $0.33 | Poor value vs gpt-5.4 |
| o4-mini | 3/6 | Moderate | $0.20 | Reasoning tokens wasted |
| gemini-2.5-flash | 2/6 | Moderate | $0.07 | Misses too much |
| gemini-3-flash-preview | 2/6 | Low | $0.10 | Preview; fewer findings than 2.5-flash |
| gpt-5-nano | 2/6 | Low | $0.01 | Too shallow |

Based on quality-adjusted cost, we recommend the following model tiers:

| Tier | Model | Cost/Run | Rationale |
|---|---|---|---|
| Daily | gpt-5-mini | $0.03 | 5/6 consensus issues, minimal noise, cheapest viable model |
| Deep dive | gpt-5.4 | $0.26 | 84 findings total (vs. 78 for Sonnet) and 5/6 consensus issues at 68% of Sonnet's cost |
| Premium | claude-opus-4-6 | $0.65 | Most findings overall, best for maximum depth |

Note

Our initial assumption was "Gemini Flash for daily audits" — the benchmark disproved this. gpt-5-mini is cheaper AND finds more real issues.
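How "quality-adjusted cost" was computed is not spelled out here. One plausible reading, sketched below as an assumption, is to require a minimum number of consensus hits and then rank the remaining models by cost per run. The floor of 5 hits and the names `QUALITY` and `shortlist` are hypothetical and chosen only for illustration.

```python
# model: (consensus hits out of 6, cost per run in USD), from the tables above
QUALITY = {
    "claude-sonnet-4-6": (6, 0.38),
    "gpt-5.4": (5, 0.26),
    "gpt-5-mini": (5, 0.03),
    "claude-opus-4-6": (6, 0.65),
    "claude-haiku-4-5": (4, 0.11),
    "gemini-2.5-pro": (3, 0.33),
    "o4-mini": (3, 0.20),
    "gemini-2.5-flash": (2, 0.07),
    "gemini-3-flash-preview": (2, 0.10),
    "gpt-5-nano": (2, 0.01),
}

MIN_CONSENSUS_HITS = 5  # hypothetical quality floor, not a documented setting


def shortlist(quality, min_hits=MIN_CONSENSUS_HITS):
    """Return models meeting the quality floor, cheapest first."""
    eligible = [
        (cost, model)
        for model, (hits, cost) in quality.items()
        if hits >= min_hits
    ]
    return [model for cost, model in sorted(eligible)]
```

With this floor, gpt-5-nano is excluded despite being the cheapest model, and gpt-5-mini comes out as the cheapest eligible option, which is consistent with the Daily tier above. Noise level and total findings are not modeled in this sketch.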

Dropped Models

  • o3: 0 findings on python-dotenv, 7 on noxaudit at $0.33. Reasoning tokens were wasted on a non-reasoning task, so it was removed from the supported models.
  • gemini-2.0-flash: Deprecated. Returns errors in batch API.

Notes

  • All costs include 50% batch API discount
  • OpenAI reasoning models (o3, o4-mini) bill hidden reasoning tokens as output — poor cost efficiency for auditing tasks
  • python-dotenv's small size makes it a good canary: high finding counts on a clean repo may indicate hallucination
  • Different models genuinely find different things — only 6 issues had cross-model consensus, confirming the value of provider rotation
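To see why hidden reasoning tokens hurt cost efficiency, the sketch below estimates the cost of one batch run when reasoning tokens are billed at the output rate. The function name and all prices and token counts in the usage example are placeholders for illustration, not real provider pricing or measured values; only the ~52K input tokens and the 50% batch discount come from this page.

```python
def batch_run_cost(input_tokens, output_tokens, reasoning_tokens,
                   input_price_per_mtok, output_price_per_mtok,
                   batch_discount=0.5):
    """Estimate the USD cost of one batch run.

    Reasoning tokens are added to the visible output tokens because they
    are billed as output. Prices are in USD per million tokens.
    """
    billed_output = output_tokens + reasoning_tokens
    full_price = (
        input_tokens * input_price_per_mtok
        + billed_output * output_price_per_mtok
    ) / 1_000_000
    return full_price * (1 - batch_discount)


if __name__ == "__main__":
    # Placeholder prices ($1 / $4 per million tokens) and token counts.
    plain = batch_run_cost(52_000, 2_000, 0, 1.0, 4.0)
    reasoning = batch_run_cost(52_000, 2_000, 10_000, 1.0, 4.0)
    print(f"without reasoning tokens: ${plain:.3f}")
    print(f"with reasoning tokens:    ${reasoning:.3f}")
```

In this made-up example the reasoning tokens add to the bill without adding any visible findings, which is the pattern the note above describes for o3 and o4-mini.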