Benchmark Results

We benchmarked all 10 supported models to see which ones find real issues and which mostly generate noise. This data directly informs our recommended model tiers.

Methodology

Repos: python-dotenv (34 files, ~52K tokens) and noxaudit itself (88 files, ~126K tokens)
Focus: All 7 areas (security, docs, patterns, testing, hygiene, dependencies, performance)
Method: Batch API on all providers (50% discount), 1 run per model per repo
Quality validation: Cross-model consensus — issues found by 4+ models (out of 10) are considered "real"
Total spend: $2.13
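
The consensus rule above can be sketched in a few lines. This is a minimal illustration, not noxaudit's actual implementation: the function name `consensus_issues`, the model names, and the issue keys are all hypothetical, and it assumes findings from different models have already been normalized to comparable keys (in practice that matching step is the hard part).

```python
from collections import Counter

def consensus_issues(findings_by_model, min_models=4):
    """Return {issue_key: model_count} for issues reported by at least
    `min_models` distinct models. Names and keys here are hypothetical."""
    counts = Counter()
    for issues in findings_by_model.values():
        # Deduplicate per model so a repeated finding counts only once.
        counts.update(set(issues))
    return {issue: n for issue, n in counts.items() if n >= min_models}

# Made-up example data, for illustration only (not benchmark output).
findings_by_model = {
    "model_a": ["shell-injection", "unpinned-deps", "style-nit"],
    "model_b": ["shell-injection", "unpinned-deps"],
    "model_c": ["shell-injection", "broken-link"],
    "model_d": ["shell-injection", "unpinned-deps", "unpinned-deps"],
    "model_e": ["unpinned-deps"],
}

real = consensus_issues(findings_by_model, min_models=4)
print(real)
```

With the threshold at 4, only the two issues reported by four distinct models survive; the single-model findings are treated as unconfirmed rather than as definitely wrong.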

Scorecard

| Model | python-dotenv | noxaudit | Total Findings | Cost | $/finding |
|---|---|---|---|---|---|
| gpt-5-nano | 4 | 6 | 10 | $0.01 | $0.001 |
| gpt-5-mini | 15 | 24 | 39 | $0.03 | $0.001 |
| gemini-2.5-flash | 18 | 16 | 34 | $0.07 | $0.002 |
| gemini-3-flash-preview | 8 | 10 | 18 | $0.10 | $0.005 |
| claude-haiku-4-5 | 24 | 15 | 39 | $0.11 | $0.003 |
| o4-mini | 8 | 6 | 14 | $0.20 | $0.014 |
| gpt-5.4 | 32 | 52 | 84 | $0.26 | $0.003 |
| gemini-2.5-pro | 17 | 21 | 38 | $0.33 | $0.009 |
| claude-sonnet-4-6 | 30 | 48 | 78 | $0.38 | $0.005 |
| claude-opus-4-6 | 40 | 51 | 91 | $0.65 | $0.007 |
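
The $/finding column is total cost divided by total findings. A short sketch of that arithmetic follows, using the figures from the scorecard; the `scorecard` dictionary and `cost_per_finding` helper are illustrative names, not part of noxaudit. Because the published costs are rounded to the cent, recomputed values can differ from the table in the last digit.

```python
# (python-dotenv findings, noxaudit findings, cost in USD) per model,
# copied from the scorecard. Costs are rounded to the cent.
scorecard = {
    "gpt-5-nano": (4, 6, 0.01),
    "gpt-5-mini": (15, 24, 0.03),
    "gemini-2.5-flash": (18, 16, 0.07),
    "gemini-3-flash-preview": (8, 10, 0.10),
    "claude-haiku-4-5": (24, 15, 0.11),
    "o4-mini": (8, 6, 0.20),
    "gpt-5.4": (32, 52, 0.26),
    "gemini-2.5-pro": (17, 21, 0.33),
    "claude-sonnet-4-6": (30, 48, 0.38),
    "claude-opus-4-6": (40, 51, 0.65),
}

def cost_per_finding(cost, findings):
    """Cost divided by finding count; infinite when there are no findings."""
    return cost / findings if findings else float("inf")

for model, (dotenv, noxaudit, cost) in scorecard.items():
    total = dotenv + noxaudit
    print(f"{model:<24} {total:>3} findings  ${cost_per_finding(cost, total):.3f}/finding")
```

Note that this metric counts every finding equally, so it rewards volume; it says nothing about whether a finding is real.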

Quality Analysis

python-dotenv served as a "canary" — it's a small, well-maintained package, so we can manually verify whether findings are real. We identified 6 confirmed real issues via cross-model consensus (found by 4+ models):

| Issue | Models (of 10) | Verdict |
|---|---|---|
| `get_cli_string` shell injection risk | 8 | Real: genuine security concern |
| `test_list` uses builtin `format` instead of `output_format` | 6 | Real: actual code bug |
| Duplicate files (README/CHANGELOG/CONTRIBUTING in docs/) | 6 | Real: maintenance burden |
| Broken mkdocs link (empty href) | 5 | Real: broken documentation |
| Unpinned dev dependencies | 5 | Real: reproducibility issue |
| Incorrect pre-commit command (`precommit` vs `pre-commit`) | 4 | Real: wrong package name |

Per-Model Quality

| Model | Consensus (of 6) | Noise Level | Cost | Verdict |
|---|---|---|---|---|
| claude-sonnet-4-6 | 6/6 | Low | $0.38 | Best precision |
| gpt-5.4 | 5/6 | Low | $0.26 | Best mid-tier |
| gpt-5-mini | 5/6 | Low | $0.03 | Best daily value |
| claude-opus-4-6 | 6/6 | Moderate | $0.65 | Most findings overall |
| claude-haiku-4-5 | 4/6 | Moderate | $0.11 | Decent but pads with nits |
| gemini-2.5-pro | 3/6 | Low | $0.33 | Poor value vs gpt-5.4 |
| o4-mini | 3/6 | Moderate | $0.20 | Reasoning tokens wasted |
| gemini-2.5-flash | 2/6 | Moderate | $0.07 | Misses too much |
| gemini-3-flash-preview | 2/6 | Low | $0.10 | Preview; fewer findings than 2.5-flash |
| gpt-5-nano | 2/6 | Low | $0.01 | Too shallow |

Recommended Tiers

Based on quality-adjusted cost:

| Tier | Model | Cost/Run | Rationale |
|---|---|---|---|
| Daily | `gpt-5-mini` | $0.03 | 5/6 consensus issues, minimal noise, cheapest viable model |
| Deep dive | `gpt-5.4` | $0.26 | 84 findings total (vs. Sonnet's 78) and 5/6 consensus issues, at 68% of Sonnet's cost |
| Premium | `claude-opus-4-6` | $0.65 | Most findings overall, best for maximum depth |
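
The page does not define "quality-adjusted cost" precisely. One possible reading, sketched below, is cost per consensus issue recovered on python-dotenv, restricted to models that clear a minimum hit count. The `MIN_HITS` threshold of 4 and all names in the code are assumptions made for illustration; the hit counts and costs come from the Per-Model Quality table.

```python
# (consensus issues found out of 6, cost in USD) per model,
# copied from the Per-Model Quality table.
results = {
    "claude-sonnet-4-6": (6, 0.38),
    "gpt-5.4": (5, 0.26),
    "gpt-5-mini": (5, 0.03),
    "claude-opus-4-6": (6, 0.65),
    "claude-haiku-4-5": (4, 0.11),
    "gemini-2.5-pro": (3, 0.33),
    "o4-mini": (3, 0.20),
    "gemini-2.5-flash": (2, 0.07),
    "gemini-3-flash-preview": (2, 0.10),
    "gpt-5-nano": (2, 0.01),
}

MIN_HITS = 4  # hypothetical cutoff: ignore models that miss too much

# Rank the remaining models by cost per consensus issue, cheapest first.
ranking = sorted(
    ((model, cost / hits) for model, (hits, cost) in results.items() if hits >= MIN_HITS),
    key=lambda pair: pair[1],
)

for model, usd_per_hit in ranking:
    print(f"{model:<20} ${usd_per_hit:.3f} per consensus issue")

# Cost of gpt-5.4 relative to claude-sonnet-4-6.
ratio = results["gpt-5.4"][1] / results["claude-sonnet-4-6"][1]
print(f"gpt-5.4 / sonnet cost ratio: {ratio:.2f}")
```

The cutoff matters: without it, gpt-5-nano would rank first on price alone despite finding only 2 of the 6 consensus issues. The costs also cover both repos while the hit counts cover only python-dotenv, so treat the ranking as a rough guide.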

Note

Our initial assumption was "Gemini Flash for daily audits" — the benchmark disproved this. gpt-5-mini is cheaper AND finds more real issues.

Dropped Models

o3: 0 findings on python-dotenv and 7 on noxaudit, at $0.33. Reasoning tokens were wasted on a non-reasoning task. Removed from supported models.
gemini-2.0-flash: Deprecated. Returns errors in the batch API.

Notes

All costs include 50% batch API discount
OpenAI reasoning models (o3, o4-mini) bill hidden reasoning tokens as output — poor cost efficiency for auditing tasks
python-dotenv's small size makes it a good canary: high finding counts on a clean repo may indicate hallucination
Different models genuinely find different things: only 6 issues on python-dotenv reached cross-model consensus, confirming the value of provider rotation