Gemini 3.5 Flash benchmark
A snapshot of benchmark scores across popular evaluation categories. Higher scores are better unless noted. The source status of each figure is tracked separately from SEO keyword demand.
Last updated:
Benchmark snapshot
| Benchmark | Gemini 3.5 Flash | Gemini 3.5 Pro | Kimi K2.6 | GPT-5 (API) | GPT-4o |
|---|---|---|---|---|---|
| MMLU-Pro | 87.3 | 90.2 | 84.6 | 89.6 | 86.1 |
| GPQA | 71.4 | 76.2 | 68.5 | 76.3 | 68.2 |
| HumanEval+ | 92.1 | 94.5 | 88.2 | 94.5 | 90.7 |
| AIME 2024 | 66.0 | 72.1 | 61.8 | 72.1 | 63.4 |
Latency vs cost
Models plotted: Gemini 3.5 Flash, Gemini 3.5 Pro, Kimi K2.6, GPT-5 (API), GPT-4o
This visual is an implementation placeholder for the launch chart. V1 keeps the data table crawlable and the methodology visible.
Methodology
ModelMeter stores each benchmark snapshot with its source URL and capture date. Watchlist models are clearly labeled while official confirmation is still required. Future refresh jobs will run through Cloudflare Cron Triggers and Queues.
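The record shape described above could look roughly like the following. This is a minimal sketch, not ModelMeter's actual schema: the type `BenchmarkSnapshot`, its field names, and the helpers `displayLabel` and `staleSnapshots` are all hypothetical, and the wiring into Cloudflare scheduled jobs and queues is deliberately left out.

```typescript
// Hypothetical shapes for illustration only; none of these names come from ModelMeter.
type SourceStatus = "confirmed" | "watchlist";

interface BenchmarkSnapshot {
  model: string;
  benchmark: string;
  score: number;
  sourceUrl: string;    // where the figure was captured from
  capturedAt: string;   // ISO 8601 capture date, e.g. "2025-01-01"
  status: SourceStatus; // "watchlist" means official confirmation is still required
}

// Label shown next to a model name; watchlist entries are flagged explicitly.
function displayLabel(s: BenchmarkSnapshot): string {
  return s.status === "watchlist" ? `${s.model} (watchlist: unconfirmed)` : s.model;
}

// Select snapshots captured more than maxAgeDays before `now`.
// A refresh job could enqueue one message per snapshot returned here.
function staleSnapshots(
  all: BenchmarkSnapshot[],
  now: Date,
  maxAgeDays: number
): BenchmarkSnapshot[] {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return all.filter((s) => new Date(s.capturedAt).getTime() < cutoff);
}
```

Keeping `status` on the record itself, rather than in a separate list, means the watchlist label travels with every figure wherever it is rendered. Passing `now` as a parameter keeps the staleness check deterministic and easy to test.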