OpenAI introduced GeneBench-Pro on Tuesday, a benchmark designed to assess whether artificial intelligence systems can make the judgment calls typical of computational biology research. The collection comprises 129 problems drawn from domains that include genomics, quantitative biology and translational medicine.
Each problem supplies a model with three elements: a dataset, contextual information about the experimental setup, and a target question. Models are expected not only to analyze the data but also to decide on an appropriate analytical approach and to produce a final answer based on that approach.
OpenAI sent 82 of the 129 problems to outside domain experts, a group that included graduate students, postdoctoral researchers, industry scientists and university professors, for independent assessment. Reviewers evaluated each problem for realism and for whether the intended target answer could be identified from the information provided. Alexander Strudwick Young, an assistant professor in human genetics at UCLA, said the problems would have posed a challenge for a graduate student working without supervisory feedback.
All problems in GeneBench-Pro are synthetically generated, with OpenAI controlling the complete data-generation process. That design allows the company to verify correctness by comparing model outputs against known targets and to accommodate reasonable variations in analytical choices while still accepting valid solutions.
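As a rough illustration of how grading against a known target can still tolerate reasonable analytical variation, the sketch below accepts a numeric answer that falls within a relative tolerance of the target, then computes a pass rate over a set of graded problems. OpenAI has not published its grader in the material described here, so the function names `is_accepted` and `pass_rate`, the 5% tolerance and the sample values are all assumptions made for illustration.

```python
import math


def is_accepted(model_answer: float, target: float, rel_tol: float = 0.05) -> bool:
    """Accept a numeric answer that is close to the known target.

    rel_tol is a hypothetical tolerance, not a value published by OpenAI.
    math.isclose measures the difference relative to the larger magnitude
    of the two values.
    """
    return math.isclose(model_answer, target, rel_tol=rel_tol)


def pass_rate(results: list[bool]) -> float:
    """Fraction of problems whose answers were accepted."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


# Made-up (model answer, known target) pairs for illustration only.
graded = [
    is_accepted(0.52, 0.50),  # within 5% of the target: accepted
    is_accepted(0.60, 0.50),  # too far from the target: rejected
    is_accepted(1.98, 2.00),  # within 5% of the target: accepted
    is_accepted(3.00, 3.00),  # exact match: accepted
]
print(f"Pass rate: {pass_rate(graded):.1%}")
```

A real grader would likely need more than a single numeric tolerance, for example when a problem admits several valid analytical routes, but the idea of checking against a target that is known by construction is the same.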
On performance, OpenAI reported that its GPT-5.6 Sol model achieved a pass rate of 28.7% at the highest reasoning level, rising to 31.5% when Pro mode was enabled. By comparison, an earlier generation, GPT-5, scored below 5% when the team began constructing the original GeneBench. At the lowest reasoning level in the new benchmark, GPT-5.6 Sol recorded a single-digit pass rate.
Competing models generally matched or fell short of the GPT models that were current when each was released. Pass rates published by OpenAI for selected competitors include: Opus 4.8 at 16.0%, Gemini 3.5 Flash at 8.1%, Gemini 3.1 Pro at 3.1%, Grok 4.3 at 1.5%, GLM 5.2 at 4.6% and DeepSeek V4 Pro at 2.4%.
OpenAI’s documentation for GeneBench-Pro includes an estimate of the human effort required to solve a typical problem: reviewers suggested that an expert would need roughly 20 to 40 hours. At an hourly rate of $200, that puts the human labor cost of a single problem at roughly $4,000 to $8,000, while current inference costs are described as only several dollars per problem.
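The labor-cost estimate is simple arithmetic on the reported figures, and a minimal sketch makes the range explicit. The hours and hourly rate come from the article; the function name `human_cost_range` is hypothetical. No inference cost is modeled, because the source gives only "several dollars" rather than a specific figure.

```python
HOURLY_RATE_USD = 200          # hourly rate used in the estimate
HOURS_LOW, HOURS_HIGH = 20, 40  # reviewers' range for expert effort per problem


def human_cost_range(hours_low: int, hours_high: int, hourly_rate: int) -> tuple[int, int]:
    """Return the (low, high) labor cost in dollars for one problem."""
    return hours_low * hourly_rate, hours_high * hourly_rate


low, high = human_cost_range(HOURS_LOW, HOURS_HIGH, HOURLY_RATE_USD)
print(f"Estimated expert cost per problem: ${low:,} to ${high:,}")
# prints "Estimated expert cost per problem: $4,000 to $8,000"
```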
To support independent evaluation, OpenAI is open-sourcing 10 representative questions on Hugging Face and will supply a 50-question subset to Artificial Analysis for external benchmarking. The remainder of the benchmark remains under OpenAI’s control.
Implications for research and industry
- GeneBench-Pro targets the decision-making steps of computational biology workflows rather than isolated subtasks, which makes it a test of end-to-end analytical judgment.
- Although top-performing models show progress compared with earlier generations, overall pass rates indicate substantial room for improvement before systems reliably match expert-level performance on these synthetic research tasks.
- Public and independent evaluation is limited to small subsets of the full benchmark (10 open-sourced questions and a 50-question subset for Artificial Analysis, out of 129 problems), leaving broader external validation constrained for now.