OpenAI launches GeneBench-Pro to evaluate AI judgment in computational biology

OpenAI unveiled GeneBench-Pro, a synthetic benchmark consisting of 129 problems spanning genomics, quantitative biology and translational medicine. The suite asks models to examine datasets, select analytic strategies and deliver final answers. External domain experts reviewed a subset of the problems for realism and answer identifiability. Leading OpenAI models outperformed earlier versions and most competitors, but pass rates remain modest and transparency is currently limited to a small open-source sample.

Key Points

GeneBench-Pro contains 129 synthetic problems across genomics, quantitative biology and translational medicine that require models to choose analytic approaches and produce final answers.
OpenAI had 82 of the problems reviewed by external domain experts for realism and answer identifiability; reviewers estimated that an expert would need 20 to 40 hours of work per problem.
GPT-5.6 Sol recorded a 28.7% pass rate at the highest reasoning level (31.5% with Pro mode); competitor models scored lower, with Opus 4.8 at 16.0% and several others in single digits.

OpenAI on Tuesday introduced GeneBench-Pro, a benchmark designed to assess whether artificial intelligence systems can make the judgment calls typical of computational biology research. The collection comprises 129 distinct problems drawn from domains that include genomics, quantitative biology and translational medicine.

Each problem supplies a model with three elements: a dataset, contextual information about the experimental setup, and a target question. Models are expected not only to analyze the data but also to decide on an appropriate analytical approach and to produce a final answer based on that approach.

OpenAI routed 82 of the 129 problems to outside domain experts - a group that included graduate students, postdoctoral researchers, industry scientists and university professors - to obtain independent assessments. Reviewers evaluated each problem for realism and for whether the intended target answer could be identified from the information provided. Alexander Strudwick Young, an assistant professor in human genetics at UCLA, said the problems would have posed a challenge for a graduate student working without supervisory feedback.

All problems in GeneBench-Pro are synthetically generated, with OpenAI controlling the complete data-generation process. That design allows the company to verify correctness by comparing model outputs against known targets and to accommodate reasonable variations in analytical choices while still accepting valid solutions.
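A rough sketch of how grading against a known target could work is shown below. It is a hypothetical illustration, not OpenAI's grader: the function name `passes`, the 5% relative tolerance and the example values are all assumptions made here. The point is only that when the data generator fixes the true answer in advance, a checker can accept results reached through different reasonable analytic routes as long as they land close enough to that answer.

```python
import math


def passes(model_answer: float, target: float, rel_tol: float = 0.05) -> bool:
    """Accept an answer that falls within a relative tolerance of the known target.

    The tolerance is an assumed value for illustration only.
    """
    return math.isclose(model_answer, target, rel_tol=rel_tol)


# Hypothetical example: the generator planted a true effect size of 0.40,
# and three different analysis strategies produced three estimates.
target = 0.40
answers = {"method_a": 0.41, "method_b": 0.39, "method_c": 0.55}

results = {name: passes(value, target) for name, value in answers.items()}
pass_rate = sum(results.values()) / len(results)

print(results)
print(f"pass rate: {pass_rate:.1%}")
```

In this toy case the first two estimates pass despite differing from each other, while the third is rejected as too far from the planted value.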

On performance, OpenAI reported that its GPT-5.6 Sol model achieved a pass rate of 28.7% at the highest reasoning level, rising to 31.5% when Pro mode was enabled. By comparison, an earlier generation, GPT-5, scored below 5% when the team began constructing the original GeneBench. At the lowest reasoning level in the new benchmark, GPT-5.6 Sol recorded a single-digit pass rate.

According to OpenAI, competing models generally matched or trailed the GPT models that were available when each competitor was released. Performance figures published by OpenAI for selected competitors include: Opus 4.8 at 16.0%, Gemini 3.5 Flash at 8.1%, Gemini 3.1 Pro at 3.1%, Grok 4.3 at 1.5%, GLM 5.2 at 4.6% and DeepSeek V4 Pro at 2.4%.

OpenAI’s documentation for GeneBench-Pro includes an estimate of the human effort required to solve a typical problem: reviewers suggested that an expert would need roughly 20 to 40 hours. At an hourly rate of $200, that puts the human labor cost for a single problem at roughly $4,000 to $8,000, while current inference costs are described as only several dollars per problem.
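The arithmetic behind that labor-cost estimate can be checked in a few lines. The 20-to-40-hour range and the $200 hourly rate come from the figures above; the function name is illustrative only, and the inference cost is left out because it is given only as "several dollars."

```python
def human_cost_range(hours_low: int, hours_high: int, hourly_rate: int) -> tuple[int, int]:
    """Return the (low, high) expert labor cost in dollars for one problem."""
    return hours_low * hourly_rate, hours_high * hourly_rate


low, high = human_cost_range(20, 40, 200)
print(f"Estimated expert labor per problem: ${low:,} to ${high:,}")
```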

To support independent evaluation, OpenAI is open-sourcing 10 representative questions on Hugging Face and will supply a 50-question subset to Artificial Analysis for external benchmarking. The remainder of the benchmark remains under OpenAI’s control.
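For a sense of scale, the shares of the 129-problem benchmark covered by each subset can be worked out directly. The counts below are the ones reported in this article; the helper name `share` is an assumption, and the article does not say whether the 10 open-sourced questions overlap with the 50 sent to Artificial Analysis, so the two figures are computed separately rather than added together.

```python
TOTAL_PROBLEMS = 129

# Counts as reported in the article.
subsets = {
    "expert-reviewed": 82,
    "open-sourced on Hugging Face": 10,
    "shared with Artificial Analysis": 50,
}


def share(count: int, total: int = TOTAL_PROBLEMS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in subsets.items():
    print(f"{label}: {count} of {TOTAL_PROBLEMS} ({share(count)}%)")
```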

Implications for research and industry

GeneBench-Pro targets the decision-making steps of computational biology workflows rather than isolated subtasks, which makes it a test of end-to-end analytical judgment.
Although top-performing models show progress compared with earlier generations, overall pass rates indicate substantial room for improvement before systems reliably match expert-level performance on these synthetic research tasks.
Open-source and independent evaluation provisions are limited to small subsets of the full benchmark, leaving broader external validation constrained for now.

Risks

Pass rates for even the top model remain modest, so it is uncertain whether current AI systems can make reliable, expert-level judgments in computational biology, a question that matters for biotech and pharmaceutical research workflows.
Because each problem is synthetically generated and OpenAI controls data generation and grading, it is unclear from GeneBench-Pro alone how models would perform on uncurated, real-world datasets - a limitation for translational applications.
Only a small portion of the benchmark has been open-sourced or shared with an independent evaluator, which constrains external verification of results and transparency for researchers and industry stakeholders.
