OpenAI Releases LifeSciBench, a 750-Task Benchmark That Grades AI Models on Real Life-Science Research Using Expert-Authored Rubrics

Most biology benchmarks ask small, fact-based questions with clean answers. Working scientists instead weigh incomplete evidence and make decisions. OpenAI has released LifeSciBench to address that gap directly.

Even the strongest model passes only about one-third of the tasks. The benchmark is far from saturated.

What is LifeSciBench

LifeSciBench contains 750 tasks written by experts. They span seven workflows and seven biological domains. Each task comes with context, supporting artifacts, and a grading rubric.

The seven workflows are evidence management, analysis, design and optimization, scientific reasoning, validation and implementation, translation, and scientific communication.

The seven domains range from genomics and medicinal chemistry to clinical and translational sciences.

Tasks are phrased the way a scientist would brief a colleague. They are free-response, not multiple choice. About 79% require multi-step reasoning or decision-making, averaging four steps each.

How the Benchmark Was Created

A group of 173 expert scientists wrote the tasks. Each held a Ph.D. and had experience in biotechnology or pharmaceuticals. Accepted tasks went through an average of six rounds of automated review and at least two expert reviews.

Many tasks ship with artifacts. The benchmark includes 1,062 embedded artifacts in total. About 53% of the tasks require at least one artifact. Types include sequences, figures, tables, PDFs, and chemical structures.

A separate group of reviewers independently verified quality. There were 453 reviewers, and 97% held doctoral degrees. Overall agreement exceeded 96% on relevance, conceptualization, support, and usability.

Rubric System

Rubrics are the key mechanism here. The benchmark contains 19,020 criteria in total. That is about 25 criteria per task.

Each criterion awards credit for one concrete, checkable item. Examples include a specific fact, a reasoning step, or a numerical answer within a tolerance. Grading runs against the rubric, not against a single reference string.
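To make the idea of a "numerical answer within a tolerance" concrete, here is a minimal sketch of how such a criterion might be checked. The field names (`id`, `pts`, `target`, `rel_tol`) and the helper `meets_numeric_criterion` are hypothetical and chosen for illustration; LifeSciBench's actual rubric format may differ.

```python
import math

def meets_numeric_criterion(criterion, answer_value):
    """Return True if a numeric answer is within the criterion's tolerance.

    `criterion` is a hypothetical dict holding a `target` value and a
    relative tolerance `rel_tol`; the real rubric schema may differ.
    """
    return math.isclose(answer_value, criterion["target"],
                        rel_tol=criterion["rel_tol"])

# Illustrative criterion: accept answers within a 5% relative tolerance of 12.0.
criterion = {"id": "c1", "pts": 3, "target": 12.0, "rel_tol": 0.05}

inside = meets_numeric_criterion(criterion, 12.4)   # close enough to 12.0
outside = meets_numeric_criterion(criterion, 13.0)  # too far from 12.0
print(inside, outside)
```

A criterion that is met would contribute its `pts` to the task's earned total; one that is missed contributes nothing.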

Two metrics summarize performance. The normalized rubric score divides the points awarded by the total points available. The task pass rate counts the tasks whose normalized score is 70% or higher.

This distinction matters for interpretation. A response can earn partial credit and still fail the task. The pass threshold is strict by design.

Here is how the scoring works in plain Python:

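def grade(rubric, awarded_ids):
    """Return (normalized score, passed) for one task's rubric."""
    total = sum(c["pts"] for c in rubric)   # maximum points available
    earned = sum(c["pts"] for c in rubric if c["id"] in awarded_ids)
    normalized = earned / total          # partial credit, between 0 and 1
    passed = normalized >= 0.70          # task-level success at the 70% threshold
    return normalized, passed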
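Per-task results like these roll up into the two benchmark-level numbers. The sketch below shows one plausible way to aggregate them; the function name `summarize` and the sample scores are made up for illustration, not taken from the benchmark.

```python
def summarize(task_scores, threshold=0.70):
    """Aggregate per-task normalized scores into (mean score, pass rate)."""
    if not task_scores:
        raise ValueError("need at least one task score")
    mean_score = sum(task_scores) / len(task_scores)
    # A task counts as passed only if its normalized score meets the threshold.
    pass_rate = sum(s >= threshold for s in task_scores) / len(task_scores)
    return mean_score, pass_rate

# Made-up normalized scores for four tasks, for illustration only.
scores = [0.90, 0.65, 0.60, 0.25]
mean_score, pass_rate = summarize(scores)
print(f"mean score = {mean_score:.2f}, pass rate = {pass_rate:.0%}")
```

In this toy case the mean score is 0.60 while only one task in four passes, which is the same pattern as a model averaging above 0.5 with a much lower pass rate.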

How the Models Performed

OpenAI tested five models in a single-turn setting. Each model saw the task context and artifacts once. Unrestricted internet browsing was allowed.

Model	Average score	Task pass rate
GPT-Rosalind	0.576	36.1%
GPT-5.5	0.519	25.7%
Gemini 3.1 Pro	0.515	23.6%
GPT-5.4	0.479	20.7%
Grok 4.3	0.399	13.0%

GPT-Rosalind, OpenAI's domain-specific model, led overall. It had the highest per-task score on 386 of the 750 tasks. It also raised the task pass rate over GPT-5.5, from 25.7% to 36.1%. Pass rates remain modest for all models.

Averages are not the whole story. Gemini 3.1 Pro was the sole top scorer on 214 tasks. Aggregate scores can mask task-specific strengths.
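A small sketch shows how per-task leads can be counted alongside averages. The model names and scores below are invented, and `sole_leaders` is a hypothetical helper, not part of any released tooling.

```python
from collections import Counter

def sole_leaders(scores_by_model):
    """Count, per model, the tasks where it alone has the top score.

    `scores_by_model` maps a model name to a list of per-task scores;
    all lists must have the same length. Tied tasks are not counted.
    """
    models = list(scores_by_model)
    n_tasks = len(scores_by_model[models[0]])
    wins = Counter()
    for t in range(n_tasks):
        best = max(scores_by_model[m][t] for m in models)
        top = [m for m in models if scores_by_model[m][t] == best]
        if len(top) == 1:
            wins[top[0]] += 1
    return wins

# Made-up scores for two hypothetical models on four tasks.
toy = {
    "model_a": [0.9, 0.8, 0.7, 0.1],
    "model_b": [0.5, 0.5, 0.5, 0.6],
}
wins = sole_leaders(toy)
means = {m: sum(s) / len(s) for m, s in toy.items()}
print(wins, means)
```

Here `model_a` has the higher mean and leads three tasks, yet `model_b` is still the only leader on the fourth, so a single average would hide where the weaker model is the better choice.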

When Models Succeed, and When They Fall Short

Models were strongest on structured judgment. GPT-Rosalind achieved an average score of 0.712 on Translation. Scientific Communication scored 0.718, but that category is small, so read the figure with caution.

Two workflows remained difficult. Design, Optimization, and Prediction was among the hardest, with GPT-Rosalind at 30.7%. Analysis followed at 30.3%.

Artifact use was a clear bottleneck. GPT-Rosalind's pass rate dropped from 45.1% on text-only tasks to 28.1% on artifact tasks. GPT-5.5 dropped similarly, from 29.9% to 21.9%.
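These subgroup figures line up with the overall pass rates in the table. As a rough check, assuming the artifact subgroup corresponds to the roughly 53% of tasks that require at least one artifact (an assumption, since the exact split is not stated here), a weighted average of the two subgroups lands within rounding of the reported totals:

```python
def blended_pass_rate(text_only, artifact, artifact_share=0.53):
    """Weighted average of two subgroup pass rates (all values in percent).

    `artifact_share` is the assumed fraction of tasks that use artifacts.
    """
    return (1 - artifact_share) * text_only + artifact_share * artifact

rosalind = blended_pass_rate(45.1, 28.1)
gpt55 = blended_pass_rate(29.9, 21.9)
print(f"GPT-Rosalind: {rosalind:.1f}%  (reported overall: 36.1%)")
print(f"GPT-5.5:      {gpt55:.1f}%  (reported overall: 25.7%)")
```

The agreement suggests the percentages above are task pass rates rather than average rubric scores.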

Exact outputs were the hardest. Success on sequence and structure criteria ranged from 46.9% down to 18.0% across models. GPT-Rosalind's advantage over GPT-5.5 on producing or constructing artifacts was only +0.001.

Models also stall partway through tasks. For GPT-Rosalind, 109 tasks earned at least 50% rubric credit but still had a pass rate below 20%.
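One way to surface such near-miss tasks is sketched below. It assumes each task is run several times for the same model, so that a task has both a mean rubric credit and its own pass rate; that layout, the helper `near_misses`, and the sample scores are all assumptions made for illustration.

```python
def near_misses(runs_by_task, min_credit=0.50, max_pass_rate=0.20,
                threshold=0.70):
    """Return task ids with substantial mean credit but a low pass rate.

    `runs_by_task` maps a task id to a list of normalized scores, one per
    repeated run of the same model (a hypothetical data layout).
    """
    flagged = []
    for task_id, runs in runs_by_task.items():
        mean_credit = sum(runs) / len(runs)
        pass_rate = sum(r >= threshold for r in runs) / len(runs)
        if mean_credit >= min_credit and pass_rate < max_pass_rate:
            flagged.append(task_id)
    return flagged

# Made-up scores: five runs per task.
toy_runs = {
    "t1": [0.62, 0.58, 0.66, 0.55, 0.60],  # steady partial credit, never passes
    "t2": [0.80, 0.75, 0.90, 0.72, 0.40],  # passes in most runs
    "t3": [0.10, 0.20, 0.15, 0.05, 0.10],  # earns little credit at all
}
flagged = near_misses(toy_runs)
print(flagged)
```

Only `t1` is flagged: the model reliably gets more than half the credit but never clears the 70% bar.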

Headroom remains large. No model passed 171 of the tasks (22.8%). On 261 tasks (34.8%), the best model's pass rate was below 20%.

Strengths and Weaknesses

Strengths:

Comprehensive coverage of seven workflows and seven biological domains
Expert-authored rubrics with 19,020 atomic, measurable criteria
Embedded artifacts: sequences, figures, tables, PDFs, and structures
Independent verification by 453 expert reviewers, 97% with doctoral degrees

Weaknesses:

Single-turn only; real research is iterative
Built by OpenAI, which also makes several of the models tested, including the top scorer
Public release may be limited by security and license restrictions
750 tasks cannot cover every scientific skill

Example: Rubric Grading on a Single Task

The example below shows how rubric-based grading plays out on a realistic task: each criterion the response satisfies adds points, and the normalized score is compared with the 70% pass threshold.

Task (Analysis – Spatial Transcriptomics): Using the attached Visium data from an FFPE cervical cancer slide, cluster the spots into 4 k-means clusters, identify the predominant cell type in each cluster, and recommend 1–2 highly targeted therapies (ADC, TCE, or CAR-T) based on differences in antigen expression between non-tumor regions.

The rubric for this task totals 76 points, so the 70% pass threshold sits at 53.2 points. A response below that threshold is marked as a fail.

A response can earn part of the credit and still fail the task. That gap is exactly what LifeSciBench measures.

Single-turn evaluation; unrestricted internet browsing was allowed. GPT-Rosalind led overall but had the top per-task score on only 386 of the 750 tasks; Gemini 3.1 Pro was the sole leader on 214.



Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.