Accepted In Digital Discovery · DOI 10.1039/D5DD00519A

OpenXRD shows when scientific context helps models, and when it hurts.

OpenXRD is a crystallography benchmark for researchers, students, and agents-for-science builders. It evaluates 74 language and multimodal models on 217 expert-curated X-ray diffraction questions spanning 81 subtasks, then measures how performance changes when models receive carefully written supporting passages.

Closed-book: The model answers from internal knowledge only, with no outside help.
Open-book: The model gets a short, question-specific passage that supports reasoning without leaking the answer.
Context assimilation: The benchmark measures whether that added context is actually turned into better scientific reasoning.
217 expert-reviewed questions
81 crystallography subtasks
74 evaluated LLMs and MLLMs
March 9, 2026: paper accepted
Main Finding
The sweet spot is not the biggest model.

Mid-sized models, especially in the 7B-70B range, benefit the most from expert-reviewed context. Frontier models often already know enough crystallography that extra passages create saturation or interference instead of improvement.

+11.52% LLaVA-v1.6-34B gains with expert-reviewed materials.
-5.78% Grok-4-fast drops despite extra expert context.
78.34% LLaVA-v1.6-34B open-book accuracy with expert materials.
84.33% Llama-3.1-405B open-book accuracy for comparison.
Read and cite the paper before using the dataset. The benchmark is evaluation-only and is released for scientific assessment, not for model training or distillation.
  • Accepted: March 9, 2026 in Digital Discovery
  • First published: March 16, 2026 by the Royal Society of Chemistry
  • Paper focus: Crystallography QA, open-book reasoning, and context assimilation
  • Project link: github.com/niaz60/OpenXRD

Study Design At A Glance

One accepted-paper figure anchors the page, but the story is in what happens when context quality and model capacity interact.

Figure (from the accepted paper): OpenXRD study design, comparing the parametric-knowledge baseline with context-augmented inference.

OpenXRD isolates the effect of guidance without retraining the model.

Every model sees the same benchmark questions. The difference is whether it answers from internal knowledge alone or with a curated supporting passage. That makes the benchmark useful for studying retrieval-style systems, scientific copilots, and agents that must decide whether added context is helping.

  • Controlled comparison: same questions, same answer format, different context conditions.
  • Quality matters: expert-reviewed passages outperform AI-generated passages even when token counts are matched.
  • Capacity matters: mid-sized models gain the most, while frontier models often plateau or degrade.
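The controlled comparison above reduces to simple arithmetic per model: subtract closed-book accuracy from open-book accuracy under the same questions. A minimal sketch, using hypothetical accuracy values rather than the paper's result files:

```python
# Hypothetical per-model accuracies for illustration only; real numbers
# come from running the benchmark under both context conditions.
results = {
    "mid-size-model": {"closed_book": 0.668, "open_book": 0.783},
    "frontier-model": {"closed_book": 0.870, "open_book": 0.833},
}

for model, acc in results.items():
    # Open-book gain: positive means the added passage helped,
    # negative means it saturated or interfered.
    gain = acc["open_book"] - acc["closed_book"]
    verdict = "context helps" if gain > 0 else "context interferes"
    print(f"{model}: {gain:+.2%} ({verdict})")
```

Because questions and answer format are held fixed, this delta isolates context assimilation rather than retrieval quality.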

What Surprised Us

These are the results that make the benchmark useful beyond a single crystallography domain paper.

Finding 01

Mid-sized models get the biggest boost.

The clearest gains appear in models that have enough reasoning capacity to use extra context, but not enough domain knowledge to already saturate.

7B-70B models are the main beneficiaries.
Finding 02

More context can make top models worse.

Several frontier systems lose accuracy when extra expert-reviewed passages repeat or reframe knowledge they already carry internally.

Interference is a real deployment risk.
Finding 03

Quality beats quantity.

Expert-reviewed supporting materials outperform AI-generated alternatives even when the token budget is held constant, pointing to pedagogy and relevance rather than prompt length alone.

Matched tokens, different outcomes.
Finding 04

Smaller systems can get surprisingly competitive.

When paired with strong support passages, a much smaller model can approach the performance of far larger systems at a fraction of the deployment cost.

LLaVA-v1.6-34B: 78.34% vs. Llama-3.1-405B: 84.33%.

Counterintuitive result: some of the strongest models degrade in open-book mode.

The accepted paper reports measurable drops for several frontier systems when expert-reviewed materials are added, not because the materials are wrong, but because redundant or stylistically mismatched context can disrupt an already strong internal representation.

GPT-5 −3.68%
GPT-4.5-preview −3.23%
Grok-4-fast −5.78%
Claude-3.5-Sonnet −1.84%
Coverage

Broad enough for teaching, sharp enough for scientific agents.

OpenXRD is not built around a single family of trick questions. The benchmark spans fundamentals, geometry, structure analysis, scattering, phase reasoning, and mathematically demanding subtasks that expose failure modes even in strong models.

81 distinct crystallography subtasks represented in the accepted benchmark.
Sample Task Bands

Foundations

Bragg diffraction · Powder method · Laue method · Crystal systems

Structure And Symmetry

Unit cell analysis · Coordination numbers · Crystal twinning · Space-group interpretation

Scattering And Reciprocal Space

Structure factors · Scattering factors · Reciprocal-space concepts · Interference conditions

Interpretation And Practice

Peak indexing · Diffraction effects · Microstructure effects · Experimental limitations
The site uses a cleaner taxonomy view instead of the paper word cloud because the public page needs benchmark coverage, not model-specific labeling noise.

Why This Matters

The page is meant to be readable for students entering the field and useful for researchers building scientific copilots.

Materials Students

See where models help, and where they still fail.

OpenXRD makes it easy to compare confident scientific recall against genuinely brittle reasoning, especially on structural analysis and mathematically intensive tasks.

CS Students

Study scientific discovery without conflating retrieval and reasoning.

Because the benchmark fixes the supporting passage, it separates retrieval quality from a model's ability to assimilate scientific evidence during inference.

Agents For Science

Learn when extra context helps, saturates, or interferes.

That matters when designing agents that search literature, summarize evidence, or decide whether to trust an added reference at all.
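One way an agent might act on the interference finding is a simple gate: attach retrieved context only when the model's closed-book answer looks uncertain. This is an illustrative heuristic, not part of OpenXRD; the `closed_book_confidence` signal and the 0.8 threshold are assumptions for the sketch.

```python
def should_attach_context(closed_book_confidence: float,
                          threshold: float = 0.8) -> bool:
    """Illustrative gate, not an OpenXRD API: skip extra context when the
    model already answers confidently closed-book, since redundant
    passages can interfere with strong internal knowledge."""
    return closed_book_confidence < threshold

# A confident frontier model keeps its closed-book answer...
print(should_attach_context(0.92))  # False
# ...while an uncertain mid-sized model gets the supporting passage.
print(should_attach_context(0.55))  # True
```

In practice the confidence signal could come from answer log-probabilities or self-consistency sampling; the benchmark's value here is supplying ground truth for tuning such a gate.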

Use OpenXRD In Three Steps

The repo stays usable, but it now sits below the research teaser rather than replacing it.

1

Install The Package

Clone the repo and bootstrap the local environment. The installer prefers uv and falls back to standard venv plus pip when needed.

git clone https://github.com/niaz60/OpenXRD.git
cd OpenXRD
./scripts/install.sh
2

Acknowledge The Dataset Policy

The public archive stays zipped. Extraction requires an explicit acknowledgment because the dataset is released for evaluation and benchmarking only.

./scripts/unzip_dataset.sh --acknowledge openxrd-check
3

Run A Short Evaluation

Set an API key for OpenAI or OpenRouter, choose a model explicitly, and run a small sample.

export OPENAI_API_KEY="your-openai-key"
openxrd-example --provider openai --model your-model --limit 3
Dataset Policy

Evaluation Only

  • Use the dataset for benchmarking and scientific evaluation.
  • Do not use it for training, fine-tuning, distillation, or alignment.
  • Cite the accepted paper when using the benchmark or reporting results.
  • The dataset is zipped to reduce casual scraping and accidental ingestion by aggregators, not to act as strong access control.
Citation

Read The Work. Cite The Work.

If OpenXRD informs a paper, benchmark comparison, scientific agent evaluation, classroom use, or dataset-based analysis, cite the accepted Digital Discovery article rather than an older preprint-only reference.

@article{Vosoughi_2025,
  title     = {OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering},
  author    = {Vosoughi, Ali and Shahnazari, Ayoub and Zhang, Zeliang and Xi, Yufeng and Hess, Griffin and Xu, Chenliang and Abdolrahim, Niaz},
  journal   = {Digital Discovery},
  publisher = {Royal Society of Chemistry (RSC)},
  year      = {2025},
  doi       = {10.1039/D5DD00519A},
  url       = {https://doi.org/10.1039/D5DD00519A}
}