LLM/MLLM Crystallography Benchmark & Enhancement Framework
🤖 AI Evaluation 📊 217-Question Benchmark 🔬 Crystallography QA 📚 Open/Closed-Book Testing

Benchmarking AI Models on Crystallography Knowledge

OpenXRD is a comprehensive evaluation framework for testing Large Language Models (LLMs) and Multimodal LLMs (MLLMs) on X-ray diffraction and crystallography questions. Our benchmark includes 217 expert-reviewed questions and evaluates model performance in both open-book and closed-book settings, using AI-generated and expert-reviewed supporting materials.

📊 217-Question Benchmark

Expert-curated multiple-choice questions covering fundamental to advanced crystallography concepts, from basic definitions to complex structural analysis.
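
For concreteness, a single item might be modeled as a multiple-choice record along the lines of the sketch below; the field names and the sample question are illustrative, not the actual dataset schema.

```python
from dataclasses import dataclass

@dataclass
class CrystallographyQuestion:
    """One multiple-choice item (hypothetical schema, for illustration only)."""
    question: str              # question stem
    choices: dict[str, str]    # answer options keyed by letter
    answer: str                # letter of the correct option
    topic: str = "general"     # e.g. "Bravais lattices", "diffraction geometry"

# Illustrative item, not drawn from the OpenXRD dataset.
example = CrystallographyQuestion(
    question="Which lattice has a = b = c and alpha = beta = gamma = 90 degrees?",
    choices={"A": "Triclinic", "B": "Monoclinic", "C": "Cubic", "D": "Hexagonal"},
    answer="C",
    topic="Bravais lattices",
)
```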

📖 Open vs Closed-Book Evaluation

Compare model performance with and without supporting materials to understand how external knowledge affects AI reasoning in specialized domains.
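
A minimal sketch of the two modes, assuming the item structure above and a caller-supplied `query_model` function that wraps whichever LLM API is under test (the prompt wording and helpers are hypothetical, not the framework's actual code):

```python
def build_prompt(item, supporting_material=None):
    """Format a question; prepend supporting material only in open-book mode."""
    options = "\n".join(f"{key}. {text}" for key, text in item.choices.items())
    context = f"Reference material:\n{supporting_material}\n\n" if supporting_material else ""
    return (f"{context}Question: {item.question}\n{options}\n"
            "Reply with the letter of the correct option.")

def evaluate(items, query_model, materials=None):
    """Return accuracy; closed-book when materials is None, open-book otherwise."""
    correct = 0
    for item in items:
        material = materials.get(item.question) if materials else None
        reply = query_model(build_prompt(item, material))
        predicted = reply.strip()[:1].upper()   # take the leading option letter
        correct += predicted == item.answer
    return correct / len(items)
```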

🤖 AI-Generated Supporting Materials

Contextual explanations generated by GPT-4.5 and refined through expert review, designed to enhance model understanding without revealing answers.
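
As a rough illustration of this step (not the project's actual generation pipeline), a draft explanation could be requested through the OpenAI chat API and then passed to a crystallographer for review; the model identifier and prompt below are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_supporting_material(topic: str) -> str:
    """Draft a short explanatory note on a topic; a domain expert reviews it afterwards."""
    response = client.chat.completions.create(
        model="gpt-4.5-preview",  # placeholder id; substitute the model actually used
        messages=[
            {"role": "system",
             "content": "You write concise, textbook-style crystallography notes."},
            {"role": "user",
             "content": (f"Explain the key concepts behind: {topic}. "
                         "Do not reference any specific exam question or its answer.")},
        ],
    )
    return response.choices[0].message.content
```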

⚖️ Comprehensive Model Comparison

Evaluate performance across GPT-4 variants, O-series models, and LLaVA-based vision-language models on crystallographic reasoning tasks.
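
The comparison itself can be as simple as running the same items through every backend in both modes. The sketch below reuses the hypothetical `evaluate` helper and `example` item from above, and stands in for real API wrappers with a random-guess placeholder; the model names are illustrative, not the exact evaluated checkpoints.

```python
import random

def random_guesser(prompt: str) -> str:
    """Stand-in backend; replace with a wrapper around each model's real API."""
    return random.choice("ABCD")

MODELS = {
    "gpt-4-variant": random_guesser,
    "o-series-model": random_guesser,
    "llava-13b": random_guesser,
}

items = [example]
materials = {example.question: "A cubic lattice has three equal, mutually perpendicular axes."}

for name, query_fn in MODELS.items():
    closed = evaluate(items, query_fn)
    open_book = evaluate(items, query_fn, materials=materials)
    print(f"{name}: closed-book {closed:.1%} -> open-book {open_book:.1%} "
          f"({open_book - closed:+.1%})")
```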

Key Findings

Our research reveals an inverted-U relationship in model improvement: mid-capacity models (7B-34B parameters) benefit most from external knowledge, gaining up to +11.5% accuracy with expert-reviewed materials. Larger models show minimal improvement, while smaller models have limited capacity to utilize additional information effectively. This suggests cost-effective deployment strategies where mid-sized models with expert-reviewed materials can approach larger model performance on specialized scientific tasks.

Benchmark Statistics

217 Expert Questions
14 Models Tested
+11.5% Max Improvement
2 Evaluation Modes