OpenXRD is a comprehensive evaluation framework for testing Large Language Models (LLMs) and multimodal LLMs on X-ray diffraction and crystallography questions. Our benchmark includes 217 expert-reviewed questions and evaluates model performance in both closed-book and open-book settings, with AI-generated and expert-reviewed supporting materials supplied in the open-book condition.
Expert-curated multiple-choice questions covering fundamental to advanced crystallography concepts, from basic definitions to complex structural analysis.
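As a concrete illustration, a single benchmark item could be represented as below. This is a minimal sketch; the field names are hypothetical and not OpenXRD's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one benchmark item;
# the real OpenXRD data format may differ.
@dataclass
class XRDQuestion:
    question: str        # the question stem
    choices: list[str]   # multiple-choice options
    answer_index: int    # index of the correct option
    support: str = ""    # supporting text, used only in the open-book setting

example = XRDQuestion(
    question="Which Bravais lattice describes a face-centered cubic crystal?",
    choices=["P", "I", "F", "C"],
    answer_index=2,
    support="Bravais lattices classify 3D lattices by crystal system and centering type.",
)
```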
Compare model performance with and without supporting materials to understand how external knowledge affects AI reasoning in specialized domains.
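At the prompt level, the two settings might differ only in whether background text is prepended. The helper below is an illustrative sketch, not OpenXRD's actual code:

```python
def build_prompt(question: str, choices: list[str],
                 support: str = "", open_book: bool = False) -> str:
    """Assemble one multiple-choice prompt; prepend background text in open-book mode."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    context = f"Background:\n{support}\n\n" if open_book and support else ""
    return (f"{context}Question: {question}\n{options}\n"
            "Answer with the letter of the correct option.")
```

Scoring the same item twice, once per setting, means any accuracy difference is attributable to the supporting material alone.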
GPT-4.5-generated contextual explanations, refined through expert review to enhance model understanding without revealing answers.
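The generation step could be as simple as the sketch below, using the OpenAI Python SDK; the prompt wording and model identifier are assumptions, not OpenXRD's actual pipeline. Expert review is then applied to the returned text.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_support(question: str) -> str:
    """Draft background material for one question; an expert edits the result afterward."""
    resp = client.chat.completions.create(
        model="gpt-4.5-preview",  # illustrative model id
        messages=[
            {"role": "system",
             "content": "Write brief background on the crystallography concepts "
                        "this question touches on. Do not state or hint at the answer."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```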
Evaluate performance across GPT-4 variants, O-series models, and LLaVA-based vision-language models on crystallographic reasoning tasks.
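A simplified accuracy loop over several models might look like the following. The `ask_model` callable is a placeholder for whichever API client is used, and model names are taken from the families listed above:

```python
def evaluate(models: list[str],
             questions: list[tuple[str, str]],
             ask_model) -> dict[str, float]:
    """Score each model's multiple-choice accuracy in one setting.

    `ask_model(name, prompt)` stands in for a real API call and should
    return a single letter such as "A"; `questions` holds
    (prompt, correct_letter) pairs already built for this setting.
    """
    scores = {}
    for name in models:
        correct = 0
        for prompt, answer in questions:
            reply = ask_model(name, prompt).strip().upper()
            correct += reply.startswith(answer)  # bool counts as 0 or 1
        scores[name] = correct / len(questions)
    return scores
```

Running this once per setting (closed-book and open-book) yields the paired accuracies that the comparison is based on.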
Our research reveals an inverted-U relationship in model improvement: mid-capacity models (7B–34B parameters) benefit most from external knowledge, gaining up to +11.5% accuracy with expert-reviewed materials. Larger models show minimal improvement, while smaller models have limited capacity to use the additional information effectively. This suggests a cost-effective deployment strategy: mid-sized models paired with expert-reviewed materials can approach the performance of larger models on specialized scientific tasks.