Abstract
Reliable factuality evaluation is critical for the iterative development of open-domain question answering (ODQA) systems, especially given the rise of large language models (LLMs) and their propensity for hallucination. However, state-of-the-art (SOTA) automatic metrics, which are mostly supervised, remain notably less reliable than humans. In this paper, we identify two key challenges behind this gap: (1) length distribution mismatch between lengthy LLM answers and the shorter training answers used by current metrics; and (2) reference incompleteness, where current metrics often misjudge valid system answers that are absent from the given references, a challenge worsened by the diversity of LLM outputs. To address these issues, we present a new ODQA factuality evaluation dataset called OBELLA (Open-Book Evaluation for Long-form LLM Answers). OBELLA narrows the length distribution mismatch by significantly increasing the candidate answer length to align with LLM outputs. Moreover, it introduces a neutral class for plausible yet under-supported candidate answers to differentiate reference incompleteness from outright incorrectness, thus enabling flexible reevaluation by consulting external knowledge for more references. Based on OBELLA, we propose a novel metric named OBELLAM (OBELLA Metric). OBELLAM integrates a cross-attention mechanism to enhance long-form candidate answer representations and employs a dynamic closed-open book evaluation strategy to tackle reference incompleteness. OBELLAM sets a new SOTA in aligning with human judgments across two ODQA evaluation benchmarks, marking a promising step toward more robust ODQA factuality evaluation.
| Original language | English |
|---|---|
| Title of host publication | SIGIR '25: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval |
| Publisher | Association for Computing Machinery |
| Pages | 1109-1119 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798400715921 |
| Publication status | Published - 13 Jul 2025 |
| Event | 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025 - Padova Congress Center, Padua, Italy. Duration: 13 Jul 2025 → 18 Jul 2025. https://sigir2025.dei.unipd.it/ |
Publication series
| Name | SIGIR 2025 - Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval |
|---|---|
Conference
| Conference | 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025 |
|---|---|
| Country/Territory | Italy |
| City | Padua |
| Period | 13/07/2025 → 18/07/2025 |
| Internet address | https://sigir2025.dei.unipd.it/ |
Bibliographical note
Publisher Copyright: © 2025 Copyright held by the owner/author(s).
Keywords
- Factuality Evaluation
- Large Language Models
- Open-Domain Question Answering
ASJC Scopus subject areas
- Information Systems
- Software