OBELLA: open the book for evaluating long-form large language model answers in open-domain question answering

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Abstract

Reliable factuality evaluation is critical for the iterative development of open-domain question answering (ODQA) systems, especially given the rise of large language models (LLMs) and their propensity for hallucination. However, state-of-the-art (SOTA) automatic metrics, which are mostly supervised, remain notably less reliable than humans. In this paper, we find two key challenges behind this gap: (1) length distribution mismatch between lengthy LLM answers and shorter training answers used by current metrics; and (2) reference incompleteness, where current metrics often misjudge valid system answers absent from given references, a challenge worsened by the diversity of LLM outputs. To address these issues, we present a new ODQA factuality evaluation dataset called OBELLA (Open-Book Evaluation for Long-form LLM Answers). OBELLA narrows the length distribution mismatch by significantly increasing the candidate answer length to align with LLM outputs. Moreover, it introduces a neutral class for plausible yet under-supported candidate answers to differentiate reference incompleteness from outright incorrectness, thus enabling flexible reevaluation by consulting external knowledge for more references. Based on OBELLA, we propose a novel metric named OBELLAM (OBELLA Metric). OBELLAM integrates a cross-attention mechanism to enhance long-form candidate answer representations and employs a dynamic closed-open book evaluation strategy to tackle reference incompleteness. Our OBELLAM sets a new SOTA in aligning with human judgments across two ODQA evaluation benchmarks, marking a promising step toward more robust ODQA factuality evaluation.

Original language: English
Title of host publication: SIGIR '25: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery
Pages: 1109-1119
Number of pages: 11
ISBN (Electronic): 9798400715921
Publication status: Published - 13 Jul 2025
Event: 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025 - Padova Congress center, Padua, Italy
Duration: 13 Jul 2025 to 18 Jul 2025
https://sigir2025.dei.unipd.it/

Publication series

Name: SIGIR 2025 - Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025
Country/Territory: Italy
City: Padua
Period: 13/07/2025 to 18/07/2025

Bibliographical note

Publisher Copyright:
© 2025 Copyright held by the owner/author(s).

Keywords

  • Factuality Evaluation
  • Large Language Models
  • Open-Domain Question Answering

ASJC Scopus subject areas

  • Information Systems
  • Software
