Crowdsourcing a training dataset of question-and-answer pairs for AI-enabled health information tools on sexually transmitted infections: protocol for a cross-sectional exploratory survey study

  • Elizabeth Oseku*
  • Petra Kerubo Mariaria
  • Henry Semakula
  • Clare Allelua Kahuma
  • Martin Balaba
  • Agnes Bwanika Naggirinya
  • Rachel Lisa King
  • Rosalind Parkes-Ratanshi
  • *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Background:
Sexually transmitted infections are a significant public health concern, particularly in sub-Saharan Africa, where their prevalence remains high. Promoting awareness and reducing stigma are essential strategies for addressing this challenge, but those affected often have limited access to accurate and culturally appropriate health information. Therefore, innovative solutions are essential to enhance sexual health literacy and encourage informed health-seeking behaviors. Artificial intelligence (AI)–enabled tools, such as chatbots, have emerged as promising avenues for delivering accurate and accessible health information. However, their potential is constrained by the lack of contextualized datasets, which are crucial for ensuring their effectiveness and relevance to diverse populations.

Objective:
This study aims to develop an open access, contextualized dataset of question-and-answer pairs on sexual health and sexually transmitted infections to support the development and training of digital and AI-enabled health information tools.

Methods:
Using a crowdsourcing approach, questions are being collected from participants aged ≥15 years via online platforms, paper-based submissions, and in-person interactions at public events across sub-Saharan Africa. Each question will be anonymized and reviewed by medical professionals who will provide accurate, evidence-based answers. The dataset will then undergo processing, including cleaning and tagging for AI training, ensuring adherence to findability, accessibility, interoperability, and reusability principles. The final dataset will be published as open access.
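The cleaning and tagging step described above can be illustrated with a minimal sketch. This is not the study's actual pipeline: the `QAPair` fields, the `clean_text` and `to_jsonl_record` helpers, and the JSON Lines output format are all assumptions made for illustration, and the sample question and answer in the usage note are placeholders rather than study data.

```python
import json
import re
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class QAPair:
    """Hypothetical record layout for one anonymized question-and-answer pair."""
    question: str
    answer: str
    language: str = "en"
    tags: List[str] = field(default_factory=list)


def clean_text(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def to_jsonl_record(pair: QAPair) -> str:
    """Return one JSON line with cleaned text and lowercased, de-duplicated tags."""
    cleaned = QAPair(
        question=clean_text(pair.question),
        answer=clean_text(pair.answer),
        language=pair.language,
        tags=sorted({t.lower() for t in pair.tags}),
    )
    return json.dumps(asdict(cleaned), ensure_ascii=False, sort_keys=True)
```

For example, `to_jsonl_record(QAPair("  What is  an STI? ", "An infection\npassed on through sex.", tags=["STI", "sti", "Basics"]))` yields a single line whose question and answer have normalized spacing and whose tags are reduced to a sorted, lowercase set. Writing one self-describing record per line is one simple way to keep such a dataset easy to find, parse, and reuse.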

Results:
Data collection began on June 12, 2024, and is ongoing. The data collection process was piloted in Kigali, Rwanda, where 132 questions were collected. As of August 2025, the study had collected over 5620 question-and-answer pairs. The collected data are being processed in parallel with ongoing collection, in collaboration with health workers who provide evidence-based answers to the questions and contribute new questions based on their clinical experience. Data cleaning and processing will enhance the utility of the data for AI applications.

Conclusions:
The final dataset will be published as open access in 2025, contributing to the development of AI-driven health tools and promoting public health literacy.
Original language: English
Article number: e70005
Number of pages: 13
Journal: JMIR Research Protocols
Volume: 14
Publication status: Published - 09 Sept 2025

Keywords

  • Sexually transmitted infections
  • artificial intelligence
  • AI
  • health information
  • dataset
  • crowdsourcing
