Two-part Segmentation of Text Documents

Deepak Padmanabhan, Karthik Visweswariah, Nirmalie Wiratunga, Sadiq Sani

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

We consider the problem of segmenting text documents that have a
two-part structure such as a problem part and a solution part. Documents
of this genre include incident reports that typically describe
events relating to a problem followed by those pertaining
to the solution that was tried. Segmenting such documents
into their two component parts would render them usable in knowledge
reuse frameworks such as Case-Based Reasoning. This segmentation
problem presents a hard case for traditional text segmentation
due to the lexical inter-relatedness of the segments. We develop
a two-part segmentation technique that can harness a corpus
of similar documents to model the behavior of the two segments
and their inter-relatedness using language models and translation
models respectively. In particular, we use separate language models
for the problem and solution segment types, whereas the inter-relatedness
between segment types is modeled using an IBM Model
1 translation model. We model documents as being generated starting
from the problem part, which comprises words sampled from
the problem language model, followed by the solution part whose
words are sampled either from the solution language model or from
a translation model conditioned on the words already chosen in the
problem part. We show, through an extensive set of experiments on
real-world data, that our approach outperforms the state-of-the-art
text segmentation algorithms in the accuracy of segmentation, and
that such improved accuracy translates well to improved usability
in Case-Based Reasoning systems. We also analyze the robustness
of our technique to varying amounts and types of noise and empirically
illustrate that our technique is quite noise tolerant and
degrades gracefully with increasing amounts of noise.
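
The generative story in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the unigram language models, the fixed mixing weight `lam`, the probability floor, the averaging of translation probabilities over problem words (in the spirit of IBM Model 1), and every name and toy probability below are assumptions made only for illustration. The sketch scores each candidate boundary and keeps the most likely one.

```python
import math


def best_split(tokens, p_prob, s_prob, t_prob, lam=0.5, floor=1e-9):
    """Pick the boundary k that maximizes the log-likelihood of
    tokens[:k] as the problem part and tokens[k:] as the solution part.

    p_prob, s_prob: hypothetical unigram models (word -> probability).
    t_prob: hypothetical translation table, (solution_word, problem_word) -> probability.
    Returns (None, -inf) when there are fewer than two tokens.
    """
    best_k, best_ll = None, -math.inf
    for k in range(1, len(tokens)):
        problem, solution = tokens[:k], tokens[k:]
        # Problem words are sampled from the problem language model.
        ll = sum(math.log(p_prob.get(w, floor)) for w in problem)
        for w in solution:
            # Translation term: average over the problem words already chosen.
            trans = sum(t_prob.get((w, p), 0.0) for p in problem) / len(problem)
            # Solution words come from the solution model or the translation model.
            ll += math.log(lam * s_prob.get(w, floor) + (1.0 - lam) * trans)
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k, best_ll


# Toy models (made-up numbers, not estimated from any corpus).
p_prob = {"printer": 0.45, "jammed": 0.45, "replaced": 0.05, "roller": 0.05}
s_prob = {"printer": 0.05, "jammed": 0.05, "replaced": 0.45, "roller": 0.45}
t_prob = {("roller", "jammed"): 0.8, ("replaced", "printer"): 0.5}

tokens = ["printer", "jammed", "replaced", "roller"]
k, ll = best_split(tokens, p_prob, s_prob, t_prob)
print(tokens[:k], tokens[k:])
```

On this toy input the boundary falls after "jammed". In a realistic setting the three models would be estimated from a corpus of similar documents, as the abstract describes, rather than written by hand.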

Language: English
Title of host publication: 21st ACM International Conference on Information and Knowledge Management, CIKM'12, Maui, HI, USA, October 29 - November 02, 2012
Pages: 793-802
Number of pages: 10
Publication status: Published - 2012
Event: CIKM 2012 - Maui, Hawaii, United States
Duration: 29 Oct 2012 - 02 Nov 2012

Conference

Conference: CIKM 2012
Country: United States
City: Maui
Period: 29/10/2012 - 02/11/2012

Fingerprint

Case based reasoning
Experiments

Cite this

Padmanabhan, D., Visweswariah, K., Wiratunga, N., & Sani, S. (2012). Two-part Segmentation of Text Documents. In 21st ACM International Conference on Information and Knowledge Management, CIKM'12, Maui, HI, USA, October 29 - November 02, 2012 (pp. 793-802).
