Discriminative Topic Segmentation of Text and Speech
the Sim algorithm can be viewed as the supervised
counterpart of cosine-distance based algorithms such as
TextTiling. Hence, it is significant that SV yields a
substantial improvement over both HTMM and Sim.
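For readers unfamiliar with the cosine-distance family mentioned above, the following is a minimal sketch of the unsupervised idea: score each candidate boundary by the cosine similarity of the word-count vectors of the windows on either side. It is a simplified illustration, not Hearst's full TextTiling procedure and not the Sim or SV algorithms of this paper; the function names and the fixed `window_size` parameter are assumptions introduced here.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_window_similarities(tokens, window_size):
    """Pair each candidate boundary i with the similarity of its two windows.

    The left window is tokens[i - window_size:i] and the right window is
    tokens[i:i + window_size]; only positions with two full windows are scored.
    """
    scores = []
    for i in range(window_size, len(tokens) - window_size + 1):
        left = Counter(tokens[i - window_size:i])
        right = Counter(tokens[i:i + window_size])
        scores.append((i, cosine_similarity(left, right)))
    return scores


def lowest_similarity_boundary(tokens, window_size):
    """Return the position whose adjacent windows are least similar.

    Assumes len(tokens) >= 2 * window_size so at least one position is scored.
    """
    scores = adjacent_window_similarities(tokens, window_size)
    return min(scores, key=lambda pair: pair[1])[0]
```

A low score marks a likely topic change. A practical segmenter would also smooth the score sequence and choose several boundaries rather than a single minimum.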
6 Conclusion
In this paper, we presented two new topic segmentation
algorithms for speech content, both based on a general measure
of topical similarity derived from word co-occurrence
statistics. The first algorithm compares adjacent observation
windows according to a similarity measure for words trained on
co-occurrence statistics. The second compares compact geometric
descriptions of the adjacent windows in topic similarity
feature space. We have demonstrated both algorithms to be
empirically effective. The support vector based algorithm
significantly and consistently surpasses the segmentation
quality of a hidden topic Markov model (HTMM). We have also
demonstrated that, when the use of a speech recognizer
introduces uncertainty, topic segmentation algorithms can be
improved by using recognition hypotheses other than the one
receiving the highest likelihood.
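As an illustration of the last point, one simple way to use more than the top recognition hypothesis is to replace raw word counts with expected counts over an n-best list. The sketch below is offered under stated assumptions: the n-best representation as `(tokens, log_score)` pairs, the softmax normalization of scores into posteriors, and both function names are hypothetical choices made here, not a description of the system evaluated in this paper.

```python
import math
from collections import defaultdict


def hypothesis_posteriors(log_scores):
    """Normalize log-scores into posterior weights with a stable softmax."""
    top = max(log_scores)
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def expected_word_counts(nbest):
    """Expected count of each word over (tokens, log_score) hypotheses.

    Each occurrence of a word contributes the posterior weight of the
    hypothesis it appears in, so words shared by many likely hypotheses
    receive counts close to their 1-best counts.
    """
    posteriors = hypothesis_posteriors([score for _, score in nbest])
    expected = defaultdict(float)
    for (tokens, _), posterior in zip(nbest, posteriors):
        for word in tokens:
            expected[word] += posterior
    return dict(expected)
```

The resulting fractional counts can stand in for integer counts wherever a window is summarized by word statistics, so a word that appears only in a lower-ranked hypothesis still contributes in proportion to that hypothesis's weight.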
References
Christopher Alberti, Michiel Bacchiani, Ari Bezman, Ciprian
Chelba, Anastassia Drofa, Hank Liao, Pedro Moreno, Ted
Power, Arnaud Sahuguet, Maria Shugrina, and Olivier Siohan.
An audio indexing system for election video material.
In ICASSP, Taipei, Taiwan, 2009.
Doug Beeferman, Adam Berger, and John Lafferty. Statistical
models for text segmentation. Machine Learning,
34(1-3):177–210, 1999.
David M. Blei and Pedro J. Moreno. Topic segmentation with an
aspect hidden Markov model. In SIGIR, pages 343–348. ACM
Press, 2001.
David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John
Lafferty. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3, 2003.
Kenneth Ward Church and Patrick Hanks. Word association
norms, mutual information, and lexicography. Computational
Linguistics, 16(1):22–29, 1990.
Corinna Cortes and Vladimir Vapnik. Support-vector networks.
Machine Learning, 20(3):273–297, 1995.
Amit Gruber, Michal Rosen-Zvi, and Yair Weiss. Hidden topic
Markov models. In AISTATS, San Juan, Puerto Rico, 2007.
Timothy J. Hazen and Anna Margolis. Discriminative feature
weighting using MCE training for topic identification of
spoken audio recordings. In ICASSP, pages 4965–4968,
Las Vegas, Nevada, 2008.
Marti A. Hearst. TextTiling: segmenting text into multi-paragraph
subtopic passages. Computational Linguistics, 23(1):33–64,
1997.
Xiang Ji and Hongyuan Zha. Domain-independent text
segmentation using anisotropic diffusion and dynamic programming.
In SIGIR, pages 322–329, 2003.
Junbo Kong and David Graff. TDT4 Multilingual Broadcast News
Speech Corpus.
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S11,
2005.
Hideki Kozima. Text segmentation based on similarity between
words. In ACL, pages 286–288, Morristown, NJ, USA, 1993.
ACL.
Ian R. Lane, Tatsuya Kawahara, Tomoko Matsui, and Satoshi
Nakamura. Dialogue speech recognition by combining
hierarchical topic classification and language model
switching. IEICE - Transactions on Information and Systems,
E88-D(3):446–454, 2005.
Igor Malioutov and Regina Barzilay. Minimum cut model for
spoken lecture segmentation. In COLING/ACL, pages 25–32,
Sydney, Australia, July 2006.
Mehryar Mohri, Pedro Moreno, and Eugene Weinstein. A new
quality measure for topic segmentation of text and speech. In
Interspeech, Brighton, UK, 2009.
Mehryar Mohri. Finite-state transducers in language and speech
processing. Computational Linguistics, 23(2):269–311, 1997.
Mehryar Mohri. Learning from uncertain data. In COLT, pages
656–670, 2003.
Martin F. Porter. An algorithm for suffix stripping. Program,
14(3):130–137, 1980.
Jeffrey C. Reynar. Statistical models for topic segmentation. In
ACL, pages 357–364, College Park, Maryland, 1999.
G. Riccardi, A. Gorin, A. Ljolje, and M. Riley. A spoken language
system for automated call routing. In ICASSP,
pages 1143–1146, Munich, Germany, 1997.
Gerard Salton and Christopher Buckley. Term-weighting
approaches in automatic text retrieval. Information Processing
& Management, 24(5):513–523, 1988.
Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J.
Smola, and Robert C. Williamson. Estimating the support
of a high-dimensional distribution. Neural Computation,
13(7):1443–1471, 2001.
Mark Steyvers and Tom Griffiths. Probabilistic topic models.
In Thomas K. Landauer, Danielle S. McNamara, Simon Dennis,
and Walter Kintsch, editors, Handbook of Latent Semantic
Analysis, pages 427–448. Routledge, 2007.
David M. J. Tax and Robert P. W. Duin. Support vector domain
description. Pattern Recognition Letters,
20(11-13):1191–1199, 1999.
Masao Utiyama and Hitoshi Isahara. A statistical model for
domain-independent text segmentation. In ACL,
pages 491–498, 2001.
Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.
Eugene Weinstein. Search Problems for Speech and Audio
Sequences. PhD dissertation, New York University, September
2009.
J.P. Yamron, I. Carp, L. Gillick, S. Lowe, and P. van Mulbregt.
Event tracking and text segmentation via hidden Markov
models. In ASRU, 1998.