PUMA publications for /user/benz/language
https://puma.uni-kassel.de/user/benz/language
PUMA RSS feed for /user/benz/language, 2024-03-19T14:31:44+01:00

N-Gram-Based Text Categorization
https://puma.uni-kassel.de/bibtex/26922ef3ab653ff35cbe9117227816a24/benz
benz, 2011-02-04T16:10:18+01:00
identification language text_categorization
William B. Cavnar and John M. Trenkle. Symposium On Document Analysis and Information Retrieval, pages 161-175. Las Vegas, (1994)

Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems. We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust.
This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages. The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80% correct classification rate. There are also several obvious directions for improving the system's classification performance in those cases where it did not do as well. The system is based on calculating and comparing profiles of N-gram frequencies. First, we use the system to compute profiles on training set data that represent the various categories, e.g., language samples or newsgroup content samples. Then the system computes a profile for a particular document that is to be classified. Finally, the system computes a distance measure between the document's profile and each of the

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence)
https://puma.uni-kassel.de/bibtex/225110e6691b5ee9dbe97216ce087487f/benz
benz, 2011-02-04T16:10:02+01:00
computer lecture computational processing nlp language natural dspp linguistics science
Daniel Jurafsky and James H. Martin. Prentice Hall, 1st edition, (2000). New edition coming in spring 2008.
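
The abstract of "N-Gram-Based Text Categorization" in this listing outlines three steps: compute N-gram frequency profiles for each category from training data, compute a profile for the document to classify, and compare the document's profile against each category profile with a distance measure. A minimal Python sketch of that pipeline might look as follows. The abstract is cut off before it names the distance measure, so the rank-based "out-of-place" distance used here is an assumption, as are the function names, the underscore padding, the N-gram lengths 1 to 3, and the 300-entry profile cutoff.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Rank character N-grams (N = 1..n_max) by frequency, most frequent first.

    n_max, top_k, and the underscore padding are illustrative assumptions,
    not values taken from the listing.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, category_profile):
    """Sum of rank differences between two profiles (assumed distance measure).

    An N-gram missing from the category profile costs the maximum penalty,
    which is the length of the category profile.
    """
    rank = {gram: r for r, gram in enumerate(category_profile)}
    max_penalty = len(category_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, category_profiles):
    """Return the category whose profile is nearest to the document's profile."""
    doc_profile = ngram_profile(text)
    return min(
        category_profiles,
        key=lambda cat: out_of_place_distance(doc_profile, category_profiles[cat]),
    )


if __name__ == "__main__":
    # Hypothetical one-sentence training samples; real use needs far more text.
    training = {
        "english": "the quick brown fox jumps over the lazy dog",
        "german": "der schnelle braune fuchs springt über den faulen hund",
    }
    profiles = {cat: ngram_profile(sample) for cat, sample in training.items()}
    print(classify("the dog jumps over the fox", profiles))
```

Working with ranks instead of raw counts keeps the comparison independent of document length, which is one plausible reason to prefer a rank-based distance in a sketch like this.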