The case for auto-classification

A Stream of Auto-Classification Consciousness by Randolph Kahn, ESQ, Information Nation (Feb 29)

Why not do more auto-classification of data? Lawyers already do it for e-discovery cases. People are notoriously poor at indexing content (unless they are trained indexers, we might hastily add).

“We have empirical data to support the proposition that employees classify and code information way worse than computers, by a long shot. Yet most companies continue to rely on their employees to manage information.” “[T]echnology-assisted process, in which only a small fraction of the document collection is ever examined by humans, can yield higher recall and/or precision than an exhaustive manual review process, in which the entire document collection is examined and coded by humans.”
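For readers unfamiliar with the two measures named in the quote, here is a minimal sketch of how recall and precision are computed. The document IDs and review results are made up for illustration and are not taken from Kahn's article or any study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: what share of the documents we flagged were actually relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant documents we managed to flag.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: documents 1-4 are the relevant ones.
relevant = {1, 2, 3, 4}
review_a = {1, 2, 7, 8, 9}  # finds 2 of 4 relevant, plus 3 false hits
review_b = {1, 2, 3, 9}     # finds 3 of 4 relevant, plus 1 false hit

print(precision_recall(review_a, relevant))  # (0.4, 0.5)
print(precision_recall(review_b, relevant))  # (0.75, 0.75)
```

A review can score well on one measure and badly on the other, which is why the quote names both.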

Classification for managing information

Get Your Data Under Control with Automated Content Categorization (PDF), AIIM, (May 2011)

White paper from Recommind. Classification is still the method for managing data, but manual classification can’t handle high volumes and automated classification makes errors. Recommind claims to have a solution.

“Today CIOs have choices beyond the manual and automated approaches, such as Recommind Decisiv™ Categorization. As part of Recommind’s suite of data management tools, Decisiv Categorization offers automated classification through supervised learning.
The technology effectively leverages the knowledge of human beings to teach technology to better automate classification.”
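To make "supervised learning" concrete: people label a small set of example documents, and the software learns from those labels to categorize the rest. The sketch below uses a tiny naive Bayes text classifier as a stand-in. It is not Recommind's method, which the white paper excerpt does not describe, and the training documents and category names are invented.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category from (text, category) training pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, model):
    """Pick the category with the highest naive Bayes log-score."""
    word_counts, doc_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical examples labeled by a human reviewer.
training = [
    ("invoice payment due amount", "finance"),
    ("quarterly budget invoice report", "finance"),
    ("employment contract termination clause", "legal"),
    ("contract liability clause dispute", "legal"),
]
model = train(training)
print(classify("overdue invoice amount", model))        # finance
print(classify("dispute over contract clause", model))  # legal
```

The human effort goes into the handful of labeled examples; the classifier then applies what it learned to documents nobody has read.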

Developing Taxonomies
Ontologies, Taxonomies, Thesauri: Learning from Texts, by Christopher Brewster and Yorick Wilks; Department of Computer Science, University of Sheffield, Sheffield, UK (2004)

On automating the creation of taxonomies:

“This paper takes the approach that, given the ‘info-smog’ we live in (AKT 2001), hand-crafting is impractical and undesirable. While it is still a major research challenge to construct ontologies entirely automatically, the current tools available from the Natural Language Processing community make it possible to automate the task to a large extent and reduce manual input to where it makes the most qualitative difference. In Section 2, we describe discuss in greater detail the problem of manually constructing ontologies and argue for the use of text corpora as the main source of knowledge. In Section 3, we present a number of criteria as a guide for the method that need to be used for the automation of ontology construction. In Section 4, we present a number of methods for constructing ontologies from texts based on the criteria presented. Section 5 considers how to bridge the gap between the implicit knowledge assumed by a given text and the actual explicit knowledge present in the texts.”

Concept searching

Getting to the point, KMWorld (Mar 18, 2009)

SharePoint, taxonomies, and automatic classification seem to come together in ConceptSearching.

“Content Types can be used to enforce metadata governance, adhere to policies and drive workflows in line with business processes. Included in the new release is the ability to assign taxonomies to specific Content Types. Documents that correspond to the selected Content Types will be classified and documents that do not correspond to a content type or do not include some metadata elements that a specific content type has specified will not be classified.”
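The rule described in that quote can be sketched as a simple gate: a document is classified only if it matches a selected Content Type and carries the metadata that type requires. This is an illustration of the logic only, not the SharePoint or conceptClassifier API; the content types, field names, and the taxonomy_for function are all hypothetical.

```python
# Hypothetical mapping of content types to taxonomies and required metadata.
TAXONOMY_BY_CONTENT_TYPE = {
    "Contract": "Legal taxonomy",
    "Invoice": "Finance taxonomy",
}
REQUIRED_METADATA = {
    "Contract": ["counterparty", "effective_date"],
    "Invoice": ["vendor"],
}


def taxonomy_for(document):
    """Return the taxonomy to classify against, or None to skip the document."""
    content_type = document.get("content_type")
    if content_type not in TAXONOMY_BY_CONTENT_TYPE:
        return None  # does not correspond to a selected content type
    metadata = document.get("metadata", {})
    required = REQUIRED_METADATA.get(content_type, [])
    missing = [field for field in required if not metadata.get(field)]
    if missing:
        return None  # lacks metadata the content type specifies
    return TAXONOMY_BY_CONTENT_TYPE[content_type]


complete = {"content_type": "Invoice", "metadata": {"vendor": "Acme"}}
incomplete = {"content_type": "Contract", "metadata": {"counterparty": "Acme"}}
unknown = {"content_type": "Memo", "metadata": {}}

print(taxonomy_for(complete))    # Finance taxonomy
print(taxonomy_for(incomplete))  # None
print(taxonomy_for(unknown))     # None
```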

Learn more about ConceptSearching’s conceptClassifier and its capabilities for metadata generation, taxonomy classification, and taxonomy creation at

The page also links to case studies such as Microsoft’s Two Page Partner Case Study.

More about auto-classification

Automatic Classification: A Panel Discussion FUMSI (Jan 2009) — “Karen Loasby discusses automatic classification with freelance information architect Helen Lippell and BBC information architect Silver Oliver.”

Panelists covered a lot of ground in this discussion: types of auto-classification systems (2), the problems the English language presents, taxonomies and folksonomies, and the situations in which auto-classification is and is not suitable.

“Taxonomies can be the glue of an automatic classification implementation. They are the vocabulary that rules, whether Boolean or statistical, are built upon, allowing concepts to be applied consistently to content. Taxonomies also provide the framework of relationships, such as synonyms and related terms between concepts – they help the automatic system to understand the domain in the way that users do.”
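A small sketch may help show what "taxonomy as glue" looks like in practice. Each concept below carries synonyms and related terms, and a simple Boolean rule (match any synonym) tags text with the concept. The taxonomy entries and the tag_concepts function are invented for illustration and do not come from the panel discussion.

```python
import re

# Hypothetical mini-taxonomy: each concept has synonyms and related terms.
TAXONOMY = {
    "Football": {"synonyms": {"football", "soccer"}, "related": ["Sport"]},
    "Elections": {
        "synonyms": {"election", "elections", "ballot", "poll"},
        "related": ["Politics"],
    },
}


def tag_concepts(text):
    """Return matched concepts mapped to their related terms."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    matched = {}
    for concept, entry in TAXONOMY.items():
        # Boolean rule: the concept applies if ANY of its synonyms appears.
        if words & entry["synonyms"]:
            matched[concept] = entry["related"]
    return matched


print(tag_concepts("Soccer fans queue at the ballot box"))
# {'Football': ['Sport'], 'Elections': ['Politics']}
```

Because "soccer" and "football" resolve to the same concept, content gets tagged consistently whichever word the author happened to use.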






