Types of Information Retrieval (IR) Databases
Types of indexes
The NISO technical report (Anderson 1997a) identifies more than 30 types of indexes used for information retrieval. Because indexes are so central to IR databases, influencing as they do the methods for the representation of messages, texts and documents on the one hand and the methods for searching and retrieval on the other, these types of indexes correspond to types of IR databases. They are listed here, with examples. The intent of the NISO technical report, and of this book, is to address design principles that apply to every kind of index and IR database that is intended to describe messages, texts and documents and to provide access to them for subsequent retrieval.
Attributes of IR databases, of indexes
Like any complex entity, IR databases and their indexes can be categorized by many different attributes. The major ones are:
● the kinds of objects represented in index terms, headings, and entries;
● the kinds of index terms used;
● the kinds of indexable matter used for indexing;
● the methods for presenting the index to the user and the concomitant method for searching made available to the user;
● the arrangement of entries;
● the methods for analysis of message content;
● the methods for term selection for indexing;
● the methods for term combination in index headings;
● the methods for term combination in searching;
● the kinds of documents being indexed;
● the medium of the IR database;
● the proximity of the documents being indexed to the IR database itself;
● the size of documentary units;
● the periodicity of the IR database;
● the authorship of the database.
The types of IR databases and indexes listed below will mention many complexities that haven’t been explained yet. After all, most of the book is yet to come. So don’t worry. The purpose of this list is to emphasize the wide scope of IR database and index possibilities. It can also be used for reference, later on, simply to review some of the choices available in IR database and index design. So the first time through, just scan it, and don’t worry about the details.
Here is this complex list laid out, one criterion at a time, with some explanation and with some examples of real, existing IR databases.
1.5.1. Kinds of Objects Represented in Index Terms, Headings, and Entries.
Indexes to authors, topics, features
The major categories of objects represented in the terms, headings, and entries of indexes are the persons and organizations responsible for the creation of messages, texts, and documents, and the topics and features of these messages, texts, and documents.
Indexes to authors, illustrators, editors, translators, publishers
a. indexes to persons and organizations responsible for messages, texts, and documents:
i. author indexes.
ii. illustrator indexes.
iii. editor indexes.
iv. translator indexes.
v. publisher indexes.
Indexes to composers, choreographers, lexicographers, painters, sculptors
Depending on the nature of messages, authors can be writers, composers (of music), choreographers (of dance), lexicographers (of dictionaries), painters, sculptors, etc.
Indexes to subjects, places, institutions, documents, laws, quotations, Bible verses
b. indexes to topics addressed in messages and texts.
i. general subject indexes.
ii. specialized indexes to types of subjects, such as places, persons, institutions, operations, and documents (e.g., laws, quotations, Bible verses), etc.
Indexes to features
c. indexes to features of messages, texts, and documents.
Indexes to titles
i. title indexes.
Indexes to genres, science fiction, novels, fiction, short stories, poems
ii. genre indexes, e.g., an index to science fiction novels or short stories or poems.
Indexes to document numbers, international standard numbers
iii. document number indexes, e.g., an index to ISBNs (international standard book numbers).
Note: The author of a message and its text is perhaps its most important feature, so category 1.5.1.a could have been subsumed under this more general category — but persons and institutions responsible for documents get their own category because they are so important.
1.5.2. Kinds of Terms Used.
Index terms usually consist of words, but they can also consist of numbers of various types and also other types of specialized symbols.
Role of words in index terms
a. word indexes.
Word indexes can be further categorized by the types of words, e.g.,
Role of proper nouns, common words in index terms
i. proper nouns — names of persons, corporate bodies, places.
ii. common words
Role of numbers in index terms
b. numerical indexes.
Role of symbols in index terms
c. indexes using specialized symbols
Role of mathematical symbols in index terms
i. mathematical symbols.
Role of chemical symbols in index terms
ii. chemical symbols.
Role of musical symbols in index terms
iii. symbols representing music.
1.5.3. Kinds of Indexable Matter Used.
Full text as basis for indexing
a. indexes based on the full text of documentary units.
b. indexes based on summaries of documentary units, e.g.,
Titles as basis for indexing, title indexes
i. indexes based on titles only.
Abstracts as basis for indexing
ii. indexes based on titles and abstracts.
c. indexes based on portions of documentary units, e.g.,
Lead paragraphs as basis for indexing
i. lead paragraph only.
Tables of contents as basis for indexing
ii. tables of contents only.
Introductory matter as basis for indexing
iii. introductory matter.
Reference citations as basis for indexing, citation indexes
iv. reference citations (for citation indexes).
First lines as basis for indexing
v. first lines (as in poems).
1.5.4. Presentation and Methods for Searching.
There are two fundamentally different ways that IR database indexes can be searched: (1) visual scanning and examination of index headings, and (2) mechanical or electronic symbol comparison and matching. (It is also possible to create Braille indexes that are scanned by touch and audible indexes that are listened to, but the first two approaches are the major ones.) The first method is performed by humans. The second method is now performed by computer algorithms. (Prior to the computer, various mechanical means were devised for comparison and matching.) For the first method, the index must be displayed for human visual inspection. For the second method, the user does not necessarily see the index. Some of the best IR designs will combine these two approaches, so that users can take advantage of sophisticated electronic machine matching algorithms but can also see displays of index headings when they wish to browse or make some preliminary judgments about documents or the direction of a search. (Here the focus is on methods of searching. An IR database that provides only for electronic machine matching, with no display of indexes, will still display the results of a search for human examination and consideration!) So we have IR databases that provide:
a. displayed indexes for visual searching.
b. non-displayed indexes for searching by means of computer matching algorithms.
c. both types of indexes, for both types of searching.
1.5.5. Arrangement of Entries.
Presentation of IR databases; internal computer representation not addressed
Non-displayed indexes may have internal arrangements to facilitate computer comparison and matching, but this book does not address these internal computer issues. The methods and techniques for internal electronic representation and manipulation are constantly changing, and their mastery requires expertise and experience separate from that required for high quality design of IR databases from the point of view of their presentation to and use by human users. The focus of this book is on the presentation of IR databases and their indexes to users. Many different computer methods can be used for the same type of presentation.
Arrangement of displayed indexes
So here, we focus on the arrangement of displayed indexes — those indexes designed for human visual scanning and inspection.
Such indexes must have an order that facilitates the location of particular entries. Here are the choices:
Alphanumeric arrangement of displayed indexes
a. alphabetic or alphanumeric indexes. At first glance, this is a simple category, and a very popular one for indexes, but as discussed above in section 1.4 on standards, there is no agreement on what constitutes proper alphabetic or alphanumeric order. Consequently, there are many different approaches and versions. These shall be taken up in detail later in section 17.1 on alphanumeric displays.
Relational arrangement of displayed indexes
b. logical, relational or classified indexes. Here, headings are arranged according to various types of relationships among the concepts represented. Criteria for such arrangements can be increasing or decreasing importance, chronology, class inclusion (creating hierarchies from broad topics to narrow ones), or a whole and its parts. These arrangements are often called “classified,” but this term tells you nothing about the basis of the arrangement, especially because the classes represented by index headings can also be arranged alphabetically. Relational arrangements will be discussed in some detail later on in section 17.3.
Alphabetical-relational arrangement of displayed indexes
c. combined alphabetical-relational indexes. Some arrangements combine aspects of alphabetical and relational criteria. They are sometimes called “alphabetico-classed.” One approach is to arrange broad classes in alphabetical order, with subordinate classes arranged under broad classes on the basis of various relational criteria. The opposite approach is also used. Broad classes are arranged on the basis of relational criteria, but narrower, subordinate classes may be arranged in alphabetical order. The Library of Congress classification uses this latter approach quite frequently.
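The lack of agreement on alphabetic order noted in item a can be made concrete with a small sketch. It compares two commonly described filing conventions, word-by-word and letter-by-letter, on the same headings. The sample headings and the function names are illustrative assumptions, not taken from the NISO report, and real filing rules handle many more cases (punctuation, numerals, initial articles) than this sketch does.

```python
def word_by_word_key(heading):
    # Compare headings one word at a time, so a shorter first word
    # ("New") files before any longer word that begins with it ("Newark").
    return heading.lower().split()


def letter_by_letter_key(heading):
    # Ignore spaces and punctuation and compare the unbroken run of
    # letters and digits.
    return "".join(ch for ch in heading.lower() if ch.isalnum())


# Hypothetical index headings, chosen to expose the difference.
headings = ["New York", "Newark", "New Jersey", "Newton", "News"]

word_by_word = sorted(headings, key=word_by_word_key)
letter_by_letter = sorted(headings, key=letter_by_letter_key)

print(word_by_word)      # ['New Jersey', 'New York', 'Newark', 'News', 'Newton']
print(letter_by_letter)  # ['Newark', 'New Jersey', 'News', 'Newton', 'New York']
```

Both results are "alphabetical," yet "New York" files second in one and last in the other, which is why a displayed index needs a stated filing rule.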
1.5.6. Methods for Analysis.
As with the arrangement of entries, there are two fundamentally different approaches to the analysis of messages for indexing, with a third approach combining elements of the two basic approaches. Thus we have:
Human intellectual analysis of texts for indexing, human indexing
a. indexes based on human intellectual analysis of messages and texts.
Computer algorithmic analysis of texts for indexing, automatic indexing
b. indexes based on various computer algorithms for the analysis of machine-readable texts. This is often called “automatic indexing.”
Combination of automatic indexing and human indexing
c. indexes based on combinations of computer and human analysis.
1.5.7. Methods for Term Selection.
Index terms can be extracted from texts (if the texts consist of words) or they can be assigned to texts. Extractive indexes are most often associated with automatic computer-based indexing, but human indexers can also limit their selection of terms to those appearing in language texts. Assignment indexing is done most often by human indexers, but computer algorithms also have been developed to assign terms not found in texts. Thus we have:
Extraction of index terms
a. indexes based on extracted terms.
Assignment of index terms
b. indexes based on assigned terms.
Combination of extraction and assignment of index terms
c. indexes based on both the extraction and the assignment of terms.
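The difference between extraction and assignment can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stopword list, the sample title, and the tiny rule table mapping trigger words to assigned terms are all hypothetical, and real automatic indexing uses far richer methods.

```python
import re

# Hypothetical stopword list: words too common to serve as index terms.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Hypothetical assignment rules: a word found in the text triggers a
# controlled-vocabulary term that need not appear in the text itself.
ASSIGNMENT_RULES = {
    "cataloging": "Bibliographic control",
    "thesaurus": "Vocabulary control",
    "thesauri": "Vocabulary control",
}


def extract_terms(text):
    # Extraction: every index term is a word taken from the text.
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w not in STOPWORDS})


def assign_terms(text):
    # Assignment: terms come from a separate vocabulary, selected by rule.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted({term for trigger, term in ASSIGNMENT_RULES.items()
                   if trigger in words})


title = "The use of thesauri in cataloging"
extracted = extract_terms(title)
assigned = assign_terms(title)

print(extracted)  # ['cataloging', 'thesauri', 'use']
print(assigned)   # ['Bibliographic control', 'Vocabulary control']
```

An index of type c would simply keep both lists for the same documentary unit.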
1.5.8. Methods for Term Combination.
Necessity for combination of index terms
Indexes must provide the capability to search for multiple topics or features at the same time. If indexes provided access to only one topic or feature at a time, they would be pretty worthless. Can you imagine searching a large database for everything related to “United States,” with no capability of combining that term with anything else that you want?
Methods for combination of index terms
There are two basic types of methods for the combination of terms, and these are correlated with whether the index is displayed or non-displayed. Thus we have:
Precoordinate combination of index terms
a. precoordinate term combination for indexes that are displayed — terms are combined (or coordinated) before the index is presented to the user for searching.
Postcoordinate combination of index terms
b. postcoordinate term combination for indexes that are non-displayed — terms are combined (or coordinated) after access to the index is presented (via a search interface) to the user, at the time of the search.
Precoordinate and postcoordinate combination of index terms; information science as example of bound term
c. indexes based on both precoordinate and postcoordinate terms. Precoordinate terms are often used in non-displayed indexes to represent complex concepts and to prevent the inaccurate or inappropriate combination of discrete terms. (For examples, see the discussions of pre- and postcoordinate syntax in section 1.3 on terminology.)
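The two methods of term combination can be contrasted in a short sketch. The records, the "--" separator in headings, and the function names are assumptions made for illustration. The posting sets stand in for whatever internal representation a real system uses, which this book does not address.

```python
from collections import defaultdict

# Hypothetical documentary units, each with its index terms.
records = {
    1: ["United States", "History"],
    2: ["United States", "Economic policy"],
    3: ["Canada", "History"],
}


def precoordinate_headings(records):
    # Precoordination: terms are combined into one heading before the
    # index is displayed, and the user scans the finished headings.
    return sorted(("--".join(terms), doc_id)
                  for doc_id, terms in records.items())


def build_postings(records):
    # Postcoordination keeps each term separate, with the set of
    # documentary units to which it applies.
    postings = defaultdict(set)
    for doc_id, terms in records.items():
        for term in terms:
            postings[term].add(doc_id)
    return postings


def search_all(postings, *terms):
    # Terms are combined only at search time (here, a Boolean AND).
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


headings = precoordinate_headings(records)
postings = build_postings(records)
hits = search_all(postings, "United States", "History")

for heading, doc_id in headings:
    print(heading, doc_id)
print(hits)  # {1}
```

The precoordinate display lists "United States--History" as a ready-made heading, while the postcoordinate search reaches the same documentary unit by intersecting two separate terms at the moment of the query.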
1.5.9. Kinds of Documents Being Indexed.
Here, IR databases are characterized not on the basis of their own features, but on the basis of the types of documents that are included or represented and indexed for the database. These are as various as all the existing types of documents, and new types are being developed or invented all the time. Only some representative examples are listed here:
a. periodicals: articles in periodicals or whole periodicals (complete sets); also specialized forms of periodicals or serials, such as newspapers, newsletters, etc.
b. books and monographs, including “back-of-the-book” indexes for single books.
d. fiction; also specialized types of fiction, such as science fiction, romance, historical novels, mysteries, fantasy, short stories.
e. film: motion pictures and other types of film or photographic media (such as slides, filmstrips, photographs).
f. video; video recordings.
g. pictures: reproductions, paintings, drawings, photographs, etc.
h. maps of all types, two-dimensional, three-dimensional; flat maps and charts; globes; geographical information systems.
i. music and sound documents, including all sorts of sound recordings — spoken, music, and other types of sounds, such as bird songs, animal sounds, weather sounds, etc. — on various media. Also musical scores.
j. machine-readable texts.
k. computer software.
l. internet; including world-wide web resources.
1.5.10. Media of IR Databases.
The media of IR databases are as varied as the media of documents in general — after all, IR databases are documents too. The major media used for IR databases are:
a. paper as medium for IR databases. Before the development of paper, IR databases were recorded on its precursors, such as stone and clay tablets, parchment and other animal skins, papyrus, tree bark and other vegetable matter. Paper media include card stock, which was the most popular medium for library catalogs for about a century, until electronic media became viable and popular.
b. microforms as media for IR databases. IR databases have appeared in various styles of microfilm and microfiche.
c. electronic media for IR databases. This broad category includes an ever increasing variety of formats, such as compact discs (CDs, CD-ROMs), larger optical disks, magnetic disks and tape, as well as online databases maintained in accessible computer media and of course websites.
d. sound media for IR databases. Spoken indexes sometimes accompany sound collections and archives. These are similar to those ever more pervasive voice mail menus that confront you when you call many offices and agencies. Sound indexes can be especially useful for persons with visual impairments.
e. braille media for IR databases. Braille is usually recorded on paper, but because it is a specialized combination of symbols for persons with visual impairments, it gets a separate listing.
1.5.11. Proximity of Documents Being Indexed.
a. full-text databases. A full-text database contains the full text of the documents to which it points. This category includes books published with traditional back-of-the-book indexes, as well as the increasingly popular full-text electronic IR databases, ranging from newspaper and periodical databases to encyclopedias and other reference works of various sorts and digital libraries. If you are surprised to find the printed book with index in this category, just remember that here too, the index is combined with the full text of the document being indexed, so it qualifies!
b. reference databases. Reference databases provide access to documents that are not included in the database. Instead, the IR database provides some sort of locator, such as a bibliographic citation and possibly a call number or notation that can be used to obtain the full document from a library collection, publisher, the internet, or other distributor or document delivery service.
A library catalog may be seen as a reference database that refers to items in the library’s collection. On the other hand, the library as a whole, including its catalog, may be considered a full-text database, because the documents to which the catalog refers are within its collections (unless they are checked out!).
1.5.12. Types and Sizes of Documentary Units.
Here we categorize IR databases and their indexes with respect to the kind of documentary units (parts of documents, complete documents, collections of documents) that are analyzed for retrieval. These units depend, of course, on the type of document. We give examples mostly from language documents, but analogous examples could be given from visual image documents (photographs, paintings), moving image documents (films, videos), sound documents, etc. In the past these units were often called “bibliographic units,” because they were described in bibliographies.
definition of bibliography
In this book, we have subsumed the term “bibliography” in the broader, newer term “IR database,” but “bibliography” and “bibliographies” are fine old words that mean writing (graphy) about books (biblio), thus they have come to mean lists and descriptions of books. There is no reason to limit their meaning to “books,” because the “biblio” part of the word comes from the Greek for papyrus leaves! So by extension, bibliographies can deal with messages and texts in any format and medium, just as IR databases can and do.
a. IR databases for small documentary units (parts of complete documents), such as lines, sentences, paragraphs, and pages, or frames in a videotape or segments of pictures or maps. These indexes lead the user inside the full document. Sometimes such small units are referred to as “information units” because they are more likely to lead directly to a precise message that may answer the searcher’s query.
b. IR databases for complete documents, e.g., periodical articles, chapters in collections, papers in conference proceedings, stories and poems in anthologies, and monographs.
c. IR databases for collections of documents, e.g., anthologies; complete sets of periodicals, serials and series; archives; libraries, etc.
1.5.13. Periodicity of IR Databases.
a. monographic databases. Like any document, an IR database can be a monograph — a one-time publication, sometimes called a “closed-end” database or index.
b. serial databases. Or an IR database can be designed for updating on a regular or irregular basis. These databases are sometimes called “continuing” or “open-end” databases or indexes.
1.5.14. Authorship of IR Databases.
Finally, IR databases can be categorized by authorship, whether an IR database has been created by one or a small number of individuals who can be named and credited with its creation or by a large organization, with the participation of many persons, so that the personal influence of individual authors is not apparent. IR databases relying on automatic indexing are created, in part, by machine algorithms, but human beings “authored” the algorithms that are used.
1.5.15. Continuing Examples.
examples of IR database design
Throughout this book, design principles related to the topic of each chapter will be applied to three prominent types of IR databases — (1) a book or monograph with its own index (often called a back-of-the-book index); (2) an indexing and abstracting service for a scholarly discipline; and (3) a full-text encyclopedia, which can be seen as a digital library of messages and texts.
monographs as examples of IR databases
For the example of a single book as an IR database, indexes will be designed for both electronic and print media. The index at the end of this book illustrates the implementation of the design for the print-medium index.
indexing and abstracting services as examples of IR databases
The example of a scholarly indexing and abstracting service will be one that covers the literature of library and information science. Every reader of this book likely has some familiarity with, or at least interest in, these disciplines.
full-text encyclopedias and digital libraries as examples of IR databases
The example of a full-text encyclopedia (or digital library) will be an IR database consisting of digital texts on library and information science.