legalthesaurus.org

Google and Indexes

In a Google-dominated online world, indexing is still important, but as information professionals we need to debate the virtues of codes and of controlled versus uncontrolled indexing, as well as the challenges of bibliographic searching versus fulltext searching. The business of indexing also needs to be addressed – how can indexing time lags be minimised, how can costs be contained? What are the innovative indexing initiatives of major players such as Factiva, Derwent, Dialog, and others? How successful are tools such as mapping, weighting algorithms, proximity operators and qualifiers, and clustering search engines? Do metatags solve the problems of the Invisible Web and reveal its contents? What are the keys to success for human indexers in a fulltext search engine world?

A PASSIONATE SEARCHER’S VIEW OF INDEXING AND INDEXERS

On a hot summer’s day in Melbourne in the late 1960s I was doing a literature search for one of the scientists using several tools – Biological Abstracts, Index Veterinarius, Veterinary Bulletin, and so on. The CSIRO Animal Health Research Laboratory library in the Veterinary Precinct at the University of Melbourne had wooden shelves to the ceiling, wooden ladders, and shiny, highly polished vinyl tiles. I climbed to the top of the ladder to “retrieve” a large blue bound volume of Biological Abstracts from the top shelf. The bindings had stuck in the heat and I had to use some force to separate the volume from its neighbours. The force not only removed the volume but it also moved the ladder! As I tumbled to the floor on top of the ladder with my skirt over my head, a startled research scientist who had been snoozing in the large leather chair woke up and said “I know the librarian is alive as I can hear muffled swearing emerging from under her skirt”!

Not long after that, the National Library announced a series of MEDLARS workshops – the batch predecessor to MEDLINE – and I was keen to attend and learn how I could use a computer to do literature searches. Betty Doubleday, the CSIRO Librarian at the time, said “Bah! There will never be a role for computers in libraries” and so my odyssey as “a passionate searcher” began.

Not being convinced the Chief Librarian was right, I left CSIRO, flew to Europe and killed a few birds with one stone by getting a job in continental Europe, with a database producer (Embase) in Amsterdam – close enough to the ski slopes of Europe and so on.

I thought I needed to learn how to program, but a wonderful mentor, Dr. Pierre Vinken, at the time Director of Excerpta Medica, advised me there were many things other than programming I needed to learn. In a shortish stint at Excerpta Medica (10 months in all) I was introduced to many things that I doubt I could have learned in a library.

I learned about:

· The politics of database production, the competition between MEDLINE and EMBASE, and the results of different indexing philosophies at the commercial Excerpta Medica and the government National Library of Medicine. At EM the indexing terms were chosen by subject specialists and supported by a massive thesaurus called MALIMET. At NLM the indexing terms were chosen by librarians supported by MeSH headings.

· The importance of secondary terms and codes.

· The difference between abstracting and indexing services, and between manual sources and electronic sources. In fact, I was instructed to write a letter to the editor of the British Journal of Industrial Medicine explaining the differences as the difference between a laboratory animal and a human!

· The value of overlapping services – even if they are competitors – BIOSIS, EMBASE, MEDLINE etc.

· The business of database production – marketing, negotiating, pricing, staffing etc.

· How Utopian and unattainable a universal indexing language is, a lesson I learned when I wrote the Dutch submission to UNISIST and attended the UNISIST meeting at UNESCO in 1972.

On return to Australia, I reluctantly accepted a job in the library at ACI in November 1972 – reluctant because I liked the business side of our information world – but ACI was intriguing. In 1972 the library had its own database, skilled staff in a business library in Melbourne and a technical library in Sydney, a budget of $250,000, and access to a strong computer department that was later to develop AUSINET. Our own internal database, then called LISARD, an acronym (chosen by the Systems Librarian Dagmar Schmidmaier) for Library and Information Service Automated Retrieval of Data, was a bibliographic database with abstracts. [1]

Between 1972 and 1975, having seen STAIRS demonstrations in Australia and Europe, I recommended to the ACI computing department that they lease the STAIRS programs and offer external services as well as internal services, but my suggestions were rejected. In 1975 I had also seen Dialog in operation at Shell in The Hague and late in 1975 I introduced online searching to ACI with access to Dialog and Orbit and a year or two later Finsbury’s Textline, which later became Reuters Business Briefings. By the end of the 1970s ACI Computer Services introduced AUSINET using STAIRS and we in the library eventually were able to move our internal database and even our catalogue to the STAIRS platform! So my progress as “a passionate searcher” and now a database producer continued.

In 1988 ACI was taken over. I was made redundant but I was invited by the State Library of NSW to establish Information Edge as a joint venture. I bought the LISARD database, renamed it EDGE, and became an Information Broker. In 1996 I acquired the State Library’s shares in the business and so I not only became a passionate searcher but I now became a passionate searcher with a business to maintain.

I don’t intend to bore you with a biography. Instead I would like to address a number of issues that I feel strongly about as a passionate searcher and I would like to explore issues that I think as information professionals we need to debate. I intend to address:

· The role of indexing and indexes today in a fulltext world. Why is indexing still important, even in a Googled world? Searching for high recall versus high precision.

· The value of controlled versus uncontrolled indexing, the use of codes, the value of weighting, the need to be able to limit searches easily to date ranges, the challenges of bibliographic searching and fulltext searching

· The business of indexing, how timely, how cost efficient

· How do some of the big players index? Factiva? Promt? Derwent? And closer to home databases available on Informit? How valuable is mapping?

· Experiences using weighting tools and proximity operators and qualifiers versus indexes

· Internet searching – Google, clustering search engines, the Invisible Web, metasearch engines

The role of indexing and indexes today in a fulltext world. Why is indexing still important, even in a Googled world? Searching for high recall versus high precision.

The Google spiders index whatever comes their way – whether the documents are peer reviewed or not, whether they are about Uncle Bat’s fishing trip or a CSIRO fisheries experiment – and the spiders cannot type, so they do not venture too far into many sites with rich content, especially databases and other dynamic content. The same can be said of other search engine spiders. So already we have some key advantages of human indexers – they can type, and they usually concentrate on information resources of some quality. But more than that, human indexers can evaluate the material they are indexing and with that evaluation provide clues to searchers as to what they will find when they retrieve a document.

When considering searching, it is important I believe to recognise that there are different types of searches. Some require high precision – every article must be highly relevant – others require high recall – i.e. we will take some irrelevant articles as long as we can be sure we have found absolutely everything. These latter searches are vital for so called “prior art” searches – searches used in intellectual property type scenarios. Then there are searches that require a few inspiring articles say for a lecture or articles to measure media coverage where both high precision and high recall are needed over a recent time frame. Searching therefore at a professional level is not simply a question of “plug and play” – plug a couple of terms into a search box and accept the results.
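The trade-off between the two kinds of search can be made concrete with a small sketch. It assumes only the standard definitions of precision (the share of retrieved items that are relevant) and recall (the share of relevant items that were retrieved); the document numbers are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection in which documents 1 to 10 are the relevant ones.
relevant = set(range(1, 11))

narrow_search = {1, 2, 3, 4}          # few results, every one relevant
broad_search = set(range(1, 41))      # finds everything, plus 30 irrelevant items

narrow = precision_recall(narrow_search, relevant)
broad = precision_recall(broad_search, relevant)
print("narrow search (precision, recall):", narrow)
print("broad search  (precision, recall):", broad)
```

A "prior art" search corresponds to the broad case, where the irrelevant items are the price paid for missing nothing.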

The value of controlled versus uncontrolled indexing, the use of codes, the value of weighting, the need to be able to limit searches easily to date ranges, the challenges of bibliographic searching and fulltext searching

What sort of indexing are we talking about? Are we talking about Thesaurus controlled indexing? And if so how complicated is the Thesaurus? Is it a multi-volume type job like the LC subject headings? Is it a small list like the old Economic Literature Index – 9 pages arranged in 2 alphanumeric columns?

Big professional databases such as Chemical Abstracts, Engineering Index, MEDLINE, Embase, and BIOSIS all use controlled indexing. Big professional online hosts such as Factiva use controlled indexing. Some of the News Limited databases do too. But others do not. And Google does not. I wonder if the advent of online fulltext searching means that the value of controlled indexing is diminishing? Which searchers for instance can afford to purchase a copy of every relevant thesaurus? Many systems these days – Informit, Factiva, Dialog – allow cross file searching so there are even dangers relying on one thesaurus that may not be relevant for the files being searched.

Being able to date limit is to me absolutely essential. And by date limiting I am not interested in knowing when a record was indexed. I am interested in knowing when that information was first published or first amended. It is no different from knowing when a book was published or a new edition published as opposed to when it was catalogued or reprinted. It is disappointing to me when I need to search Australian databases (or WWW sites) and find there is no provision to identify new information easily. I am amazed how often I am unable, especially with collections of Australian data on the Web, to determine when the information was first released – so there is this fuzziness about how current or how obsolete some information actually is.

Of course in recent years there has been a huge growth in fulltext databases. And one of the issues for indexers is – how can you reveal in an index both major and minor concepts of importance? Let’s say there is an article about “hot filling of PET containers” – that is the major concept – but there is also a minor mention, significant to some, that there is a “waisted bottle capable of hot fill that has been made for Coca Cola”. How, in an index, is one able to show the different emphases of these two concepts?

Personally I am also a great believer in the use of codes. Of course we use codes when we catalogue a library collection but the classification systems are used to arrange physical collections. In a database such as COMPENDEX or DERWENT or PROMT and many others CODES can be enormously powerful search aids.

The business of indexing, how timely is the indexing, how cost efficient?

To me the key value of indexing is to lead me to material I may not find easily any other way. But I am often looking for new information in a business competitive world. What is being published about competitive products? What are the business relationships between companies or people? What is known about a particular science? Certainly I sometimes do work of an historical nature – how have certain practices evolved? But very, very rarely am I asked to do searches of a totally time insensitive nature. I would say that 97% of my searches involve seeking very recent information. So immediately this raises the issue of timeliness of indexing.

It frustrates me enormously that some Australian databases are appallingly out of date with their indexing. It seems to be a pattern that follows the well known “cataloguing backlog” which I personally was familiar with in my CSIRO days. But when many files on the major commercial providers are loaded several times daily, such as the newswires, then manual indexing of online databases that are 6 months or more delayed, is clearly totally unsatisfactory to most searchers and in my view simply should not be tolerated. I would go as far as saying such out of date indexing is unprofessional.

This also raises the issue of cost efficiencies. How much should it cost to index articles for a bibliographic database? Because labour is the key cost of indexing, this really boils down to how quickly an article can be indexed, and that in turn depends on the procedures used. The spectrum runs from a haphazard “pick a couple of terms” approach to the thorough, detailed indexing of a service such as AESIS, for instance. AESIS was an earth sciences database created by the Australian Mineral Foundation. It had an excellent, thorough, well constructed thesaurus and in addition to subject and author index terms it also had geographic coordinate indexing so one could search the database for records relating to a geological deposit in a certain region of Australia. There is no question this was a thoroughly professional database with excellent consistent indexing.

But if we assume a base grade librarian could perform the indexing at a salary of say $40,000, with salary associated costs of about 33%, this equals $52,000 or $37.14 per hour cost. [2] This means that for a database adding say 200 records per month at one per hour, the cost of indexing alone would be $7,500 per month and $90,000 for a year. Add the cost of the primary source material, computing, marketing, copyright fees etc. and the cost becomes quite significant, requiring a sufficiently strong market to justify it. And this is a problem I believe for Australian databases. We simply do not have a big enough market to make such services economically viable.
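The arithmetic behind those figures can be laid out as a short sketch. It takes the quoted hourly cost of $37.14 as given rather than re-deriving it from the salary, and the function name and the alternative indexing rate of three records per hour are illustrative assumptions.

```python
HOURLY_COST = 37.14        # dollars per indexing hour, as quoted in the text
RECORDS_PER_MONTH = 200    # monthly additions, as quoted in the text

def indexing_cost(records, records_per_hour, hourly_cost=HOURLY_COST):
    """Labour cost of indexing `records` items at a given indexing speed."""
    hours = records / records_per_hour
    return hours * hourly_cost

monthly = indexing_cost(RECORDS_PER_MONTH, records_per_hour=1)
annual = 12 * monthly

# Illustrative only: the same workload at a faster rate of 3 records per hour.
monthly_at_three = indexing_cost(RECORDS_PER_MONTH, records_per_hour=3)

print(f"monthly: ${monthly:,.0f}   annual: ${annual:,.0f}")
print(f"monthly at 3 per hour: ${monthly_at_three:,.0f}")
```

At one record per hour this gives about $7,428 a month and $89,136 a year, which the text rounds to $7,500 and $90,000. The cost scales inversely with indexing speed, which is why the indexing procedure matters so much.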

Our experience indexing the EDGE database in Information Edge has been different. We have never, even when part of ACI, been able to afford complex detailed indexing. We currently expect an average of 3 articles per hour to be indexed. This is quite hard for some types of articles – some of the longer and more obscure management articles. But for other trade type articles experienced indexers can complete 6 or more an hour. But indexing rules have been specifically designed to

· make the database workable while

· being economically viable.

Briefly we use a combination of the following:

a) Subject codes, e.g. financial management = 041, strategic management = 042, HR management = 060 and so on.

b) Indexers are instructed to choose any terms to represent the major concepts of an article and supplementary terms not already covered in the abstract that may be useful for retrieval, e.g. BEER MARKET SHARE and STATISTICS

c) A database manual outlines the indexing philosophy

d) Indexers are not permitted to use abbreviations unless they are in a 4 page list of controlled terms, e.g. AUST, ROI, WWW. This list is periodically reviewed and updated lists of approved abbreviations are released.

We do not have a complex thesaurus and I acknowledge this is not as professional as databases like MEDLINE or AESIS. However MEDLINE has a huge global market and can afford to spend more on indexing than we can. EDGE has managed to be profitable even with a small customer base.

So it seems to me timeliness and cost efficiency are both goals that should be paramount.

How do some of the big players index? Factiva? Promt? Derwent? And closer to home databases available on Informit? How valuable is mapping?

PROMT when owned by Predicasts was one of my favourite databases to use. I found it a fascinating and rewarding database to use largely because of the quality of its indexing. PROMT originally was an acronym for Predicasts Overview of Markets and Technology and Predicasts had taken the SIC[3] codes as a foundation and developed their own detailed classification system for products, markets and technologies and secondary attributes including countries (e.g. CC=8AUST = Australia), events (e.g. EC=6 = market data and trends, EC=01 = forecasts, trends, outlook) and the codes cascaded.

Because the codes were quite detailed it sometimes was very easy to achieve a high recall and high precision with a simple search statement such as:

s pc=2076266 and ec=6 and cc=8aust and py=1999:2003

to answer the question

What market statistics are available on sesame seed oil in Australia for the years 1999-2003?
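A minimal sketch of how such a coded search behaves is given below. It assumes that "cascading" means a broader code retrieves every narrower code that begins with it, which is modelled here as a simple prefix match; this is an illustration, not the Dialog or Predicasts implementation. The sample records and every code value other than 2076266, 6 and 8AUST are invented.

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    pc: str   # product code
    ec: str   # event code
    cc: str   # country code
    py: int   # publication year

def cascades(query_code, record_code):
    """Assumed cascade rule: a broader code matches any narrower code that starts with it."""
    return record_code.lower().startswith(query_code.lower())

def search(records, pc, ec, cc, years):
    first, last = years
    return [r for r in records
            if cascades(pc, r.pc) and cascades(ec, r.ec)
            and cascades(cc, r.cc) and first <= r.py <= last]

# Invented records for illustration.
records = [
    Record("Sesame oil sales rise", "2076266", "65", "8AUST", 2001),
    Record("Sesame oil plant opens", "2076266", "44", "8AUST", 2002),
    Record("Sesame oil market data", "2076266", "60", "8AUST", 1995),
    Record("Vegetable oil outlook", "2076100", "60", "8AUST", 2000),
]

# Equivalent of: s pc=2076266 and ec=6 and cc=8aust and py=1999:2003
hits = search(records, pc="2076266", ec="6", cc="8aust", years=(1999, 2003))

# A broader product code widens the search to neighbouring products.
broader = search(records, pc="2076", ec="6", cc="8aust", years=(1999, 2003))

print([r.title for r in hits])
print([r.title for r in broader])
```

Under this assumption, shortening the product code is all it takes to move from a high precision search to a higher recall one.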

PROMT now belongs to Gale, and to me as a long time user of the database it seems the costs of indexing have been reduced by limiting the indexing, so that it is not as powerful as it once was. Codes, I am told, no longer cascade and must be truncated.

Dialog, Gale and Informit are database hosts that offer access to individual files. For both Dialog and Informit, it is possible to carry out multifile searching. Both services have made attempts to standardise field names but clearly cannot standardise the indexing terms. Dialog takes multifile searching further and offers to REMOVE DUPLICATES if the titles are identical. So searching for instance MEDLINE, EMBASE, BIOSIS may result in say 300 records. But after removing duplicates one may end up with a final set containing say 150 records. Quite impressive.
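Title-based duplicate removal across files can be sketched in a few lines. This is not Dialog's algorithm; the normalisation of case and punctuation is an assumption added so that trivially different renderings of the same title compare as identical, and the sample records are invented.

```python
import re

def title_key(title):
    """Normalise a title so that case and punctuation differences are ignored (an assumption)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def remove_duplicates(records):
    """Keep the first record seen for each normalised title."""
    seen = set()
    unique = []
    for source, title in records:
        key = title_key(title)
        if key not in seen:
            seen.add(key)
            unique.append((source, title))
    return unique

# Invented multifile result set.
merged = [
    ("MEDLINE", "Bioavailability of drug X in sheep."),
    ("EMBASE", "Bioavailability of Drug X in Sheep"),
    ("BIOSIS", "Bioavailability of drug X in sheep"),
    ("EMBASE", "Pharmacokinetics of drug Y"),
]

unique = remove_duplicates(merged)
print(len(merged), "records before,", len(unique), "after")
```

Because the first occurrence wins, the order in which the files are searched decides whose version of the record, and therefore whose indexing, survives.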

But even so, in constructing the searches one has to be careful, if using the controlled indexing terms, to make sure the idiosyncrasies of each of the 3 thesauri are taken into consideration.

Factiva takes a different approach. While it distributes a lot of data that is also distributed by Dialog (and other services) in individual files, Factiva extracts the individual records and places them in source groupings that can be searched together.

From 2000-2002, I was a member of the Factiva Information Professional Advisory Board which met once a year for those 3 years. This was fascinating for me as Factiva addressed the issues of blending the Reuters Business Briefings service with Dow Jones Interactive (DJI). I was a critic of RBB because I found the indexing almost useless and the search engine so crude that only very limited search statements could be used. So I frequently downloaded hundreds and hundreds of articles to visually browse through to find the material I needed. DJI on the other hand had evolved from a STAIRS based system and had a wonderful search engine with which I once constructed a search statement with about 20 lines of text!

The management team at Factiva under the leadership of the CEO Clare Hart is impressive and very responsive to the needs of information professionals. Indeed Factiva employs several information professionals in key roles. The term “intelligent indexing” was adopted by Factiva to describe the approach Factiva would take to indexing in the new child to be born of the RBB/DJI joint venture. During discussions about this I was anxious as I was critical of the RBB indexing, and I dreaded losing the functionality of the DJI search algorithm. Has it been a success?

The search engine is fine in terms of functionality. At this stage I will be equivocal about the indexing. Recently I was searching for information about “innovative and successful corporate event management”. My client is an event management company that specialises in managing events in the corporate sector. While their business is related to conference and exhibition management, similar to conference organisers that help manage ALIA conferences, their business is not dealing with public events but with private events within the corporate sector. For this search I could not use the Factiva indexing as I found the indexing was neither comprehensive enough nor was it consistent enough. Here are a few examples that illustrate the inconsistent indexing and how some indexing terms were used for articles containing different concepts.
Factiva Dow Jones & Reuters

HD THINK TANK – CORPORATE EVENTS – In praise of parties.

BY By Emma Reynolds.

WC 1,709 words

PD 31 July 2003

SN Marketing Event

SC MAREVE

PG 16

LA English

CY (c) Marketing Event, a Haymarket publication www.haymarketgroup.com, for more information visit www.brandrepublic.com or email info@brandrepublic.com

LP

When the going gets tough, corporate events are often the first activities to be cut from the marketing budget. Bad move. Our expert panel tells Emma Reynolds why.

The corporate event industry received a welcome boost in April’s budget when Chancellor Gordon Brown doubled the tax-free limit on individual spend at corporate parties to £150.

NS

ncal: Calendar of Events | ncat: Content Types | nrgn: Routine General News

PUB

Haymarket Business Publications Ltd.

AN

Document MAREVE0020030731dz7v0001c

This article is clearly a lengthy article about CORPORATE EVENTS being useful for MARKETING within the CORPORATE sector, but note the index terms above.

(c) 2003 Dow Jones Reuters Business Interactive LLC (trading as Factiva). All rights reserved.

HD SPECIAL REPORT VENUES – Long-distance learning.

WC 1,530 words

PD 12 June 2003

SN Marketing Week

SC MKTW

PG 35

LA English

CY (c) 2003 Centaur Communications Limited or its licensors.

LP

The benefits to be had from taking staff out of the work environment for training and debriefings still far outweigh the financial outlay – especially as organisers are finding new ways to get the most from your money, says Ian Whiteling

There’s a major project on the horizon and staff need to be briefed and trained to make sure it’s delivered efficiently and effectively. Under the current tough economic conditions, there’s a tendency for companies to prepare staff for such a job in house or close to the workplace to avoid time away from the office and unnecessary expenditure. This seems a logical step to take, but is it really a false economy? Would staff respond better away from the distractions of the office?

IN

i8395411: Convention/Trade Shows | iadv: Advertising/Public Relations

NS

gtour: Travel | gcat: Political/General News

IPC

United Kingdom

PUB

Centaur Communications Ltd.

AN

Document MKTW000020030616dz6c00027

Here is another lengthy article that is clearly about how successful corporate training events can be. Note the indexing terms.


HD Business Day (South Africa) – Massive growth in industry.

BY By David Jackson.

WC 549 words

PD 30 April 2003

SN Business Day (South Africa)

SC MEWBUD

PG 15

LA English

CY (c) 2003 Chamber World Network International Ltd

LP

Massive growth in industry Events can link consumers to a brand or company, writes SUCCESSFUL mega events such as the 2003 Cricket World Cup, with the international television exposure they bring, have underlined the dramatic growth of sponsorships and events management as an essential ingredient in the overall marketing mix.

The events management industry has mushroomed in SA and the growth in sponsorship spend has increased dramatically over the past two decades.

NS

c31: Marketing | c32: Advertising | ccat: Corporate/Industrial News | gcat: Political/General News | gspo: Sports/Recreation | ncat: Content Types | nrgn: Routine General News

IPC

Company News | General News | Marketing | Sponsorship | Sports | 71 | 71121 | 711211 | South Africa | Sub-Saharan Africa

PUB

Financial Times Information Ltd

AN

Document MEWBUD0020030501dz4u0001z

This is not a very lengthy article but it appears to discuss one successful type of corporate event – that of using a sporting fixture to help with branding. Note the indexing terms.


HD Event Management – What it Means to Make your Conference or Seminar Delegate Driven.

BY By Glenn Baker.

WC 1,551 words

PD 2 April 2003

SN Management Magazine

SC MANAGM

PG 48

LA English

CY (c) 2003 Profile Publishing Limited

LP

Staging a successful conference or seminar involves more than just applying the latest technology. It’s about having a clear objective, an appropriate venue, expert help, and being delegate driven. Nothing motivates quite like a well-organised get-together involving key leaders. In his book Jack, former General Electric CEO Jack Welch highlights the importance that his Crotonville management seminars played in sharing ideas and catalysing the success of the multi-billion dollar company.

NS

C31: Marketing | C315: Conferences/Exhibitions | CCAT: Corporate/Industrial News | NCAT: Content Types | NEDC: Commentary/Opinion

RE

AUSNZ: Australia and New Zealand | NZ: New Zealand

RBBCM

HEADCNTY:NEW ZEALAND | INX:LOW | C::4I8395411:H::01 | C::3C31:L::04 | BIPAZ | TOTTED

AN

Document managm0020030412dz4200060

For the first time the index terms CORPORATE, CONFERENCES/EXHIBITIONS, MARKETING have been used for a lengthy article that appears to discuss how corporate events were used inside General Electric.


HD Event Measurement Conference Slated; Intensive One-day Event for Marketing Pros

WC 518 words

PD 25 October 2002

ET 00:06

SN Business Wire

SC BWR

LA English

CY (Copyright (c) 2002, Business Wire)

LP

FRAMINGHAM, Mass.–(BUSINESS WIRE)–Oct. 23, 2002–Successful Shows announces the Event Measurement Conference, to assist marketing professionals with the techniques and strategies to measure the performance of face-to-face marketing efforts. EMC will be held on Wednesday, Dec. 4 at the Sheraton at Woodbridge Place in Iselin, N.J.

Skip Cox, a faculty member and president of Exhibit Surveys Inc. of Red Bank, N.J., says the Event Measurement Conference is a high impact, full-day seminar for sales and marketing executives, trade show and event managers, marketing communications managers, and meeting and event planners to gain insight and hands-on tools so that they can make informed decisions about trade show and event marketing participation.

CO

EXHSUR: EXHIBIT SURVEYS INC

IN

IACC: Accounting/Consulting | IBCS: Business/Consumer Services | ICNSL: Consulting

NS

C31: Markets/Marketing | C315: Conferences/Exhibitions | CCAT: Corporate/Industrial News

RE

NAMZ: North American Countries | USA: United States | USE: Northeast U.S. | USNJ: United States – New Jersey

DJIC

NND | CLT | IAFR | BW | PREL | MRK | ENGL | USE | NJ | NME | US

DJID

Newswire End Code | Consulting Services | Business Services: Consulting Firms | Business Wire | Press Release Wires | Marketing | English language content | Northeast U.S. | New Jersey | North America | United States

AN

Document bwr0000020021024dyao00a73

This article, however, does not appear to be relevant, as it appears to be more about an event measurement conference for marketing managers than about corporate event management. Note the index terms are almost the same as those for corporate events in GE.


HD Stage Managed … or how to deliver an outstanding event.

BY By Claudia Tasker.

WC 3,030 words

PD 7 September 2002

SN Management Magazine

SC MANAGM

PG 34

LA English

CY (c) 2002 Profile Publishing Limited

LP

With a current growth rate of 16 percent annually, New Zealand’s convention and incentives industry has never looked healthier. But the trick is to be more than a statistic and deliver a memorable event. That takes good management. Here’s how to go about it. Not even the events of September 11 affected the figures and the flow of business personnel heading here to conference last year. They contributed more than $260 million to our economy.

NS

C31: Markets/Marketing | C315: Conferences/Exhibitions | CCAT: Corporate/Industrial News

RE

AUSNZ: Australia and New Zealand | NZ: New Zealand

RBBCM

HEADCNTY:NEW ZEALAND | INX:LOW | C::4I8395411:H::01 | C::3C31:L::03 | BIPAZ | TOTTED

AN

Document managm0020021107dy970002b

Similarly this is more about marketing event management companies rather than corporate event management. Again the same indexing terms have been used for this article as have been used for events for marketers and the GE story.


The indexing can be examined for consistency this way:

Article | Relevance | Indexing terms
1 | Relevant | Calendar of events, Content types, Routine general news
2 | Relevant | Convention/Trade Shows, Advertising/Public Relations
3 | Relevant | Marketing, Advertising, Corporate/Industrial News, Political/General News, Sports/Recreation, Content types, Routine General News
4 | Relevant | Marketing, Conferences/Exhibitions, Corporate/Industrial News, Content types, Commentary/Opinion
5 | Not relevant | Markets/Marketing, Conferences/Exhibitions, Corporate/Industrial News
6 | Not relevant | Markets/Marketing, Conferences/Exhibitions, Corporate/Industrial News

From this table you can see that 3 articles had common indexing terms, but 2 of those articles were not relevant. Of the relevant articles there was very little consistency in the indexing terms.
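The same comparison can be made mechanically. The sketch below measures the overlap between the term sets of the six articles using the Jaccard coefficient (shared terms divided by all terms used), which is my choice of measure rather than anything Factiva provides. Terms are compared literally after lower-casing, so "Marketing" and "Markets/Marketing" count as different terms.

```python
from itertools import combinations

# Term sets for the six example articles, in the order shown above.
articles = {
    1: ("relevant", {"calendar of events", "content types", "routine general news"}),
    2: ("relevant", {"convention/trade shows", "advertising/public relations"}),
    3: ("relevant", {"marketing", "advertising", "corporate/industrial news",
                     "political/general news", "sports/recreation",
                     "content types", "routine general news"}),
    4: ("relevant", {"marketing", "conferences/exhibitions",
                     "corporate/industrial news", "content types",
                     "commentary/opinion"}),
    5: ("not relevant", {"markets/marketing", "conferences/exhibitions",
                         "corporate/industrial news"}),
    6: ("not relevant", {"markets/marketing", "conferences/exhibitions",
                         "corporate/industrial news"}),
}

def jaccard(a, b):
    """Shared terms as a fraction of all terms used by either article."""
    return len(a & b) / len(a | b)

# Terms applied to every one of the relevant articles.
relevant_sets = [terms for label, terms in articles.values() if label == "relevant"]
shared_by_all_relevant = set.intersection(*relevant_sets)

# Pairwise overlap between every two articles.
overlap = {(i, j): jaccard(articles[i][1], articles[j][1])
           for i, j in combinations(articles, 2)}

print("terms shared by all relevant articles:", shared_by_all_relevant)
for pair, score in sorted(overlap.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(score, 2))
```

No single term is shared by all four relevant articles, while the two articles that were not relevant have identical term sets, so a search on the index terms alone could not separate them.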

So how useful is the Factiva intelligent indexing? In my experience when there is an indexing term that exactly matches the concept of a search – e.g. Government policies relating to the IT industry and specific countries – the Factiva indexing can be useful.
But in my view, although I believe Factiva is using both human indexing and automated indexing, Factiva has other tools that are much more useful to me given that most files in Factiva contain fulltext data. The indexing in my view suffers from the same problem that other indexes have – i.e. inconsistent application of indexing terms and too much reliance on indexing “terms” rather than indexing the “concepts”. In my view indexers need to ask the question “What is this article really about?” and then aim to index that concept consistently, rather than looking for potentially useful terms – such as the South African article above which attracted terms such as “Political/General News”.

It would be unwise to leave this section without commenting on Derwent. Patents are notoriously difficult to search and those who are not experienced may make the mistake, for example, of searching for a descriptor term such as REFILLABLE/DE AND PACKAGING/DE assuming those terms would be fairly safe to use. Wrong! The correct way would be to search for REFILL/DE AND (PACK OR PACKAGE)/DE. Experienced searchers would also aim to use codes as well as approved thesaurus terms.

Experiences using weighting tools and proximity operators and qualifiers versus indexes

While I may give Factiva 5/10 for indexing, I give it 10/10 for other search aids. In full text files one of the indexing challenges is to represent the significance of certain terms and concepts. In a bibliographical file, one would not normally use as an indexing term BHP for an article that describes a traffic accident that occurred outside BHP House. But when searching fulltext files simply inserting the term BHP will retrieve that article. So the challenge in searching fulltext files is to locate articles that are focussed on say BHP and not those articles that mention BHP in passing.

Factiva has retained most of the power of the DJI search capability by permitting searchers to qualify search terms to particular fields – e.g. Headline (title), Leading paragraph, Indexing fields and to use proximity operators EVENT ADJ MANAGEMENT or MANAGEMENT NEAR2 EVENT$1. The ADJ operator can be used with numbers 1-10, NEAR can be stretched to 500 characters, WITHIN (W/N) can be used for within the same sentence, and SAME can be used to mean within the same paragraph. But more than that, Factiva allows ATLEAST (1-50) which I find extremely useful. So I was able to set up a search string of:

(CONFERENCE$1 OR EVENT$1 OR MEETING$1) NEAR4 MANAGEMENT NEAR4 (CORPORATE OR BUSINESS OR COMPANY OR COMPANIES) AND (ATLEAST4 EVENT OR ATLEAST4 CONFERENCE$1 OR ATLEAST4 MEETING$1)

Here I am looking for articles that deal with these concepts:

CONFERENCE OR EVENT OR MEETINGS MANAGEMENT in the CORPORATE SECTOR or in a COMPANY or in COMPANIES or in the BUSINESS sector, and the articles should focus much more on events or conferences or meetings than on the other search terms.
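
The idea behind ATLEAST can be illustrated with a few lines of Python. This is only a minimal sketch of frequency thresholding under my own assumptions, not Factiva's implementation: the function names `count_term` and `atleast` are hypothetical, and the optional single trailing character is meant to imitate the `$1` truncation used in the search string above.

```python
import re

def count_term(text, stem):
    """Count whole-word matches of stem, allowing at most one extra
    trailing character (a rough imitation of $1 truncation)."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w?\b", re.IGNORECASE)
    return len(pattern.findall(text))

def atleast(text, stem, minimum):
    """Return True when the term occurs at least `minimum` times."""
    return count_term(text, stem) >= minimum

# An article that mentions the term in passing versus one that is focused on it.
passing = "A traffic accident occurred outside the hall before the event."
focused = ("The event drew crowds. Events like this event need event "
           "management; each event differs.")

print(atleast(passing, "event", 4))
print(atleast(focused, "event", 4))
```

A threshold like this is a crude stand-in for relevance, but it shows why a frequency test separates articles focused on a topic from those that merely mention it.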

Factiva also has brilliant mapping technology that enables the mapping of index terms taken from some files (e.g. Asia Pulse) to the terms used by Factiva.

Dialog also has some brilliant mapping technology that is especially useful when searching for information on drugs. One can map all terms relating to a chemical registry number for example and retrieve articles that refer to those chemicals as trade names or as chemical names or synonyms. The registry numbers are unique numbers and so clearly this powerful technology is enormously useful for high precision and high recall retrieval of anything to do with specific drugs.
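
As a rough illustration of this kind of mapping, the Python sketch below expands a registry number into an OR group of its known names. It is not how Dialog does it: the lookup table `REGISTRY_NAMES` and the function `expand_registry_number` are hypothetical, and the table is hand-built with a single entry (aspirin), whereas a real service relies on a large, maintained authority file.

```python
# Hypothetical, hand-built lookup table used only for illustration.
REGISTRY_NAMES = {
    "50-78-2": ["aspirin", "acetylsalicylic acid"],
}

def expand_registry_number(registry_number):
    """Build an OR group from a registry number and its known names.
    Multi-word names are quoted so they are searched as phrases."""
    names = REGISTRY_NAMES.get(registry_number, [])
    terms = [registry_number] + names
    quoted = ['"' + t + '"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

print(expand_registry_number("50-78-2"))
```

Because the registry number is unique, every synonym hangs off a single key, which is what makes high recall possible without sacrificing precision.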

Internet searching – Google, clustering search engines, the Invisible Web, metasearch engines

Having been an online searcher for so long, and having felt totally comfortable setting up quite complex search strategies in several systems, I experienced a certain dismay, perhaps shock, when in 1996 it became necessary to search the Internet. It seemed that anarchy and chaos surrounded me. No longer could I control the frequency of search terms. No longer could I qualify search terms to specific index fields.

But now, just a short 7 years on, it is fascinating to see the evolution of the search engines. Of course, like everyone else, I use Google, but I still use phrase searching, I still use Boolean operators and field qualifiers in my searches using the Advanced screen, and I find it intriguing to be able to limit searches, for instance, to PDF or PPT files in the EDU domains. How fabulous to be able to pop into a search box: BIOAVAILABILITY PHARMACOKINETICS FILETYPE:PPT SITE:EDU and find various explanations from US pharmacy educators in PowerPoint slides, or to replace the SITE:EDU with SITE:AC.UK and just as easily see UK explanations.
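
For anyone who runs such searches often, the query string is easy to assemble programmatically. The Python sketch below is a minimal example under my own assumptions: `build_query` is a hypothetical helper, and it simply joins the keywords with Google's `filetype:` and `site:` operators, written with no space after the colon.

```python
def build_query(terms, filetype=None, site=None):
    """Join keywords with optional filetype: and site: restrictions."""
    parts = list(terms)
    if filetype:
        parts.append("filetype:" + filetype)
    if site:
        parts.append("site:" + site)
    return " ".join(parts)

keywords = ["bioavailability", "pharmacokinetics"]

# US educational sites, then the UK academic domain.
print(build_query(keywords, filetype="ppt", site="edu"))
print(build_query(keywords, filetype="ppt", site="ac.uk"))
```

Swapping the `site` argument is all it takes to move the same search from one country's academic domain to another.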

But clearly there are exciting new developments on the horizon with the new search engines TEOMA and VIVISIMO. TEOMA has the fascinating features of providing suggested links and suggestions for narrowing the search. VIVISIMO on the other hand appears to have the full range of Boolean logic including nesting and it magically clusters the results into logical and conceptual groups.

It is a challenge, however, to find the wealth of material now buried in the Invisible Web. In their Power Searching with the Pros workshops, Mary Ellen Bates and Chris Sherman advise searchers to “Adopt a hunter mindset” and to be “opportunistic” to find useful material in the Invisible Web. Search engine boxes alone cannot find the wealth of material that is available via the Web but not indexed by the search engine spiders. And yet we cannot afford to ignore this rich store of information, estimated to be 50 times the size of the Visible Web, excluding the proprietary database services now delivered on a web platform. All sorts of techniques have to be used: find special library maintained portals, datamine bookmarks, run mini searches to find sites that may have databases buried in them, use Invisible Web pathfinders and so on.

Metasearch engines, in my view, often appear to be more powerful than they are. Often they do not have their own indexes. But they are useful for quick overview type searches and to find appropriate terminology.

Conclusion

My career as an information professional has now spanned over 40 years. I became a passionate searcher when I discovered research librarianship in special libraries 40 years ago. I saw my first online searching experiment in the winter of 1971, over 30 years ago, began producing and indexing a database in 1972, and obtained my first Dialog and Orbit passwords in 1975.

Indexing, if done well, continues even today to be extremely valuable. But if too slow, too finicky, or too elaborate, in my view it is useless.

My challenge to indexers is to remember your end-users and what they need – they need you:

to alert them quickly – this means indexing must not take too long – quality and speed must be balanced
to identify the key concepts in the item you are indexing – this means you must understand and evaluate the content you are indexing and weight your indexing in some way
to reveal how recent or how old the material is
to suggest the variety of terms that can be used to describe concepts – this means be a lateral thinker, think outside the square when selecting index terms
to indicate the type of data – this means use your expertise to reveal whether the article you are indexing is an opinion piece or rich in data such as statistics, its geographic scope (if relevant), whether it is dense or novel, and so on.
If, as indexers, you can achieve those goals, then to this day, despite Google and the other clever clustering search engines, the metasearch engines and the mapping, ranking and weighting of some services, and the intelligent, automated indexing, and despite any new innovations over the horizon, in a fulltext era the best, fast, consistent, human indexers are real treasures!

September 2003

Information Edge

www.infoedge.com.au

[1] It is now called EDGE, belongs to Information Edge and is available online via Informit and the MEDGE subset is available on Business Australia on Disc.
[2] 200 working days a year @7 hours per day assuming 100% productivity per day
[3] Standard Industrial Classification
