Crawling Under the Hood

Danny Sullivan. Online. Volume 23, Issue 3. May/Jun 1999.

Anyone who owns a car knows that from time to time you need to pop the hood and see how things are working underneath. The same is true with search engines. There are a number of “under-the-hood” issues that affect the quality of results they return. Yet, it’s difficult for the casual observer to define the technology that powers each engine and, at the same time, keep track of new enhancements. Additionally, where do the services’ priorities lie? Are they aware of the professional searcher population and do they plan enhancements appropriate for a professional’s search style and expertise?

To help, this article will take you on a tour of the major under-the-hood issues relating to crawler-based search services. We’ll be looking at topics such as size, freshness, coping with changing Web technology, and the challenge of pleasing professionals.

Does Size Matter?

The general Web public got a rude awakening last year to the fact that search engines fall well short of indexing every available page on the Web. The results of a study published in Science magazine (S.R. Lawrence and C.L. Giles, “Searching the World Wide Web.” Science 280 (April 1998): pp. 98-100) showed that the best search engine (HotBot) covered only 34 percent of the 320 million pages the study estimated to exist, or roughly 110 million pages, while the worst (Lycos) covered only 3 percent, fewer than 10 million.

This wasn’t a new discovery. Web publishers have long known that search engines don’t gather everything from their sites. Further, anyone who knows simple math could compare the search engines’ reported sizes to much higher estimates of the Web’s size to see the gap. Nevertheless, the article in Science went off like a bombshell. Well over a hundred stories followed in the general media. As a result, it reawakened interest in just how big each of the search engines is, and how you might miss something by using a small one.

This is a valid concern. The more pages a search engine indexes, the more likely it will provide comprehensive results in response to specific queries or to those on obscure topics. Since these are the types of queries professionals are inclined to make, there is a tendency among them to assume that bigger is better. But for the average user, size may make little difference in broad queries such as “travel” or “cars.” This audience, which is far larger than the professional audience, is crying out for relevancy over comprehensiveness.

“When you are talking to a librarian or professional, they want recall. They want to find everything. They want every remote reference to the subject, and they will decide what’s relevant. But to the vast majority of people on the Web, that’s not the problem. Precision is the issue,” said Louis Monier, AltaVista’s chief technical officer. Audience reaction bears Monier’s comment out. The Science article made a big splash, but users have continued to make Lycos a leading search service despite its weak showing in the study.

The interest in relevancy over comprehensiveness is a leading reason why most of the search services haven’t made a bigger effort to substantially increase index size. Most of the leading services say that doing so will cost money and consume resources they’d rather invest in improving results. “Our system itself could scale to any size. It’s really a matter of the benefit to the consumer in relation to the cost,” said Kris Carpenter, Excite’s search product manager. Her statement was fairly typical of the other leading services.

There’s validity to putting relevancy over size, but some growth in search engine indexes is overdue. The Web has been growing and will continue to do so. Yet over the past year, only AltaVista, Inktomi, and Northern Light have increased their sizes significantly. Not surprisingly, these are well-known favorites among professionals. They all have self-reported sizes that exceed 100 million Web pages.

Northern Light says it has ambitious plans to increase its index size, but the other front-runners don’t plan to double or triple their indexes as has been done in the past. They certainly don’t plan to list all the 320 million Web pages that the Science study estimated to exist.

“Will we see a 200 million Web page index in 1999? Probably,” said AltaVista’s Monier, whose service had a self-reported size of 150 million pages when this was written, “but don’t expect a big jump beyond that. My goal is really to have a more useful index.”

Currently, Inktomi powers several services, including HotBot and MSN Search, and has swapped crowns twice with AltaVista for the title of biggest search engine. Its index currently stands at 110 million, but the company has no plans to launch a new size offensive and reacquire the bragging rights to being biggest. “We’re not just interested in being able to advertise ourself as the biggest. We want to be the best,” said Troy Toman, Inktomi’s director of search services. So while the service says it could hit 200 million Web pages if it wanted to, that is not something its customers are demanding. Instead, Inktomi is concentrating its efforts on improving relevancy and adding new features. “We’re spending most of our time figuring out how to deliver the best results,” Toman said.

In contrast, Northern Light says it plans to continue increasing its index substantially. Throughout 1998, the service ran third in the size race behind AltaVista and Inktomi. Early this year, it overtook Inktomi with a self-reported size of 120 million Web pages.

The service further claims it would rank number one if self-reported numbers were audited. As support, it points to the long-standing survey of database size done by Greg Notess (www.notess.com/search/). In January 1999, Notess found Northern Light provided more comprehensive results than either AltaVista or Inktomi.

Concerns over inflated self-reported numbers may become moot if Northern Light’s own self-reported estimate exceeds AltaVista’s, as it expects will happen. The company says it is aiming to be both biggest and best. “We’re not completely changing our story. We’re still focused on providing the best results,” said Marc Krellenstein, Northern Light’s engineering director. “We’ve always focused on growing to get larger and larger. Our vision is to have everything, Web and non-Web.”

Up-and-coming service Google also aims to be a size leader. The service is currently at 60 million pages, and cofounder Larry Page wants to go much higher. Page won’t specify exactly how high, but he gives every indication that he’d like to raise the benchmark well above the 100 million mark that currently separates the large search engines from the smaller ones. “We want to have the most comprehensive, highest quality search that is available,” he said.

Infoseek is one of the smaller services, having stayed at the 30 million Web page mark for over two years. But that’s beginning to change. The service increased its size to 45 million pages in concert with the January launch of its Go portal, and it plans to carry on with significant growth.

“You always hear AltaVista and HotBot say they are the biggest, and we definitely want to go after that,” said Jennifer Mullin, Infoseek’s director of search. “We’ve just developed some different technology in house, and we’ve been able to invest in the scaling issues.” So the index will grow, but Infoseek also echoes what all the others say: relevancy will be the top priority. “It’s quality first. We don’t want to flood the index with useless pages,” Mullin said.

Like Infoseek, Lycos has been one of the smaller services. It stayed at the 30 million Web page mark for about two years. It also suffered from placing last in the Science article estimates of search engine size. Also like Infoseek, Lycos has recently increased its size to a self-reported 50 million Web pages. But unlike Infoseek, it is not aiming to match or overtake the size leaders. “I don’t think we are going to grow significantly more beyond. That’s not our strategy,” said Lycos product manager Rajive Mathur. “We don’t necessarily believe a large index is a better index.”

Instead, Lycos says it will try to be more comprehensive by developing specialty search services. It’s a strategy that Excite is similarly pursuing. With a self-reported 50 million page index, Excite had occupied the middle ground between the search engine size leaders and the smaller services for over a year. Now Infoseek and Lycos have drawn level with it, but Excite has no plans to greatly enlarge its general index.

“We’re not hearing from our consumers that there is not enough information in this particular body,” said Excite’s Carpenter. “What we are hearing from them is ‘How can you give me a more specialized body?’”

In response, Excite is considering new indexes that catalog audio/video content, contain pages divided by language, or cover other subject areas. That would not only provide more depth by topic, it would also leave more room in the main index for pages outside those topics.

Squeeze for Freshness?

It’s great to have a large index, but what good is it if the information is out-of-date? Search engines need to refresh their indexes and purge dead links to ensure their holdings are fresh and up-to-date.

In general, all the major services say they refresh their listings at least once per month. Many of them also update portions of their indexes more often. For example, each day AltaVista adds any new pages that are submitted and eliminates any dead links that a special “scrubber” spider has discovered. So some information within the index may only be a day old. AltaVista also runs a spider that revisits all the pages in its index and looks for new ones. The new and updated information from this spider is added to the index about every two to four weeks. Thus, in a worst-case scenario, information might be a month old.

Other services operate differently, but all of them say that their information should be no more than a month old. That’s acceptable, in my opinion. Most of the information on the Web is not date-sensitive. News stories are the exception, and when a searcher needs news-related material, they should turn to a news-specific service like Excite’s NewsTracker rather than a general-purpose search engine. These services crawl only news sites and update their indexes on a daily basis.

Freshness wouldn’t be an issue of concern if everything worked as promised. But it doesn’t. For example, Northern Light, Inktomi, and Lycos all had serious freshness problems in 1998. Northern Light stopped spidering in the early part of the year, while Inktomi stopped adding new pages to its index for over a month in the latter half of the year. The Lycos problem was probably the most severe: as late as November 1998, it was still using an index created from July 1998 crawls.

Other search services may have had freshness problems that went unnoticed. That’s the problem with freshness. The average user can’t squeeze a search engine like a loaf of bread to see if it’s stale or not. There’s no easy way for most people to determine freshness, but one helpful step is for a search engine to report the date each page was indexed. Some, such as AltaVista, Infoseek, Northern Light, and HotBot, already do this. Dates give a quick impression of how old an index is.

Dates aren’t a panacea, however. Some Web servers fail to report page modification dates, or they may report incorrect dates, such as those set in the future. In these cases, search engines usually default to using the date a page was spidered. Still, having some type of date information would go a long way toward helping users know whether an index is fresh or stale.

While all the search engines aim to have indexes no more than a month old, several of them work to keep some information much fresher than that. Excite uses a two-tiered strategy, where the most popular pages on the Web are updated on a weekly basis, while other pages are updated every two or three weeks.

Lycos also wants to target popular pages more frequently than others. It intends to use its WiseWire system, which allows users to vote on pages that they like, as a source for discovering URLs that should be spidered on a daily basis.

Infoseek takes a slightly different tack. It looks to content rather than popularity to determine how often its spider should revisit some sites. Those that change often, such as news or convention sites, get more frequent visits. Similarly, Inktomi plans to have its spiders learn how frequently a page changes, so that revisits happen as needed. Both services also update their indexes on a daily basis with any new finds or updated information, as does Northern Light. So in a rolling manner, all portions of their indexes are refreshed over the course of a month.
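None of the services has published its scheduling rules, but the adaptive idea behind “revisit as necessary” is simple enough to sketch. What follows is a minimal illustration in Python, not Inktomi’s or Infoseek’s actual code; the checksum test, the doubling-and-halving rule, and the one-day and thirty-day bounds are all assumptions chosen for clarity.

from datetime import datetime, timedelta
from hashlib import md5

MIN_INTERVAL = timedelta(days=1)    # illustrative floor for busy, news-style pages
MAX_INTERVAL = timedelta(days=30)   # illustrative ceiling: the "month old" worst case

class PageRecord:
    """What a crawler might remember about one URL between visits."""
    def __init__(self, url):
        self.url = url
        self.checksum = None                  # fingerprint of the last copy fetched
        self.interval = timedelta(days=7)     # starting guess
        self.next_visit = datetime.utcnow()

def record_visit(page, new_content):
    """Shorten the revisit interval when a page has changed, lengthen it when it hasn't."""
    new_checksum = md5(new_content.encode("utf-8")).hexdigest()
    if page.checksum is not None and new_checksum == page.checksum:
        page.interval = min(page.interval * 2, MAX_INTERVAL)    # unchanged: back off
    else:
        page.interval = max(page.interval / 2, MIN_INTERVAL)    # changed: come back sooner
    page.checksum = new_checksum
    page.next_visit = datetime.utcnow() + page.interval

page = PageRecord("http://www.website.com/")
record_visit(page, "<html>first crawl</html>")
record_visit(page, "<html>first crawl</html>")    # unchanged, so the interval stretches out

A real crawler would layer this on top of its submission queues and dead-link scrubbing, but the back-off pattern is the heart of matching revisits to how often a page changes.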

With all of these services, the index refresh isn’t intelligent in the sense that a spider compares the page to a copy within the index to see if the page has changed. AltaVista wants to change this.

In a new system it plans to launch this year, AltaVista would keep pages in its index in sync with those on the Web. A synchronization spider would look for changes, and only when they were spotted would a content spider be sent to retrieve a page. AltaVista’s Monier thinks this would allow his search engine to greatly increase its freshness by concentrating the spidering process on only the pages that require attention.
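Monier didn’t describe how the synchronization spider would detect changes, but standard HTTP already offers a cheap way to ask: a HEAD request returns a page’s headers, including its Last-Modified date, without the page itself. The sketch below shows one plausible approach under that assumption; it is not AltaVista’s implementation, and as noted earlier, servers that report no date, or a wrong one, force a full fetch anyway.

import urllib.error
import urllib.request

def page_has_changed(url, last_modified_seen):
    """Probe a page cheaply with an HTTP HEAD request and compare Last-Modified dates.
    Returns (changed, last_modified); if the server reports no date, assume the page
    changed and let the content spider take a fresh copy."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            last_modified = response.headers.get("Last-Modified")
    except urllib.error.URLError:
        return True, None                     # unreachable: flag the page for attention
    if last_modified is None or last_modified != last_modified_seen:
        return True, last_modified            # changed (or unknown): send the content spider
    return False, last_modified               # unchanged: leave the index entry alone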

Coping with New Technology

While index size has been under the microscope, few are focusing on what may become a bigger problem: the growth of the “invisible” Web. These are pages that remain hidden from search engine spiders.

Frames

Frames are a classic example. Most search engines do not know how to crawl frames-based sites. As a result, important content goes unindexed. It is not uncommon to see a frames-based site of 100 pages or more represented in some search engines by only its home page. Furthermore, the only content indexed from the home page is often “This site requires frames.”

AltaVista and Northern Light both understand frames, and so they are likely to have a slightly more representative collection of pages from across the Web than other services. Unfortunately, they haven’t taken the additional step of reestablishing the original context of the frames. In other words, people may be taken from the listings onto a page designed to be viewed within a frame. When seen outside the frame, the page may not make sense and navigational links may not be present.

The other search services have only minor interest in improving their frames crawling capabilities. Some may spider frames-based sites on a case-by-case basis, but overall, solving the frames problem is not a top priority.

Dynamic Pages

Dynamically delivered pages present a similar barrier to spiders. These are pages that typically live within a database. The database holds the page’s main body copy, along with page headers, footers, and other common elements. When users click on a link, the database assembles the various pieces and delivers the finished product as one seamless Web page.

The hallmark of a dynamic Web page is the presence of a “?” in the URL. A typical link might look like this:

http://www.website.com/cgi-bin/getpage.cgi?name=sitemap

Most search engines will not read past the “?”, which prevents the page from being indexed.

Search engines have a good reason for shunning these pages. They want to avoid what’s known as a “spider trap,” where they might be fed the same page thousands of times, under slightly different URLs. Since the “?” is a hallmark for a dynamic delivery situation, search engines use it as a stop sign warning them to go no further. But as with frames-based sites, this means that some sites using dynamic delivery options may have few or no pages represented within the search engines.
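In code, the stop-sign rule amounts to a one-line test on the URL, plus whatever case-by-case exceptions a service negotiates. The sketch below is illustrative only; the whitelisted host is made up, and no engine has published its actual filter.

from urllib.parse import urlparse

# Hypothetical list of sites a service has agreed to crawl despite the "?";
# the exceptions described above are handled case by case.
DYNAMIC_WHITELIST = {"www.trusted-publisher.example"}

def should_crawl(url):
    """Apply the question-mark stop sign: skip query-string URLs unless whitelisted."""
    parsed = urlparse(url)
    if not parsed.query:
        return True                                # ordinary, static-looking URL
    return parsed.netloc in DYNAMIC_WHITELIST      # dynamic URL: crawl only by arrangement

# The example URL above would be skipped by default:
# should_crawl("http://www.website.com/cgi-bin/getpage.cgi?name=sitemap") -> False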

Dynamic delivery is a growing problem because databases are increasingly being used as Web authoring tools, and the search engines have no wide-ranging solution to the problem, though most are now spidering selected sites on a case-by-case basis. “We’re definitely concerned about this, and one of the things we have done with publishers is to create back doors into their content,” said Excite’s Kris Carpenter. Other search engines had similar comments on the situation.

Google has an advantage here. It is designed to associate words around links with the pages that the links point to. So even if it can’t visit a dynamically delivered page, it may still know information about it based on the links to that page from other pages. It will also use this data to find pages that are so popular that they should be spidered regardless of any general bans against visiting dynamically generated pages.
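Google has not spelled out its pipeline here, but the bookkeeping behind anchor-text association is easy to picture: credit the words in a link to the page the link points at. A rough sketch, with made-up URLs:

from collections import defaultdict

# Maps a target URL to the words other pages use when linking to it.
anchor_index = defaultdict(list)

def index_link(source_url, target_url, anchor_text):
    """Credit the target page with the words used in links pointing at it.
    Even if target_url is a dynamic page the spider never fetches, a query for
    these words can still surface it. source_url is kept for popularity counts."""
    anchor_index[target_url].extend(anchor_text.lower().split())

# A review page linking to a dynamically delivered site map:
index_link("http://www.example.com/reviews.html",
           "http://www.website.com/cgi-bin/getpage.cgi?name=sitemap",
           "complete site map for Website.com")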

XML

XML, widely billed as the successor to HTML, is another technology that search engines have not caught up with yet. The developers of XML believe that it will improve Web searching by establishing a new framework for delivering metadata. Database vendors and some Web site operators are already saying that the XML framework has made things easier for their internal searching needs. But the search engines are taking a wait-and-see approach to the XML tagging effort. All of them say they will support XML, but they offer a chorus of “ifs.” They’ll support the tags if standards emerge, if the tags are commonly used, and most of all, if they feel they can trust the data.

Let’s deal with standards first. One proposal is that the existing Dublin Core attributes would be transformed into an XML framework. Documents could be labeled by author, by publisher, by date, and with other metadata. In turn, this would allow for the type of fielded searches that research professionals are craving.

Of course, we’ve already had a standard for years, which the search engines haven’t supported. Dublin Core attributes can already be associated with documents in plain old HTML. But very few Web authors use these tags, so the search engines have ignored them.
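For readers who haven’t seen them, Dublin Core fields sit in ordinary meta tags, and pulling them out takes only a few lines of code. The sample document and values below are invented for illustration; nothing here reflects how any particular engine would implement support.

from html.parser import HTMLParser

SAMPLE = """<html><head>
<meta name="DC.creator" content="Jane Author">
<meta name="DC.publisher" content="Example Press">
<meta name="DC.date" content="1999-05-01">
<title>A Sample Document</title>
</head><body>...</body></html>"""

class DublinCoreReader(HTMLParser):
    """Collect Dublin Core fields from ordinary HTML meta tags."""
    def __init__(self):
        super().__init__()
        self.fields = {}
    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if name.lower().startswith("dc."):
            self.fields[name[3:].lower()] = attrs.get("content", "")

reader = DublinCoreReader()
reader.feed(SAMPLE)
print(reader.fields)   # {'creator': 'Jane Author', 'publisher': 'Example Press', 'date': '1999-05-01'}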

Nor is it likely Web authors will use new tags unless search engines already support them. The most popular metatags currently used are the meta keywords and description tags. Almost all the major search engines support these, so many Web authors consider it worth the extra effort to use them. Since Dublin Core isn’t supported, and offers no ranking benefits, the majority of Web authors don’t bother with those tags.

“I think nobody is using them because they don’t bring an immediate benefit,” said AltaVista’s Louis Monier. “This kind of thing will only take off if there is some type of advantage.”

So it’s a chicken-and-egg situation. The search engines need to make a move to encourage Web authors, but they won’t do so because the Web authors themselves aren’t showing support for the tags. Moreover, what Web authors have repeatedly shown is that search engines cannot trust the metadata that authors provide. Many will miscategorize or inaccurately describe their pages, if they feel it will earn them a top spot in search results.

Even inadvertent miscategorization can occur. For example, my beta version of Word 2000 labels HTML documents with the date the page was created and with me as the author. But if my wife uses my computer and creates an HTML file, I’ll still be listed as the page author. And if my computer’s date is incorrect, the date tag will also be incorrect.

Overall, XML holds out promise, but it is unlikely to become reality in 1999.

Catering to Professionals

There are certain features that professional drivers want in a car that ordinary drivers wouldn’t consider. Likewise, professional researchers have features they want from search engines. To discover the pro searchers’ needs, I posted queries to the popular Web4Lib mailing list inviting feedback.

A top plea was for the ability to use Boolean commands and have them work consistently from one service to the next. It’s a strange plea because there’s already a great deal of consistency between the services. All of the major search services, with the exception of Infoseek and Google, support AND, OR, NOT, and nested searching. The main difference is that Excite requires the use of AND NOT rather than NOT. A more relevant concern is that the services may process nested queries differently.

The desire to use Boolean commands is popular because it is a format that many Web searchers are already familiar with. But with Web-based search engines, the use of + and - to require and exclude words, and quotation marks to specify phrases, is in my opinion a better way to search than constructing Boolean queries. A query such as +hotels +“San Francisco” -hostels, for example, requires both terms, treats the city name as a phrase, and excludes pages that mention hostels. Moreover, these commands enjoy universal support among the major search engines, except Google.

Datasearch president Susan Feldman, an expert on both traditional research databases and Web search engines, holds a similar view: Boolean is not necessarily the best technique for Web searching. “Most professionals have been trained on a Boolean system. They’ve raised Boolean queries to a high art, and they feel because they don’t understand how the search engines work, what they are doing is more advanced than throwing words into a search box,” she said.

Feldman is referring to the behind-the-scenes efforts search engines are already making to improve relevancy. For example, enter a string of words and most search engines will naturally try to find them in close proximity to each other. This eliminates the need to specify a proximity command like NEAR. Likewise, do a search at AltaVista or Google, and they will automatically try to detect phrases in your queries and give you pages that contain those phrases. By entering a complex Boolean command, you are searching in a way the search engines are not designed for. This is not to say that Web search engines are less advanced than traditional databases. Feldman explains that Boolean queries were originally used with traditional services because those systems lacked the processing power needed to handle the natural language-style queries that today’s Web search engines accept. (See Susan Feldman’s article on Natural Language Processing in this issue.)
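To picture what this implicit proximity handling might look like, consider a toy scoring rule that rewards documents where two query terms appear near each other. The eight-word window and the all-or-nothing bonus are inventions for illustration, not any engine’s published formula.

def proximity_bonus(positions_a, positions_b, window=8):
    """Reward documents where two query terms occur close together.
    positions_a and positions_b are the word offsets of each term in one document;
    the window stands in for the implicit NEAR that engines apply without being asked."""
    best_gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 if best_gap <= window else 0.0

# "search" at word 4 and "engines" at word 5 sit side by side, so the bonus applies:
print(proximity_bonus([4, 40], [5, 90]))   # 1.0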

Another top desire among professionals was for more field search options, such as the ability to search by author. The problem here is that to have field searching, you need defined fields in your documents. Measures like Dublin Core are intended to do this, but as discussed earlier, there are a variety of implementation problems. As a result, field searching will likely remain restricted to things that the search engines can trust, such as the ability to search by page title, URL, or site. But not all search engines offer these options, and commands can differ between services. For example, restricting a search by domain at Inktomi is done using the “domain:” command, while AltaVista uses the “host:” format and Infoseek uses the “site:” command.
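Professionals who search several services therefore end up translating the same restriction by hand. Here is a small sketch of that translation using the three field prefixes just mentioned; the query and site name are made up, and each service’s exact argument form may differ.

# Field prefixes the services use for restricting a search to one site.
SITE_FIELD = {
    "HotBot (Inktomi)": "domain:",
    "AltaVista": "host:",
    "Infoseek": "site:",
}

def site_restricted_query(terms, site):
    """Build the same site-restricted search in each service's syntax."""
    return {service: f"{terms} {prefix}{site}"
            for service, prefix in SITE_FIELD.items()}

print(site_restricted_query("annual report", "www.website.com"))
# {'HotBot (Inktomi)': 'annual report domain:www.website.com',
#  'AltaVista': 'annual report host:www.website.com',
#  'Infoseek': 'annual report site:www.website.com'}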

It’s not a difficult task to offer new basic commands or make field operators consistent. These developments haven’t happened because no one is pushing to unify them. Consequently, I started a search engine standards effort at the time of this writing, which hopefully will produce more consistency between services.

Saved search functionality was another item on the wish list from professionals, and you can definitely expect to see this appear on several services in 1999. It shows every indication of becoming a standard feature on all of them, too, since once one service adds a new feature, the rest follow suit.

Revamping Relevancy

The common refrain from search engines is that their top priority is improving relevancy, and they’ve been very active on this front. The professional might not notice their efforts because most of the work has been aimed toward providing more relevant answers in response to the broad queries that average users make.

“We’re trying to take a big step with the problem of the vague query, the one or two word queries, where the text on the page has nothing to do with it,” said AltaVista’s Louis Monier. “The good answer to the query ‘car’ has nothing to do with the text. I’d rather use a medium with a crystal ball.”

Relevancy without looking at the text? Traditionally, search engines have relied on the location and frequency of search terms to determine relevancy. So far, no one (aside from Google) has dropped that method as dramatically as Monier suggests. But there has been a growing trend toward finding other factors beyond the words on the page to add to the relevancy mix. For instance, Direct Hit is a system that measures what pages users visit from search results. Pages that are actually visited gain rank, while those that are bypassed drop.
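Direct Hit has not published its formula, but the principle can be sketched as a blend of a text-relevancy score with an observed click-through rate. Everything below, from the 0.3 weight to the function names, is an assumption made for illustration.

def reorder_by_clicks(results, click_counts, impression_counts, weight=0.3):
    """Blend a text-relevancy score with an observed click-through rate.
    results is a list of (url, text_score) pairs with text_score in [0, 1];
    the 0.3 weight is arbitrary, not Direct Hit's number."""
    def blended(item):
        url, text_score = item
        impressions = impression_counts.get(url, 0)
        ctr = click_counts.get(url, 0) / impressions if impressions else 0.0
        return (1 - weight) * text_score + weight * ctr
    return sorted(results, key=blended, reverse=True)

results = [("http://a.example/", 0.9), ("http://b.example/", 0.7)]
print(reorder_by_clicks(results,
                        {"http://b.example/": 80},
                        {"http://a.example/": 100, "http://b.example/": 100}))
# The heavily clicked page overtakes the one searchers bypass.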

HotBot has been using Direct Hit data to provide users with alternatives to its search results, and by the time this appears, the service will probably have integrated Direct Hit data into its normal results. That means that the top pages you’re presented with will appear in part because HotBot’s general audience considers them to be the best pages on a topic.

In January 1999, Direct Hit was to debut a personalized search technology. This opens the door for search results that are refined in part by personal profiles. For example, a registered user living in the United Kingdom and searching for football would get different results than someone living in the United States. Those results would be influenced by the choices that all registered UK users make.

Lycos is considering something similar to Direct Hit. The company has a machine-compiled directory of Web pages categorized by topic. User feedback is used to rank sites within the categories. The company may use these ratings to help influence matches within its crawler-based search results.

Link data is also being used more. Pages with many links pointing at them, or links pointing at them from important sites, should rank better than those with fewer links. Links are seen as votes of quality. Google uses link data as a core element of its relevancy algorithm. Infoseek has been using it as part of its algorithm, as do Excite and AltaVista, to a lesser degree.
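None of these companies discloses its exact formula, but the “votes of quality” idea reduces, at its simplest, to weighted link counting. The sketch below is far cruder than Google’s actual algorithm; the site weights and URLs are invented.

def link_score(target_url, inbound_links, site_importance):
    """Score a page by who links to it: each inbound link is a vote,
    and votes from important sites count for more."""
    return sum(site_importance.get(site, 1.0)
               for site in inbound_links.get(target_url, set()))

links = {"http://www.website.com/": {"www.bigportal.example", "www.smallsite.example"}}
importance = {"www.bigportal.example": 5.0}
print(link_score("http://www.website.com/", links, importance))   # 6.0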

Human-compiled directories have also seen a comeback, a lesson learned from the popularity of Yahoo!, which remains by far the leading search service. Yahoo! depends on human editors to categorize Web sites.

LookSmart, a long-time Yahoo! competitor, has been adding to its listing staff and has made co-branding deals with AltaVista and HotBot. Snap is another directory competitor that has recently emerged.

Infoseek has greatly expanded its directory, and sites listed in the directory also get a boost in its crawler-based results. The theory is that if they’ve categorized a site within their rather selective directory, then it probably deserves a boost in the ranking algorithm. The system has made a noticeable improvement to Infoseek’s results.

Know Your Engine

The major search engines may be concentrating some of their enhancement efforts on the vague one- or two-word queries of the novice user, but for the information professional, knowing a little proper search syntax will go a long way, and that includes knowing when not to use Boolean. The bottom line, though, lies in understanding search engine capabilities. It’s not unlike knowing the pricing structures, output options, and content availability of the traditional information services. Understanding the inner workings and mechanical tendencies of the various search engines and crawlers, and keeping abreast of their enhancements, especially their handling of frames, dynamic pages, and XML, will enable every searcher to home in on the appropriate service and retrieve the best results. So pop the hood, take a look around, and go for a drive.