<HTML><HEAD><TITLE>Chapter 32 -- How Intranet Search Tools and Spiders Work</TITLE><META></HEAD>
<BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#0000EE" VLINK="#551A8B" ALINK="#CE2910">
<H1><FONT SIZE=6 COLOR=#FF0000>Chapter 32</FONT></H1>
<H1><FONT SIZE=6 COLOR=#FF0000>How Intranet Search Tools and Spiders Work</FONT></H1>
<HR>
<P><CENTER><B><FONT SIZE=5><A NAME="CONTENTS">CONTENTS</A></FONT></B></CENTER>
<UL><LI><A HREF="#HowIntranetSearchToolsWork">How Intranet Search Tools Work</A></UL>
<HR>
<P>Corporate intranets can contain an almost unimaginable amount of information. Departments, divisions, and individuals create a wide variety of Web pages, for both internal and external consumption. Human resource information, personnel handbooks, procedures manuals, and newsletters are all posted internally. Databases, both those hosted directly on the intranet and "legacy" databases on non-TCP/IP systems, are available. Add to that all the information available over the Internet through the World Wide Web, and you have a serious case of information overload.
<P>There are several ways to help intranet users find the information they need. One way is to create subject directories of intranet data that present a highly structured way to find information. They let you browse through information by categories and subcategories, such as marketing, personnel, sales, research and development, budget, competitors, and so on. In a Web browser, you click on a category, and you are then presented with a series of subcategories, such as East Coast Sales, South Sales, Midwest Sales, and West Sales. Depending on the size of the subject directory, there may be several such layers of subcategories. When you reach the subcategory you're interested in, you are presented with a list of relevant documents. To get those documents, you click on links to them. On the Internet, Yahoo is the best-known, largest, and most popular subject directory.
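<P>As a rough illustration, not taken from any particular product, the category-and-subcategory browsing described above can be modeled as a nested mapping. The DIRECTORY data, the browse function, and the document URLs below are hypothetical names invented for this sketch; only the category names come from the text.
<PRE>
```python
# Hypothetical subject directory: category -> subcategory -> document links.
# The category names come from the text; the URLs are invented placeholders.
DIRECTORY = {
    "Sales": {
        "East Coast Sales": ["http://intranet.example/sales/east/forecast.htm"],
        "South Sales": [],
        "Midwest Sales": ["http://intranet.example/sales/midwest/report.htm"],
        "West Sales": [],
    },
    "Personnel": {
        "Handbooks": ["http://intranet.example/hr/handbook.htm"],
    },
}


def browse(directory, path):
    """Follow a sequence of category clicks and return what is listed next.

    Returns sorted subcategory names while still inside the hierarchy,
    or the list of document links once a leaf subcategory is reached.
    Raises KeyError if a name in the path does not exist.
    """
    node = directory
    for name in path:
        node = node[name]
    if isinstance(node, dict):
        return sorted(node)
    return list(node)


print(browse(DIRECTORY, []))
print(browse(DIRECTORY, ["Sales"]))
print(browse(DIRECTORY, ["Sales", "East Coast Sales"]))
```
</PRE>
<P>Each call to browse corresponds to one page of the directory: an empty path lists the top-level categories, and each additional name in the path stands for one more click down the hierarchy.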
<P>Another popular way of finding information, and probably the more useful one for intranets in the long run, is to use search engines, also called search tools. Search engines operate differently from subject directories. They are essentially massive databases that index all the information found on the intranet, and they can include information found on the Internet as well. Search engines don't present information in a hierarchical fashion. Instead, you search through them as you would a database, by typing in keywords that describe the information you want.
<P>Intranet search engines are usually built out of three components: an <I>agent</I>, <I>spider</I>, or <I>crawler</I> that crawls across the intranet gathering information; a <I>database</I>, which contains all the information the spiders gather; and a <I>search tool</I>, which people use as an interface to search through the database. The technology is similar to that of Internet search engines such as Alta Vista.
<P>Intranet search tools differ somewhat from their Internet equivalents. The database of information they search is not built only by agents and spiders searching Web-based pages. Agents can be written that go into existing corporate databases, extract data from them, and put it into the database of searchable information. People on an intranet can also fill out forms and submit their information to the database. Additionally, since these tools are built for a specific corporation and its data, the information they gather and the way they are searched can be customized.
<H2><A NAME="HowIntranetSearchToolsWork"><FONT SIZE=5 COLOR=#FF0000>How Intranet Search Tools Work</FONT></A></H2>
<P>Searching and cataloging tools, sometimes called search engines, can be used to help people find the information they need. Intranet search tools, such as agents, spiders, crawlers, and robots, are used to gather information about the documents available on an intranet.
These search tools are programs that search Web pages, extract the hypertext links on those pages, and automatically index the information they find to build a database. Each search engine has its own set of rules guiding how documents are gathered. Some follow every link on every page that they find, and then in turn examine every link on each of those new home pages, and so on. Some ignore links that lead to graphics files, sound files, and animation files; some ignore links to certain resources such as WAIS databases; and some are instructed to look primarily for the most popular home pages.
<OL>
<LI>Agents are the "smartest" of the tools. They can do more than just search out records: they can perform transactions on your behalf, which could eventually include finding and ordering the lowest-fare airline ticket for your vacation. Right now they can search sites for particular recordings and return a list of five sites, sorted by the lowest price first. Agents can cope with the context of the content. Agents can find and index other kinds of intranet resources, not just Web pages. They can also be programmed to extract records from legacy databases. Whatever information the agents index, they send back to the search engine's database.
<LI>General searchers are commonly known as spiders. Spiders report the content found. They index the information they find and extract summary information. They look at headers and at some of the links and send an index of the information to the search engine's database. There is some overlap between the tools: spiders can be robots, for example.
<LI>Crawlers look at headers and report first-layer links only. Crawlers can be spiders.
<LI>Robots can be programmed to go to various link depths, compile the index, and even test the links. Because of their nature, they can get stuck in loops, and they consume considerable Web resources going through the system.
There are methods available to keep robots from searching your site, such as a robots.txt exclusion file, which well-behaved robots honor.
<LI>Agents extract and index different kinds of information. Some, for example, index every single word in each document, while others index only the most important 100 words in each; some index the size of the document and the number of words in it; some index the title, headings and subheadings, and so on. The kind of index built determines what kind of searching can be done with the search engine, and how the information will be displayed.
<LI>Agents can also go out to the Internet and find information there to put in the search engine's database. Intranet administrators can decide which sites or kinds of sites the agents should visit and index, for example, competitors to the corporation or news sources. The information is indexed and sent to the search engine's database in the same way as information found on the intranet.
<LI>Individuals can put information into the index by filling out a form about the data they want included. That data is then put into the database.
<LI>When someone wants to find information available on the intranet, they visit a Web page and fill out a form detailing the information they're looking for. Keywords, dates, and other criteria can be used. The criteria in the search form must match the criteria used by the agents for indexing the information they found while crawling the intranet.
<LI>The database is searched, based on the information specified in the fill-out form, and a list of matching documents is prepared by the database. The database then applies a ranking algorithm to determine the order in which the list of documents will be displayed. Ideally, the documents most relevant to a user's query will be placed highest on the list. Different search engines use different ranking algorithms. The database then tags the ranked list of documents with HTML and returns it to the individual requesting it.
Different search engines also choose different ways of displaying the ranked list of documents: some just provide URLs; some show the URL as well as the first several sentences of the document; and some show the title of the document as well as the URL.
<LI>When you click on a link to one of the documents you're interested in, that document is retrieved from where it resides. The document itself is not in the database or on the search engine site.
</OL>
<HR>
<CENTER><P><A HREF="ch31.htm"><IMG SRC="PC.GIF" BORDER=0 HEIGHT=88 WIDTH=140></A><A HREF="#CONTENTS"><IMG SRC="CC.GIF" BORDER=0 HEIGHT=88 WIDTH=140></A><A HREF="contents.htm"><IMG SRC="HB.GIF" BORDER=0 HEIGHT=88 WIDTH=140></A><A HREF="ch33.htm"><IMG SRC="NC.GIF" BORDER=0 HEIGHT=88 WIDTH=140></A><HR WIDTH="100%"></P></CENTER>
</BODY></HTML>