<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>4. A quick guide to running your first crawl job</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="wui.html" title="3. Web based user interface"><link rel="next" href="creating.html" title="5. Creating jobs and profiles"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">4. A quick guide to running your first crawl job</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="wui.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="creating.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tutorial"></a>4. A quick guide to running your first crawl job</h2></div></div></div><p>Once you've installed Heritrix and logged into the WUI (see above), you are presented with the web Console page. Near the top there is a row of tabs.</p><p><span class="bold"><strong>Step 1.</strong></span> <span class="emphasis"><em>Create a job</em></span></p><p>To create a new job, choose the Jobs tab; this will take you to the Jobs page. Once there, you are presented with three options for creating a new job. Select 'With defaults'. This will create a new job based on the default profile (see <a href="creating.html#profile">Section 5.2, “Profile”</a>).</p><p>On the next screen you will be asked to supply a name, a description and a seed list for the new job.</p><p>For the name, supply a short text containing no spaces and no special characters other than dash and underscore. 
You can skip the description if you like. In the seed list, type the URLs of the sites you are interested in harvesting, one URL per line.</p><p>Creating a job is covered in greater detail in <a href="creating.html" title="5. Creating jobs and profiles">Section 5, “Creating jobs and profiles”</a>.</p><p><span class="bold"><strong>Step 2.</strong></span> <span class="emphasis"><em>Configure the job</em></span></p><p>Once you've entered this information, you are ready to go to the configuration pages. Click the <span class="emphasis"><em>Modules</em></span> button in the row of buttons at the bottom of the page.</p><p>This will take you to the modules configuration page (more details in <a href="config.html#modules" title="6.1. Modules (Scope, Frontier, and Processors)">Section 6.1, “Modules (Scope, Frontier, and Processors)”</a>). For now we are only interested in the second option from the top, named <span class="bold"><strong>Select crawl scope</strong></span>. It allows you to specify the limits of the crawl. By default the crawl is limited to the domains that your seeds span. This may be suitable for your purposes. If not, you can choose a broad scope (not limited to the domains of its seeds) or the more restrictive host scope, which limits the crawl to the hosts that its seeds span. For more on scopes refer to <a href="config.html#scopes" title="6.1.1. Crawl Scope">Section 6.1.1, “Crawl Scope”</a>.</p><p>To change scopes, select the new one from the combobox and click the <span class="emphasis"><em>Change</em></span> button.</p><p>Next, turn your attention to the second row of tabs at the top of the page, below the usual tabs. You are currently on the far left tab. Now select the tab called <span class="emphasis"><em>Settings</em></span> near the middle of the row.</p><p>This takes you to the Settings page. It allows you to configure various details of the crawl. Exhaustive coverage of this page can be found in <a href="config.html#settings" title="6.3. 
Settings">Section 6.3, “Settings”</a>. For now we are only interested in the two settings under <span class="bold"><strong>http-headers</strong></span>. These are the <code class="literal">user-agent</code> and <code class="literal">from</code> fields of the HTTP headers in the crawler's requests. You must set them to valid values before a crawl can be run. In the current values, the parts that need replacing are written in upper case. If you have trouble with that, please refer to <a href="config.html#httpheaders" title="6.3.1.3. HTTP headers">Section 6.3.1.3, “HTTP headers”</a> for what is regarded as a valid value.</p><p>Once you've set the <span class="bold"><strong>http-headers</strong></span> settings to proper values (and made any other desired changes), you can click the <span class="emphasis"><em>Submit job</em></span> tab at the far right of the second row of tabs. The crawl job is now configured and ready to run.</p><p>Configuring a job is covered in greater detail in <a href="config.html" title="6. Configuring jobs and profiles">Section 6, “Configuring jobs and profiles”</a>.</p><p><span class="bold"><strong>Step 3.</strong></span> <span class="emphasis"><em>Run the job</em></span></p><p>Newly submitted jobs are placed in a queue of pending jobs. The crawler does not start processing jobs from this queue until it is started. While the crawler is stopped, jobs are simply held.</p><p>To start the crawler, click on the Console tab. Once on the Console page, you will find the option <span class="emphasis"><em>Start</em></span> at the top of the <span class="bold"><strong>Crawler Status</strong></span> box, just to the right of the indicator of current status. Clicking this option will put the crawler into <span class="emphasis"><em>Crawling Jobs</em></span> mode, where it will begin crawling the next pending job, such as the job you just created and configured.</p><p>The Console will update to display progress information about the ongoing crawl. 
Click the <span class="emphasis"><em>Refresh</em></span> option (or the top-left Heritrix logo) to update this information.</p><p>For more information about running a job see <a href="running.html" title="7. Running a job">Section 7, “Running a job”</a>.</p><p>Detailed information about evaluating the progress of a job can be found in <a href="analysis.html" title="8. Analysis of jobs">Section 8, “Analysis of jobs”</a>.</p></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="wui.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="creating.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">3. Web based user interface </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> 5. Creating jobs and profiles</td></tr></table></div></body></html>