Traversal, Feeds, Indexing – Google Search Appliance Planning for Search Appliance Installation User Manual

• Start URLs, which control where the crawl begins. All content must be reachable by following links
from one or more start URLs.

• Follow and Crawl URLs, which set the patterns of URLs that are crawled. Use follow and crawl URLs
to define the paths to pages and files you want crawled. If a URL in a crawled document links to a
document whose URL does not match a pattern defined as a follow and crawl URL, that document
is not crawled.

• Do Not Crawl URLs, which designate paths to pages and files you do not want crawled and file types
you do not want crawled.
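
The interaction between follow and crawl patterns and do not crawl patterns can be sketched as follows. This is a simplified illustration, not the search appliance's actual pattern syntax: it assumes a pattern is a plain URL prefix unless it ends in "$", in which case it is treated as a suffix, and it assumes a do not crawl match always wins. The function names and the example.com URLs are hypothetical.

```python
def matches(url: str, pattern: str) -> bool:
    """Simplified matching (an assumption for this sketch):
    a pattern ending in '$' is a suffix match, anything else is a prefix match."""
    if pattern.endswith("$"):
        return url.endswith(pattern[:-1])
    return url.startswith(pattern)


def should_crawl(url: str, follow: list[str], do_not_crawl: list[str]) -> bool:
    """Crawl a URL only if it matches a follow pattern and no do-not-crawl pattern."""
    if any(matches(url, p) for p in do_not_crawl):
        return False
    return any(matches(url, p) for p in follow)


# Hypothetical configuration for illustration only.
follow = ["http://intranet.example.com/"]
do_not_crawl = ["http://intranet.example.com/private/", ".zip$"]

for candidate in [
    "http://intranet.example.com/docs/a.html",
    "http://intranet.example.com/private/x.html",
    "http://intranet.example.com/docs/archive.zip",
    "http://other.example.com/",
]:
    print(candidate, should_crawl(candidate, follow, do_not_crawl))
```

Only the first candidate passes both checks. The others are rejected because they match a do not crawl pattern or match no follow pattern at all.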

If the search appliance is crawling a web site, the crawl software issues HTTP requests to retrieve
content files in the locations defined by the URLs and to retrieve files from links discovered in crawled
content. If the search appliance is crawling a file share, the crawl software uses the Server Message
Block (SMB) or Common Internet File System (CIFS) protocol to locate and retrieve the content files.
For more information on crawling, see Administering Crawl, which also includes checklists of
crawl-related tasks in the “Crawl Quick Reference.”
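
The requirement that content be reachable from a start URL can be illustrated with a small breadth-first crawl over an in-memory link graph. This is a minimal sketch under stated assumptions: the LINKS dictionary stands in for real HTTP fetching and link extraction, and the crawl function and all URLs are hypothetical, not part of the crawl software.

```python
from collections import deque

# Hypothetical site: each URL maps to the URLs that its page links to.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
    "http://example.com/orphan": [],  # no other page links here
}


def crawl(start_urls, fetch_links):
    """Visit every URL reachable from the start URLs, each one once."""
    seen = set(start_urls)
    queue = deque(start_urls)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        # fetch_links stands in for "issue an HTTP request and extract links".
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


crawled = crawl(["http://example.com/"], lambda url: LINKS.get(url, []))
print(crawled)
```

The page at http://example.com/orphan is never visited, because no crawled page links to it and it is not a start URL.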

Traversal

Traversal is the process by which the Google Search Appliance locates content to be indexed in a
content repository such as SharePoint or Lotus Notes. During traversal, a connector issues queries to
the repository to retrieve document data, which it then feeds to the Google Search Appliance for
indexing.
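
A traversal loop of this kind might look like the following sketch. It is not a real connector: the in-memory REPOSITORY list, the query_repository function, and the id-based checkpoint are all hypothetical stand-ins for whatever query interface a given repository exposes.

```python
from typing import Iterator

# Hypothetical repository contents, ordered by document id.
REPOSITORY = [
    {"id": 1, "title": "Budget", "body": "Quarterly budget figures"},
    {"id": 2, "title": "Roadmap", "body": "Product roadmap"},
    {"id": 3, "title": "Policy", "body": "Travel policy"},
]


def query_repository(after_id: int, batch_size: int) -> list[dict]:
    """Stand-in for a repository query: up to batch_size documents with id > after_id."""
    return [doc for doc in REPOSITORY if doc["id"] > after_id][:batch_size]


def traverse(batch_size: int = 2) -> Iterator[dict]:
    """Query the repository in batches and yield each document for feeding."""
    checkpoint = 0  # id of the last document already retrieved
    while True:
        batch = query_repository(checkpoint, batch_size)
        if not batch:
            return  # nothing newer than the checkpoint
        for doc in batch:
            yield doc
        checkpoint = batch[-1]["id"]


for doc in traverse():
    print(doc["id"], doc["title"])
```

Keeping a checkpoint lets the loop resume from the last retrieved document instead of querying the whole repository again.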

Feeds

Feeding is the process by which you direct content to the Google Search Appliance instead of having the
search appliance locate content. Feeding is a push process, in which the content files are pushed to the
Google Search Appliance. You can feed several types of content to a Google Search Appliance:

• A list of URLs
  The crawl software fetches the documents at the listed URLs.

• Content files
  The files and their URLs are fed to the search appliance.

• External metadata that is not stored in a relational database or where it is difficult to map the
  metadata to the content file

For more information on feeding, see the Feeds Protocol Developer’s Guide and the External Metadata
Indexing Guide.
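
To make the difference between a URL feed and a content feed concrete, the sketch below builds a small XML feed document in each style. The element and attribute names (gsafeed, header, datasource, feedtype, group, record, content) and the feedtype values are assumptions in this example and should be checked against the Feeds Protocol Developer’s Guide. The build_feed helper, the data source names, and the URLs are hypothetical, and nothing is sent to an appliance.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource: str, feedtype: str, records: list[dict]) -> str:
    """Build a feed document as a string (element names are assumptions)."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for rec in records:
        record = ET.SubElement(group, "record", url=rec["url"], mimetype=rec["mimetype"])
        # A content feed carries the document body; a URL feed carries only the URL.
        if "content" in rec:
            ET.SubElement(record, "content").text = rec["content"]
    return ET.tostring(root, encoding="unicode")


# URL feed: the crawl software would fetch the listed documents itself.
url_feed = build_feed(
    "sample_urls",
    "metadata-and-url",
    [{"url": "http://example.com/a.html", "mimetype": "text/html"}],
)

# Content feed: the document body is pushed together with its URL.
content_feed = build_feed(
    "sample_content",
    "incremental",
    [{"url": "http://example.com/b.txt", "mimetype": "text/plain", "content": "Hello"}],
)

print(url_feed)
print(content_feed)
```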

Indexing

Indexing is the process of adding the content from the crawled documents to the index.

After a file is retrieved by the crawl, the file is converted to an HTML file and submitted for indexing. The
indexing process extracts the full text from each content file, breaks down the text, and adds both the
text and information such as date and page rank to the index so that users’ search requests can be
satisfied. The index and the HTML versions of each indexed file are stored on the search appliance.
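
The idea of extracting text, breaking it down, and storing it alongside information such as a date can be illustrated with a toy inverted index. This sketch does not reflect how the search appliance builds its index. The tokenizer, the build_index and search functions, and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Break text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict) -> tuple[dict, dict]:
    """Map each term to the set of URLs containing it, and keep per-URL metadata."""
    index = defaultdict(set)
    metadata = {}
    for url, doc in documents.items():
        metadata[url] = {"date": doc["date"]}
        for term in tokenize(doc["text"]):
            index[term].add(url)
    return index, metadata


def search(index: dict, query: str) -> set:
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical documents for illustration only.
documents = {
    "http://example.com/a": {"text": "Quarterly budget report", "date": "2010-01-15"},
    "http://example.com/b": {"text": "Budget planning guide", "date": "2010-02-01"},
}
index, metadata = build_index(documents)
print(search(index, "budget"))
print(search(index, "budget report"))
```

Because each term maps directly to the documents that contain it, a search request can be answered by looking up the query terms and intersecting the results, without rereading the documents.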