Google Search Appliance: Getting the Most from Your Google Search Appliance

Chapter 4

Crawling and Indexing

After the Google Search Appliance has been set up (see “Setting Up a Search Appliance” on page 12), you
can configure the search appliance to crawl the content sources that you identified during the planning
phase, as described in “Planning” on page 10.

Crawl is the process by which the Google Search Appliance discovers enterprise content and creates a
master index. The resulting index consists of all of the words, phrases, and metadata in the crawled
documents. When users search for information, their queries are executed against the index rather
than the actual documents. Searches against content that is already indexed are not interrupted,
even as new content continues to be indexed.
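
The idea of running queries against an index rather than against the documents themselves can be illustrated with a small inverted index. The following is a minimal sketch of the general technique, not the search appliance's actual implementation; the function names (tokenize, build_index, search) and the sample documents are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query.

    Only the index is consulted; the original documents are never read.
    """
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample content, standing in for crawled documents.
documents = {
    "faq.html": "Frequently asked questions about employee benefits",
    "policy.html": "Employee policies and vacation rules",
}

index = build_index(documents)
print(sorted(search(index, "employee benefits")))  # prints ['faq.html']
```

Because lookups touch only the index, documents can be added to it over time while queries continue to be answered from what has already been indexed.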

The Google Search Appliance can crawl:

• Public content (see “Crawling Public Content” on page 16)
• Controlled-access content (see “Crawling and Serving Controlled-Access Content” on page 19)

The Google Search Appliance is also capable of indexing:

• Content in non-web repositories, such as content management systems (see “Indexing Content in
  Non-Web Repositories” on page 22)
• Hard-to-find content, such as content that cannot be found through links on crawled web pages
  (see “Indexing Content in Non-Web Repositories” on page 22)
• Database content (see “Indexing Database Content” on page 27)

This section briefly describes how the Google Search Appliance indexes each type of content.

Crawling Public Content

Public content is not restricted in any way; users don’t need credentials to view it. Some of the most
common forms of public content include:

• Employee portals
• Frequently Asked Questions
• Employee policies
• Benefits information