Configuring crawl of public content – Google Search Appliance Getting the Most from Your Google Search Appliance User Manual

Page 18

Google Search Appliance: Getting the Most from Your Google Search Appliance

Crawling and Indexing

Configuring Crawl of Public Content

To configure a search appliance to crawl a content source, you specify top-level URLs and directory
addresses and links that the search appliance should follow by using the Content Sources > Web Crawl
> Start and Block URLs page in the Admin Console. In addition to specifying start URLs, you can also
specify URLs that the search appliance should not follow and crawl.

By default, the search appliance crawls in continuous crawl mode. This means that after the Google
Search Appliance creates the index, it always crawls content sources looking for new or modified
content and updates the index to ensure that it contains the freshest listings. The search appliance can
also crawl content according to a schedule.

Configure continuous crawl by performing the following steps with the Admin Console:

Specifying where to start the crawl by listing top-level URLs and directory addresses in the Start
URLs section on the Content Sources > Web Crawl > Start and Block URLs page, shown in the
following figure.

Specifying links for the search appliance to follow and index by listing patterns in the Follow
Patterns section.

Listing any URLs that you don’t want the search appliance to crawl in the Do Not Follow Patterns
section.

Saving the URL patterns.

After you save the URL patterns, the search appliance begins crawling in continuous mode.

If you prefer to have the search appliance crawl according to scheduled times, you must also perform
the additional following tasks by using the Content Sources > Web Crawl > Crawl Schedule page in
the Admin Console:

Selecting scheduled crawl mode.

Creating a crawl schedule.