
Distributed crawling overview – Google Search Appliance Configuring Distributed Crawling and Serving version 6.14 and later User Manual


Distributed Crawling Overview

In the following diagram, four search appliances are configured with distributed crawling. Each search
appliance is designated as a particular shard in the distributed crawling configuration. Shard 0 is the
master search appliance. The shard number is incremented by 1 for each additional search appliance in
the configuration. The distributed crawling configuration is created on the master and the settings are
exported in a configuration file. The configuration file is uploaded to Shard 1, Shard 2, and Shard 3. After
the configuration file is uploaded, all search appliance features are configured on the master. The
indexes on all of the nodes are synchronized when the master node takes control of the non-master
nodes. The crawl is distributed among the search appliances and a single index is created. Each search
appliance is considered a primary (non-replica) search appliance. All of the search appliances can serve
results. The results for a search query will be identical regardless of which search appliance serves the
results.
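
The shard numbering described above can be sketched as a small model. This is only an illustration, not part of the search appliance software: the `Shard` class, the `assign_shards` helper, and the hostnames are hypothetical names introduced here. It assumes nothing beyond what the text states, namely that Shard 0 is the master and that the configuration file exported from the master is uploaded to every other shard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    number: int    # shard number; 0 is the master, as described in the text
    hostname: str  # hypothetical hostname, used only for illustration

    @property
    def is_master(self) -> bool:
        return self.number == 0


def assign_shards(hostnames):
    """Number the appliances 0, 1, 2, ... in the order given; the first is the master."""
    return [Shard(number=i, hostname=h) for i, h in enumerate(hostnames)]


shards = assign_shards(["gsa-a", "gsa-b", "gsa-c", "gsa-d"])
master = shards[0]

# The configuration file exported from the master is uploaded to all other shards.
upload_targets = [s.number for s in shards if not s.is_master]
print(f"master: shard {master.number} ({master.hostname})")
print(f"upload configuration file to shards: {upload_targets}")
```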

After the distributed crawl configuration is set up, the four search appliances behave as if they are a
single search appliance. Crawling, serving, collections, front ends, and other features are configured on
Shard 0, the master node of the configuration. Feeds are sent only to the admin master. The crawl
process is automatically distributed among the four search appliances. Any of the nodes can serve
results. Each search appliance in the distributed crawl configuration communicates with all of the other
search appliances. The diagram does not show every connection between the search
appliances.
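
Because each search appliance communicates with all of the others, the number of appliance-to-appliance connections grows as n(n-1)/2. The following sketch enumerates the pairs for the four-appliance example; the `mesh_links` helper is a hypothetical name used only for this illustration and says nothing about how the appliances actually implement their connections.

```python
from itertools import combinations


def mesh_links(shard_numbers):
    """Return every unordered pair of shards, one entry per connection."""
    return list(combinations(sorted(shard_numbers), 2))


links = mesh_links([0, 1, 2, 3])
print(f"{len(links)} connections:")
for a, b in links:
    print(f"  shard {a} <-> shard {b}")
```

For four appliances this yields six connections, which is why a diagram may reasonably omit some of them.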

After the configuration is set up, you can add nodes in the Admin Console, and the index is
automatically redistributed among the existing and new nodes. To delete nodes, disable
distributed crawling and serving, reset the index on each search appliance, reconfigure
distributed crawling and serving, and then reindex the content.
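
The order of operations for deleting nodes can be written out as a checklist. The sketch below only generates the ordered steps as text for an operator to follow in the Admin Console; it does not call any search appliance API. The `removal_plan` helper and the appliance names are hypothetical.

```python
def removal_plan(appliances):
    """Build the ordered checklist for deleting nodes, following the text above.

    `appliances` is the list of search appliances whose indexes are reset
    (hypothetical names; the manual says "each search appliance").
    """
    plan = ["Disable distributed crawling and serving"]
    plan += [f"Reset the index on {name}" for name in appliances]
    plan.append("Reconfigure distributed crawling and serving")
    plan.append("Reindex the content")
    return plan


plan = removal_plan(["shard-0", "shard-1", "shard-2", "shard-3"])
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step}")
```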