Defining the xml record for a document, Full feeds and incremental feeds – Google Search Appliance Feeds Protocol Developers Guide User Manual

Google Search Appliance: Feeds Protocol Developer’s Guide

To ensure that the search appliance does not crawl a previously fed document, use googleoff/googleon
tags (see “Excluding Unwanted Text from the Index” in Administering Crawl) or robots.txt (see “Using
robots.txt to Control Access to a Content Server” in Administering Crawl).

To update the document, you need to feed the updated document to the search appliance. Documents
fed with web feeds, including metadata-and-url feeds, are recrawled periodically, based on the crawl
settings for the search appliance.

Note: The metadata-and-url feed type is one way to provide metadata to the search appliance. A
connector can also provide metadata to the search appliance. See “Content Feed and Metadata-and-
URL Feed” in the Connector Developer’s Guide. See also the External Metadata Indexing Guide for
information about external metadata.

Full Feeds and Incremental Feeds

Incremental feeds generally require fewer system resources than full feeds. A large feed can often be
crawled more efficiently if it is divided into smaller incremental feeds.

The following example illustrates the effect of a full feed:

1. Create a new data source by pushing a feed that contains documents D0, D1, and D2. The system
serves D0, D1, and D2.

2. Using the same data source name, push a full feed that contains documents D0, an updated D1,
and a new D3. When the feed processing is complete, the system serves D0, the updated D1, and
the new D3. Because document D2 was not defined in the full feed, it is removed from the index.

The following example mixes full and incremental feeds:

1. Create a new data source by pushing a feed that contains documents D0, D1, and D2. The system
serves D0, D1, and D2.

2. Push an incremental feed that defines the following actions: “add” for D3, “add” for an updated D1,
and “delete” for D2. The system serves D0, the updated D1, and D3. D0 was pushed by the first feed;
because the incremental feed does not reference it, D0’s contents remain in the search results.

3. Push a full feed that contains documents D0, D7, and D10. The system serves D0, D7, and D10 when
the full feed processing is complete. D1 and D3 are not referenced in the full feed, so the system
removes them from the index and does not add them back.
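
The sequence above can be traced with a small in-memory model. This is only an illustrative sketch, not how the search appliance is implemented: the function names and the "v1"/"v2" content markers are hypothetical, and the index for one data source is modeled as a plain dictionary.

```python
def apply_full_feed(index, documents):
    """A full feed replaces everything previously fed for the data source."""
    # Documents that are absent from the full feed are dropped from the index.
    return dict(documents)


def apply_incremental_feed(index, actions):
    """An incremental feed changes only the records it names."""
    updated = dict(index)
    for action, doc_id, content in actions:
        if action == "add":
            # "add" covers both new documents and updates to existing ones.
            updated[doc_id] = content
        elif action == "delete":
            updated.pop(doc_id, None)
        else:
            raise ValueError(f"unknown action: {action}")
    return updated


# First feed: creates the data source with D0, D1, and D2.
index = apply_full_feed({}, {"D0": "v1", "D1": "v1", "D2": "v1"})

# Incremental feed: add D3, update D1, delete D2. D0 is left untouched.
index = apply_incremental_feed(
    index,
    [("add", "D3", "v1"), ("add", "D1", "v2"), ("delete", "D2", None)],
)
after_incremental = dict(index)

# Full feed: only D0, D7, and D10 survive; D1 and D3 are removed.
index = apply_full_feed(index, {"D0": "v1", "D7": "v1", "D10": "v1"})
after_full = dict(index)
```

In this model a full feed ignores the previous state entirely, while an incremental feed starts from it, which is the difference the two walkthroughs describe.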

Defining the XML Record for a Document

You include documents in your feed by defining them inside a record element. All records must specify
a URL which is used as the unique identifier for the document. If the original document doesn’t have a
URL, but has some other unique identifier, you must map the document to a unique URL in order to
identify it in the feed.
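
One possible way to map a non-URL identifier to a unique URL is sketched below. The base URL, the host name docs.example.com, and the function name are assumptions made for illustration; the guide does not prescribe a particular mapping scheme, only that each record's URL be unique and use a fully qualified domain name.

```python
from urllib.parse import quote, urlsplit

# Hypothetical base URL; the host part must be a fully qualified domain name.
BASE_URL = "http://docs.example.com/records/"


def record_url(doc_id: str) -> str:
    """Map a non-URL identifier to a stable, unique URL for the feed."""
    # quote() percent-encodes characters that are unsafe in a URL path.
    # safe="" also encodes "/", so distinct identifiers cannot collide.
    return BASE_URL + quote(doc_id, safe="")


url_a = record_url("invoice 2024/017")
url_b = record_url("invoice 2024-017")
host = urlsplit(url_a).hostname
```

Because the mapping is deterministic, feeding the same document again produces the same URL, so the record updates the existing entry instead of creating a duplicate.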

Each record element can specify the following attributes:

• url (required): The URL is the unique identifier for the document. This is the URL used by the
search appliance when crawling and indexing the document. All URLs must contain a FQDN (fully
qualified domain name) in the host part of the URL. Because the URL is provided as part of an XML
document, you must escape any special characters that are reserved in XML. For example, the URL
http://www.mydomain.com/bar?a=1&b2 contains an ampersand character and should be
rewritten to http://www.mydomain.com/bar?a=1&amp;b2.
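
The escaping of reserved XML characters in a url attribute can be sketched with Python's standard library. This is a minimal illustration, not part of the feeds protocol itself; it shows both manual escaping and letting an XML library escape the attribute value during serialization.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_url = "http://www.mydomain.com/bar?a=1&b2"

# Manual escaping: escape() replaces &, <, and > with XML entities.
escaped_url = escape(raw_url)

# Alternatively, pass the raw URL to an XML library and let it escape
# the attribute value when the element is serialized.
record = ET.Element("record", {"url": raw_url})
record_xml = ET.tostring(record, encoding="unicode")

# Parsing the serialized record recovers the original, unescaped URL.
round_tripped = ET.fromstring(record_xml).get("url")
```

Use one approach or the other, not both: passing an already escaped URL to the XML library would escape the ampersand a second time.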