Microsoft Office documents, PDF documents, HTML documents – Google Search Appliance Deployment Governance and Operational Models User Manual
These guidelines deal mostly with the indexing of additional metadata along with document content.
Metadata can enrich content in the GSA’s index. For just one example of the type of metadata attributes
that can be added to enrich document classification, refer to t
provides standards for a base set of text fields that can describe a resource.
Content owners can follow a few document creation principles that help distinguish and uniquely
identify their documents. This matters because enterprise documents lack much of the defined URL
structure that is useful for determining relationships on the World Wide Web. For this reason, unique
document identifiers are significant in an enterprise corpus.
Microsoft Office documents
In the properties of a Microsoft Office document, populate the following fields:
● Author
● Title
● Comments
If many documents originate from templates, stress the importance of changing the default document
properties so that documents do not all carry the template's values.
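As a rough illustration of how such a check could be automated, the sketch below reads the core properties of an Office Open XML file (.docx, .xlsx, .pptx) and reports which of the three recommended fields are empty. It assumes the file stores its properties in docProps/core.xml, where Comments is held in the dc:description element; legacy binary formats such as .doc are not covered. The function name missing_properties is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Office property name -> element in core.xml (Comments is stored as dc:description).
FIELDS = {"Author": "dc:creator", "Title": "dc:title", "Comments": "dc:description"}


def missing_properties(docx_file):
    """Return the names of recommended properties that are empty or absent.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    missing = []
    for name, tag in FIELDS.items():
        element = root.find(tag, NS)
        # Treat whitespace-only values as unpopulated.
        if element is None or not (element.text or "").strip():
            missing.append(name)
    return missing
```

Running a check like this over a shared folder before crawling is one way to find documents that still carry blank properties. Catching unchanged template values would need an extra comparison against the template's own properties.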
PDF documents
In the properties of a PDF document, populate the following fields:
● Author
● Title
● Creator
If the document is not authored directly by a PDF publishing tool, make sure these properties are
populated in the source document before it is converted to a PDF.
HTML documents
In an HTML document, include the following information:
●
● description (handwritten or programmatically generated): be sure to write quality descriptions,
even though they won’t be displayed on the page itself
Be sure to differentiate descriptions of different pages. To restrict indexing of certain page content, use
googleoff tags.
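The following sketch shows one way to check these two points. It assumes the description is carried in a meta tag with name="description", and that excluded content is wrapped in googleoff/googleon comments of the form used in the sample page. The sample page, class name and function names are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: the navigation block is wrapped in googleoff/googleon
# comments so that its text is kept out of the index.
SAMPLE_PAGE = """<html>
<head>
<meta name="description" content="Quarterly travel policy for field staff.">
</head>
<body>
<!--googleoff: index-->
<div>Site navigation that should not be indexed</div>
<!--googleon: index-->
<p>Policy text that should be indexed.</p>
</body>
</html>"""


class DescriptionFinder(HTMLParser):
    """Collect the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = (attributes.get("content") or "").strip()


def find_description(html_text):
    """Return the page's meta description, or None if it has none."""
    finder = DescriptionFinder()
    finder.feed(html_text)
    return finder.description


def duplicate_descriptions(pages):
    """Map each description shared by more than one page to those page names."""
    seen = {}
    for name, html_text in pages.items():
        seen.setdefault(find_description(html_text), []).append(name)
    return {desc: names for desc, names in seen.items() if len(names) > 1}
```

Calling duplicate_descriptions with a dictionary of page names and HTML strings lists the pages whose descriptions are not differentiated. Pages with no description at all are grouped under None.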
Document feeds
The GSA feed protocol allows for indexing of metadata along with the underlying content. In the
document feed, include any additional metadata that is not present in the content itself and that can
enrich the indexed content.
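As a minimal sketch of what this can look like, the code below builds a one-record feed in which the metadata travels alongside the content. The element names (gsafeed, header, datasource, feedtype, group, record, metadata, meta, content) follow the feeds protocol as commonly documented, but should be checked against the feed DTD for the appliance version in use. The datasource name, URL and metadata values are hypothetical, and the DOCTYPE declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource, url, content, metadata):
    """Return feed XML for a single record carrying extra metadata."""
    feed = ET.Element("gsafeed")

    # Header identifies the data source and the kind of feed.
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "incremental"

    # One record: the document URL, its metadata, and its content.
    group = ET.SubElement(feed, "group")
    record = ET.SubElement(group, "record", url=url, mimetype="text/plain")
    meta_block = ET.SubElement(record, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta_block, "meta", name=name, content=value)
    ET.SubElement(record, "content").text = content

    return ET.tostring(feed, encoding="unicode")


feed_xml = build_feed(
    "hr_docs",                                   # hypothetical datasource
    "http://intranet.example.com/policy/42",     # hypothetical document URL
    "Policy text",
    {"department": "HR", "doc_id": "POL-42"},    # metadata absent from the content
)
```

Here doc_id stands in for the kind of unique document identifier discussed earlier, which the content itself may not contain.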