
What file sizes can be indexed, What content locations can be crawled or traversed – Google Search Appliance Planning for Search Appliance Installation User Manual


The Google Search Appliance cannot index text contained in graphic file formats, such as JPEG, GIF, or
TIFF. When a file in a graphic format is submitted for indexing, text embedded in the graphic is not
indexed. However, the file name is indexed. If any metadata is associated with the graphic in an HTML
meta tag, that metadata is also indexed.

Certain file formats are excluded from the crawl by default on the search appliance Admin Console.
When you configure the crawl, ensure that the field for excluded URLs and file formats correctly reflects
the file types you do not want crawled and indexed.
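
When reviewing a list of candidate URLs before configuring the crawl, it can help to separate files whose text can be indexed from graphic files, for which only the file name and any HTML meta tag metadata are indexed. The following Python sketch illustrates that split. The extension set and the function names are assumptions made for illustration; the sketch does not reproduce the appliance's own exclusion pattern syntax.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: file extensions standing in for the graphic formats
# named above (JPEG, GIF, TIFF).
GRAPHIC_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".tif", ".tiff"}


def is_graphic(url: str) -> bool:
    """Return True if the URL path ends in a graphic file extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in GRAPHIC_EXTENSIONS


def split_by_text_indexability(urls):
    """Split URLs into (text can be indexed, file name only)."""
    text_indexable, name_only = [], []
    for url in urls:
        if is_graphic(url):
            name_only.append(url)
        else:
            text_indexable.append(url)
    return text_indexable, name_only
```

URLs that land in the second list are candidates for the excluded file formats field if indexing the file name alone has no value for your users.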

What File Sizes Can Be Indexed?

By default, the search appliance indexes up to 2.5MB of each text or HTML document, including
documents that have been truncated or converted to HTML. After indexing, the search appliance caches
the indexed portion of the document and discards the rest. You can change the default by entering a
new amount of up to 10MB.

To change the default amount, use the Crawl and Index > Index Settings page in the Admin Console.
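
The effect of the size limit can be sketched in Python as follows. This is an illustration only, not the appliance's implementation: it assumes that 1MB means 1024 x 1024 bytes and that the limit is applied to the document as a byte string, and the names used are hypothetical.

```python
# Assumption: 1MB is taken to be 1024 * 1024 bytes.
BYTES_PER_MB = 1024 * 1024
DEFAULT_LIMIT_BYTES = int(2.5 * BYTES_PER_MB)  # default of 2.5MB
MAX_LIMIT_BYTES = 10 * BYTES_PER_MB            # largest configurable amount


def indexed_portion(document: bytes,
                    limit_bytes: int = DEFAULT_LIMIT_BYTES) -> bytes:
    """Return the leading part of the document that would be indexed.

    Anything beyond limit_bytes is discarded, mirroring the behavior
    described above.
    """
    if not 0 < limit_bytes <= MAX_LIMIT_BYTES:
        raise ValueError("limit must be greater than 0 and at most 10MB")
    return document[:limit_bytes]
```

Under these assumptions, a 3,000,000-byte document is cut to 2,621,440 bytes with the default setting, while a document smaller than the limit is kept whole.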

What Content Locations Can Be Crawled or Traversed?

The Google Search Appliance can crawl files located on an intranet or a web site.

If you install a connector, the Google Search Appliance can also traverse content located in a content
repository, such as FileNet or Documentum. For more information, read Introducing Connectors, the
Google Connector Developer’s Guide, and the configuration documents for the different connectors.

Content on a web site is crawled using the HTTP or HTTPS protocol.

Content on an intranet is crawled using the SMB or CIFS protocol. Intranet files are typically stored in a
Windows shared directory or in a web-enabled virtual directory. See the Windows Help system for
information on creating a shared directory. You can create a virtual directory in several ways:

By using the Virtual Directory Creation Wizard of Internet Information Services (IIS)

By importing a configuration file

By using the iisvdir.vbs script

By using the Apache web server to enable directory browsing

For more information on creating virtual directories, see the Windows Help system.

Content files can also be located on Macintosh, UNIX, or Linux computers on an intranet. On Macintosh
computers, use the CIFS protocol. On UNIX or Linux computers, you can web-enable the file locations
and use HTTP or HTTPS for crawling, or you can use the SMB protocol without web-enabling the
locations.
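
The mapping between a content location and the protocol used to crawl it can be summarized in a short Python sketch. It assumes that file shares are written as smb:// URLs, which this section does not state, and the function name and descriptions are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: file shares are addressed with smb:// URLs, while
# web-enabled locations use http:// or https://.
CRAWL_METHODS = {
    "http": "web crawl over HTTP",
    "https": "web crawl over HTTPS",
    "smb": "file share crawl over SMB or CIFS",
}


def crawl_method(url: str):
    """Return a description of how the URL would be crawled.

    Returns None for schemes this section does not cover.
    """
    scheme = urlparse(url).scheme.lower()
    return CRAWL_METHODS.get(scheme)
```

For example, a UNIX directory published through a web server would be listed with an http:// or https:// URL, while the same directory exported as a share would be listed with an smb:// URL.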

If a file is in a location that requires a password for access, whether on an intranet or a web site, you
must provide a user ID and password for the location on the Crawler Access page of the Admin Console.