11 am Tuesday session
I have handouts for this presentation, and the slides will be on the conference website in the coming weeks. This session was presented by someone from Yale, who recently implemented the search appliance.
Google won’t budge much on their price. "Hey, we’re Yale. Cut us a deal." "Hey you’re Yale, you can afford it."
The appliance lets you create collections, which are defined by groups of URLs. For instance, you could gather the URLs of HR-related sites into a collection, then let users narrow their search to only that collection. This helps weed out irrelevant links.
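To make the idea of a collection concrete, here is a minimal sketch of grouping URLs by prefix. It is not how the appliance is configured (that happens in its admin interface); the collection names, hosts, and the collections_for helper are all made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical collections: name -> URL prefixes (hosts and paths are invented).
COLLECTIONS = {
    "hr": ["www.example.edu/hr/", "benefits.example.edu/"],
    "library": ["library.example.edu/"],
}


def collections_for(url):
    """Return the sorted names of the collections whose prefixes match the URL."""
    parts = urlparse(url)
    target = parts.netloc + parts.path
    return sorted(
        name
        for name, prefixes in COLLECTIONS.items()
        if any(target.startswith(prefix) for prefix in prefixes)
    )
```

A search restricted to the "hr" collection would then only consider pages for which collections_for returns a list containing "hr".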
Search results can be customized with XSLT to create a branded front end. This goes well beyond putting your logo on the page. For example, you can use XSLT to include an icon indicating file type (PDF, Word, etc.).
KeyMatches let you promote specific web pages on your site. This works much like the ad links on Google’s site. To create a KeyMatch, you provide a trigger and say whether it should match as a word, a phrase, or the exact query; the promoted result is returned whenever the trigger matches.
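The three match types are easier to see side by side. The sketch below is only my reading of the word / phrase / exact distinction, not the appliance's actual matching logic; the rule format, the RULES list, and the keymatch_hits function are assumptions for illustration.

```python
# Hypothetical rules: (trigger, kind, promoted URL). All URLs are invented.
RULES = [
    ("parking", "keyword", "http://www.example.edu/parking"),
    ("course catalog", "phrase", "http://www.example.edu/catalog"),
    ("hr", "exact", "http://www.example.edu/hr"),
]


def keymatch_hits(query, rules):
    """Return the promoted URLs whose rule matches the query, in rule order."""
    words = query.lower().split()
    hits = []
    for trigger, kind, url in rules:
        t_words = trigger.lower().split()
        if kind == "keyword":
            # The trigger word appears anywhere in the query.
            matched = " ".join(t_words) in words
        elif kind == "phrase":
            # The trigger words appear together, in order, inside the query.
            n = len(t_words)
            matched = n > 0 and any(
                words[i:i + n] == t_words for i in range(len(words) - n + 1)
            )
        elif kind == "exact":
            # The whole query is the trigger and nothing else.
            matched = words == t_words
        else:
            raise ValueError("unknown match kind: %r" % kind)
        if matched:
            hits.append(url)
    return hits
```

With these rules, a search for "hr forms" promotes nothing, because the "hr" rule only fires when the query is exactly "hr".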
OneBox servers deliver relevant, real-time results from certain third-party systems. This works like typing "weather" into Google and getting a weather report. One of the third parties is Blackboard, which means you should be able to use the campus search to find a course catalog.
Server status reports include information on the status of web crawls, such as where the crawler got 404s. This helps you identify broken links or areas that were denied because of authentication. You also have access to search information, so you can watch for search patterns to further refine search results and websites.
Tips and tricks
Google operators (limiting to site, show all pages linking to url, etc)
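As a rough illustration of the operators, the sketch below builds a search URL whose query includes site: and link: restrictions. The host name, the /search path, and the search_url helper are placeholders of mine, not something shown in the session.

```python
from urllib.parse import urlencode


def search_url(base, terms, site=None, link=None):
    """Build a search URL whose query combines free text with operators."""
    parts = []
    if terms:
        parts.append(terms)
    if site:
        # Limit results to one site.
        parts.append("site:" + site)
    if link:
        # Show pages that link to the given URL.
        parts.append("link:" + link)
    return base + "?" + urlencode({"q": " ".join(parts)})


BASE = "http://search.example.edu/search"  # placeholder address
print(search_url(BASE, "handbook", site="hr.example.edu"))
print(search_url(BASE, "", link="www.example.edu/hr/"))
```

The same operators can of course be typed straight into the search box; building the URL is only useful when linking to a canned search from another page.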
Use metadata in your pages and docs. Adding a "date" meta tag lets the GSA sort documents by that date rather than by the file's last-modified date.
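If pages come out of a template or script, the tag can be generated along with the rest of the head. This is a small sketch; the ISO-style date format is my assumption, so check which formats your appliance version accepts. The date_meta_tag helper is hypothetical.

```python
from datetime import date


def date_meta_tag(published):
    """Render a "date" meta tag for a page published on the given date."""
    # Assumption: a YYYY-MM-DD value is acceptable to the indexer.
    return '<meta name="date" content="%s">' % published.isoformat()


print(date_meta_tag(date(2007, 10, 11)))
```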
Integration into apps.
- Offload the overhead of indexing in custom apps onto the GSA
- Send feeds to GSA to index content
- Query databases and index results
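To give a feel for what sending a feed involves, here is a sketch that turns a few records (for example, rows pulled from an application database) into a feed document. The element names (gsafeed, header, datasource, feedtype, group, record, content) are written from my memory of the feeds documentation, not from the session, so treat them as assumptions and verify them for your version. The build_feed helper and the sample URLs are invented, and actually posting the feed to the appliance is left out.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource, records, feedtype="incremental"):
    """Build a feed document from (url, text) pairs and return it as a string."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for url, text in records:
        record = ET.SubElement(group, "record", url=url, mimetype="text/plain")
        ET.SubElement(record, "content").text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical rows, as they might come back from a database query.
rows = [
    ("http://apps.example.edu/events/1", "Fall open house"),
    ("http://apps.example.edu/events/2", "Benefits fair"),
]
print(build_feed("events_app", rows))
```

The point is that the custom application only has to describe its content; the crawling and indexing work moves to the appliance.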
Search protected content while maintaining the protection.
Monitor activity through web panel
- basic operations
- certain dynamic content generates black holes. For instance, the GSA gets stuck in an endless crawl loop on a site
- Example: calendar application with no end, GSA could crawl to the year 5000!
- drives up license cost
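The calendar black hole comes down to a missing stopping rule. As a sketch of the kind of exclusion that avoids it, the check below refuses calendar URLs dated too far from the current year. This is not the appliance's own configuration, and the "year" query parameter, the MAX_YEARS_AHEAD cutoff, and the should_crawl function are all assumptions for illustration.

```python
from urllib.parse import urlparse, parse_qs

MAX_YEARS_AHEAD = 2  # hypothetical cutoff


def should_crawl(url, current_year):
    """Return False for URLs whose "year" parameter is far from current_year."""
    query = parse_qs(urlparse(url).query)
    for value in query.get("year", []):
        if value.isdigit() and abs(int(value) - current_year) > MAX_YEARS_AHEAD:
            return False
    return True


print(should_crawl("http://cal.example.edu/view?year=2008&month=3", 2007))
print(should_crawl("http://cal.example.edu/view?year=5000&month=1", 2007))
```

In practice the same effect would come from an exclusion pattern on the crawler side or from the calendar application not emitting "next month" links forever.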
Managers can be assigned to particular parts to distribute maintenance.
Yale has been running it for just over a year.
Bugs
- Collections can get corrupted and not display results. Must reboot the GSA.
- Can take up to 4 hours to reboot
- creating a report for a large number of collections over a long period will fail
- database queries are not run upon crawl; Google’s solution: run a Python script that logs in and manually runs the query
Fail over
Initially not purchased with hot standby or 24/7 support. After realizing the reboot problem, they upped their license to get the hot standby. Those can be purchased at any point.
New 5.0 version was just released on Oct. 11. Includes secure file system crawling; date biasing; and "Google Enterprise Labs" features such as search-as-you-type.
I have contact info for the presenter in a file.