Tags: Programming, Ruby on Rails
I'll warm up with a quick programming article. My regular readers will leave now and not notice how rusty my writing is. People stumbling in will have low expectations. Perfect!
For some reason, I decided to install the long-named IBM OmniFind Yahoo! Edition search engine on the home web server that hosts my web sites. One of those sites could really benefit from a dedicated web crawler, since Google has been ignoring it and indexing only a few pages.
So I pointed it at the new web site and got this error: "The URL does not have any pages available to crawl". The site has almost 300 pages to crawl, so I scratched my head in confusion and dug deeper to find the root cause.
It turned out that the OmniFind web crawler was requesting pages in XML format rather than HTML. Since I had not set up the site to respond properly to XML requests, it was handing empty pages to the crawler. So of course the crawler thought it was looking at an empty web site.
From my testing, this happens only with resources in Ruby on Rails 1.2. Older Rails apps received requests for HTML, as I had expected. So Rails 1.2 is interpreting the OmniFind crawler's requests as being for XML, although I could not find anything in the request that indicated this. It could be a bug in Rails 1.2, or an ambiguity in the requests.
This taught me how important it is for web crawlers to identify themselves to the web sites they crawl. The OmniFind docs strongly recommend configuring OmniFind to identify itself, which is done in the Crawl Web Sites > Manage Web Sites > Crawler Settings section of the admin web app.
So, to identify the crawler, I used my e-mail address and a string I made up: "mighty_omnifind". Then, in my Rails app, I added a check to the application controller in application.rb (class ApplicationController < ActionController::Base): if the HTTP request comes from something that identifies itself as "mighty_omnifind", I force the requested format to be HTML. Easy pickings. I couldn't find any documentation about params[:format], but I'm getting used to the way Rails thinks, so I guessed that such a param exists. Lucky for me, it works!
I recommend IBM OmniFind Yahoo! Edition as an extremely easy-to-use, easy-to-install, and free search engine. The install is 3 clicks, and once you add your web site(s), it should run itself. Since I have more than one web site running on this one box, it would be nice to have a separate index for each web site. But there are ways to work around this, and it's hard to complain when the price is $0.
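The user-agent check described above can be sketched in plain Ruby, so that it runs outside Rails. This is only an illustration under stated assumptions: the helper name force_html_for_crawler and the CRAWLER_TOKEN constant are made up for this example, and the params hash here is an ordinary Ruby hash. In a Rails 1.2 controller the same logic would presumably live in a before_filter that reads request.env["HTTP_USER_AGENT"] and assigns params[:format].

```ruby
# Hypothetical sketch: the names below are assumptions for illustration.

# The string configured in the OmniFind crawler settings.
CRAWLER_TOKEN = "mighty_omnifind"

# Returns a copy of params with :format forced to "html" when the
# User-Agent string contains the crawler token. Any other client
# (or a missing User-Agent) gets its params back unchanged.
def force_html_for_crawler(user_agent, params)
  return params unless user_agent.to_s.include?(CRAWLER_TOKEN)
  params.merge(format: "html")
end

# Example: a crawler request that would otherwise be treated as XML.
crawler_params = force_html_for_crawler("mighty_omnifind (me@example.com)", { format: "xml" })
browser_params = force_html_for_crawler("Mozilla/5.0", { format: "xml" })
```

Matching on a substring rather than the whole header keeps the check working even when the crawler appends the configured e-mail address or a version number to its identification string.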