
"Everything that boots is beautiful."


Aug 14, 2007

Rails 1.2 vs OmniFind Yahoo! Edition

Tags: Programming   Ruby on Rails


I'm back from vacation, back in the city, back in the workforce, and back with my MacBook on my lap. Time to show this blog some loving again.

I'll warm up with a quick programming article. My regular readers will leave now and not notice how rusty my writing is. People stumbling in will have low expectations. Perfect!

For some reason, I decided to install the long-named IBM OmniFind Yahoo! Edition search engine on the home web server that hosts my web sites. One of those sites could really benefit from a dedicated web crawler, since Google has been ignoring it and only indexing a few pages.

So I pointed it at the new web site and got this error: "The URL does not have any pages available to crawl". The site has almost 300 pages to crawl, so I scratched my head in confusion and dug deeper to find the root cause.

I found out that the OmniFind web crawler was requesting pages in XML format rather than HTML. Since I had not set up the site to respond properly to XML requests, it was handing empty pages to the crawler. So of course the crawler thought it was looking at an empty web site. In my testing, this only happens with resources in Ruby on Rails 1.2. Older Rails apps treated the same requests as requests for HTML, as I had expected. So Rails 1.2 is interpreting the OmniFind crawler's requests as being for XML, although I could not find anything in the requests that indicated this. It could be a bug in Rails 1.2, or ambiguity in the requests.
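To make the "ambiguity in the requests" idea concrete, here is a minimal sketch in plain Ruby of how a server might turn an Accept header into a response format. It is a simplified model for illustration, not Rails' actual negotiation code: `KNOWN_TYPES` and `negotiate_format` are made-up names, and q-values are ignored.

```ruby
# Simplified model of content negotiation; NOT Rails' real implementation.
# KNOWN_TYPES and negotiate_format are hypothetical names.
KNOWN_TYPES = {
  "text/html"       => :html,
  "application/xml" => :xml,
  "text/xml"        => :xml
}.freeze

# Returns the format of the first recognized MIME type in the Accept
# header, or the fallback when nothing in the header is recognized.
def negotiate_format(accept_header, fallback = :html)
  accept_header.to_s.split(",").each do |entry|
    # Drop parameters such as ";q=0.9" and normalize the type name.
    mime = entry.split(";").first.to_s.strip.downcase
    return KNOWN_TYPES[mime] if KNOWN_TYPES.key?(mime)
  end
  fallback
end

negotiate_format("text/html,application/xml;q=0.9")  # => :html
negotiate_format("application/xml, text/html")        # => :xml
negotiate_format("*/*")                                # => :html (fallback)
```

In a model like this, a header that lists an XML type first, or one that the server cannot match at all, leaves the outcome up to ordering and fallback rules, which is one way a crawler could end up with XML without obviously asking for it.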


This taught me how important it is for web crawlers to identify themselves to the web sites they crawl. The OmniFind docs strongly recommend configuring OmniFind to identify itself, which is done in the Crawl Web Sites > Manage Web Sites > Crawler Settings section of the admin web app.

So, to identify the crawler, I used my e-mail address and a string I made up: "mighty_omnifind". Then, in my Rails app, I added the following to the application controller (app/controllers/application.rb):

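class ApplicationController < ActionController::Base
  before_filter :hack_for_omnifind

  private

  # Force HTML for the OmniFind crawler, whose requests Rails 1.2
  # otherwise treats as asking for XML.
  def hack_for_omnifind
    # Use the request method; the @request instance variable is
    # deprecated in Rails 1.2.
    if request.env["HTTP_USER_AGENT"] == "mighty_omnifind"
      params[:format] = 'html'
    end
  end
end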

So, if the HTTP request is coming from something that identifies itself as "mighty_omnifind", then I force the requested format to be HTML. Easy pickings.
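The same decision can be tried out in plain Ruby, outside of Rails. This is only a sketch: `CRAWLER_NAME` and `force_html_for_crawler` are made-up names, and unlike the filter in the controller it matches the name anywhere in the User-Agent, on the assumption that a crawler might send extra text (a version number, say) along with it.

```ruby
# Hypothetical stand-alone version of the filter logic, for experimenting
# outside Rails. Assumption: the crawler's name may appear anywhere in the
# User-Agent string, so this uses a case-insensitive substring match
# instead of an exact comparison.
CRAWLER_NAME = "mighty_omnifind"

# Returns a copy of params with :format forced to "html" when the request
# comes from the crawler; otherwise returns params unchanged.
def force_html_for_crawler(user_agent, params)
  return params unless user_agent.to_s.downcase.include?(CRAWLER_NAME)
  params.merge(:format => "html")
end

force_html_for_crawler("mighty_omnifind", {})         # => {:format=>"html"}
force_html_for_crawler("Mozilla/5.0", { :id => "7" })  # => {:id=>"7"}
```

The exact comparison is stricter; the substring match is more forgiving if the header ever picks up a suffix. Which one is right depends on what the crawler actually sends.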

I couldn't find any documentation about params[:format], but I'm getting used to the way Rails thinks, so I guessed that such a param exists. Lucky for me, it works!
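Since this was a guess, it is worth checking by hand rather than waiting for the crawler to come around again. A minimal sketch using Ruby's standard Net::HTTP library: build a request that carries the crawler's User-Agent and see what comes back. The helper name, the e-mail address, and the host are placeholders, and the From header is only an assumption about how a crawler might pass along its contact address.

```ruby
require "net/http"

# Build (but do not send) a GET request that identifies itself the way
# the crawler is configured to. crawler_style_request is a hypothetical
# helper; the From header is an assumed place for the e-mail address.
def crawler_style_request(path, agent, email)
  Net::HTTP::Get.new(path, "User-Agent" => agent, "From" => email)
end

req = crawler_style_request("/", "mighty_omnifind", "me@example.com")

# To actually send it (www.example.com is a placeholder host):
#   Net::HTTP.start("www.example.com", 80) do |http|
#     res = http.request(req)
#     puts res.code, res["Content-Type"]
#   end
```

If the filter is doing its job, the response should be an HTML page rather than an empty XML one.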

I recommend IBM OmniFind Yahoo! Edition as an extremely easy-to-use, easy-to-install, and free search engine. The install is 3 clicks, and once you add your web site(s), it should run itself. Since I have more than one web site running on this one box, it would be nice to have a separate index for each web site. But there are ways to work around this, and it's hard to complain when the price is $0.




  Staggo Lee, on Tuesday, August 14, 2007 at 23:36 Eastern Daylight Time:
I hope you had a great vacation. At first, this post went way over my head. Then, I really studied it. Well, this info does address some issues/questions I have, so I'll give it a try.
  Neil, on Wednesday, August 15, 2007 at 01:01 Eastern Daylight Time:
I had a great vacation, thanks Staggo! I suspect that this article isn't really for anyone. =) I ran into this problem, searched for help on Google, and found nothing. So I wrote this just in case someone else tries to mix OmniFind with a Rails app. I found no hits in Google discussing OmniFind and Rails together, so I think it is an extremely uncommon combination. It might also be that OmniFind has failed to market itself and is only being used by IBM's loyal customer base in conjunction with IBM products (Lotus especially).

