Faithless – Mass Destruction
At work we have a Google Mini search appliance, and it is currently indexing our Drupal-based intranet portal as a proof of concept for a full-blown Google Search Appliance solution. Unfortunately the results have been cluttered with duplicates that I’ve been struggling to prune. I battled with it for a few days, then left it to focus on other things in the hope that I’d have a eureka moment. And I did, though it was more of a major D’oh moment. The duplicates arise because there are multiple ways to get to a page in Drupal, like so:
http://someportal.domain.tld/drupal/NicePageName
http://someportal.domain.tld/drupal/node/123
http://someportal.domain.tld/drupal/?q=/node/123
http://someportal.domain.tld/drupal/?q=NicePageName
etc…
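To make the overlap concrete, here is a minimal Python sketch that boils each of the four forms above down to the Drupal path it points at. The page_key helper and the BASE prefix are hypothetical names for illustration, and it only handles the simple forms listed above. It has no way of knowing that NicePageName is an alias for node/123, because that mapping lives inside Drupal.

```python
from urllib.parse import urlsplit, parse_qs

BASE = "/drupal/"  # path prefix taken from the example URLs above


def page_key(url: str) -> str:
    """Return the Drupal path a URL points at, whichever form it uses."""
    parts = urlsplit(url)
    q = parse_qs(parts.query).get("q")
    if q:
        # ?q= form: the Drupal path travels in the query string
        path = q[0]
    elif parts.path.startswith(BASE):
        # clean URL form: strip the base prefix from the path
        path = parts.path[len(BASE):]
    else:
        path = parts.path
    return path.strip("/")


urls = [
    "http://someportal.domain.tld/drupal/NicePageName",
    "http://someportal.domain.tld/drupal/node/123",
    "http://someportal.domain.tld/drupal/?q=/node/123",
    "http://someportal.domain.tld/drupal/?q=NicePageName",
]
print(sorted({page_key(u) for u in urls}))  # ['NicePageName', 'node/123']
```

Four URLs, two keys, and in reality a single page.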
On top of that there are all the page functions that you really don’t want indexed, especially if you want your search results to be relevant and want to get the most out of a limited document licence. You might find results like this in your search:
http://someportal.domain.tld/drupal/NicePageName?q=comments
http://someportal.domain.tld/drupal/NicePageName?q=quotes
http://someportal.domain.tld/drupal/NicePageName?q=print
http://someportal.domain.tld/drupal/NicePageName?q=add
Not to mention the same for the node/nnn page names, e.g.:
http://someportal.domain.tld/drupal/node/123?q=add
http://someportal.domain.tld/drupal/?q=/node/123?q=add
As well as links to other pages and back to the page itself, like so:
http://someportal.domain.tld/drupal/NicePageName?q=/node/123
Or to put it another way, I ran a test search for inurl:NicePageName and 158 results came back. There should have been 2 results for this particular page name, namely /drupal/NicePageName and /drupal/?q=NicePageName.
Fortunately the Google Mini allows you to enter exclusions in the form of regular expressions, and a colleague and I had built up an impressive list of exclusions in a bid to trim the duplicates back. I wasn’t even thinking about the problem when the answer came to me: I had neglected to use a range. dee ew aych: DUH! These two rules replaced a dozen others:
regexp:.*/drupal/[A-Z,a-z].*?q=
regexp:.*/drupal/?q=[A-Z,a-z].*?q=
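If you’re struggling to follow the meaning of these, here’s a quick rundown. The first rule excludes any URL that has /drupal/ in it, followed by an alphabetic character in either the A-Z range (ie uppercase) or the a-z range (ie lowercase), followed by any number of characters (.* is the wildcard), and then ?q= somewhere after that. The comma inside the square brackets is matched literally as well, which is harmless here but not needed. That covers URLs like so: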
http://someportal.domain.tld/drupal/NicePageName?q=add
http://someportal.domain.tld/drupal/node/123?q=/node/100
The second rule is the same, except that it starts with /drupal/?q= followed by the ranges, the wildcard and a second ?q=, so URLs like the following will be ignored:
http://someportal.domain.tld/drupal/?q=/node/123?q=print
http://someportal.domain.tld/drupal/?q=NicePageName?q=comments
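If you want to sanity-check rules like these before pasting them into the appliance, a few lines of Python will do. This is only a rough sketch and not the Mini’s own matcher. It assumes Python’s re dialect, where a literal ? has to be escaped, it drops the comma from the character class, and it adds an optional / after the first ?q= (an assumption on my part) so that the ?q=/node/123 form is caught as well. The is_excluded helper and the sample list are made up for illustration.

```python
import re

# Python translations of the two exclusion rules. Assumptions: "?" is
# escaped so it is matched literally, the comma is dropped from the
# letter class, and "/?" is added so "?q=/node/123" forms match too.
RULES = [
    re.compile(r".*/drupal/[A-Za-z].*\?q="),
    re.compile(r".*/drupal/\?q=/?[A-Za-z].*\?q="),
]


def is_excluded(url: str) -> bool:
    """True if any exclusion rule matches the URL from its start."""
    return any(rule.match(url) for rule in RULES)


samples = [
    "http://someportal.domain.tld/drupal/NicePageName",
    "http://someportal.domain.tld/drupal/?q=NicePageName",
    "http://someportal.domain.tld/drupal/NicePageName?q=add",
    "http://someportal.domain.tld/drupal/node/123?q=/node/100",
    "http://someportal.domain.tld/drupal/?q=/node/123?q=print",
    "http://someportal.domain.tld/drupal/?q=NicePageName?q=comments",
]
for url in samples:
    # The first two samples should stay crawlable, the rest get skipped
    print("skip " if is_excluded(url) else "crawl", url)
```

The point of keeping the two plain page URLs in the sample list is to confirm that the rules only throw away the duplicates and not the pages themselves.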
I was pretty dense not to click on to those regexps earlier; my brain must have been running on autopilot for weeks. Oh, and I absolutely LOVE the Google Search Appliance kit. It’s the smartest thing you can do for enterprise search in my opinion. Search is so important to get right, and you simply cannot go wrong with the GSA gear. As for tightly integrating a GSA and Drupal, keep an eye on this thread; we did consider hacking this module too.
Oh, and if you’ve got the crawler set up to access the site with a read/write-capable account, you’ll definitely want to block out ?q=revision.* before you point your Google box at your Drupal portal. Otherwise the crawler will hit the revisions page of a page and then try to follow all the links on it, including the postmethod rollback links. Yes, the Google box can potentially roll back your portal, though typically it will get stuck in a loop on only a handful of pages, so the damage will be minimal by the time you realise.