Menu Home

Building my own version of locate

For those of you who don’t use Linux or some variant of UNIX in your day to day lives, allow me to explain what locate is.

You have a filesystem, and you want to search it for files that may match a particular name, or a pattern. For that, you can use the command find. Problem was, back in the late 70’s and early 80’s and even through to now on a lot of systems, find is essentially an unindexed search. It will trawl your entire filesystem looking for what you have commanded it to search for.

This can result in long waits and heavily impacted I/O, meaning reduced performance for the rest of the system. And as filesystem sizes got bigger, it got worse.

And so, early in 1983, locate was proposed. This system would essentially run find routinely, and build up a database. You could then perform simple searches against that database – in other words: an indexed search. The only main downside is that it would not be a real-time index. locate has got better as time has gone on, with other descendant versions released, such as secure locate (i.e. slocate), which prevents you from getting search results for files you don’t have access to, and rlocate (I assume the ‘r’ stands for ‘real-time’), which takes slocate and adds a kernel module that updates the database in real-time. Add a new file? The locate database is immediately updated. Problem with rlocate is it seems to have begun and ended with the 2.6 branch of the Linux kernel.

And then you have the incumbent, merging locate (i.e. mlocate) which does the same as slocate as well as throwing in a more efficient way to update its database. Essentially it looks at its own database and checks the modified timestamp of every directory. If a directory has a modified timestamp that differs from the database, then mlocate will re-index it. Or at least that’s my understanding of how it works.

These days, some versions of find keep their own database, and so it may be faster in some cases to just use find

But why did they need a database for locate in the first place? Couldn’t they just have a plaintext file and use grep? (For the uninitiated, grep is a very powerful tool that lets you search a file’s contents).

Well, back in 1983, they asked the same questions. And it has no doubt been asked hundreds of times since. Back then, hard drive space wasn’t so available, and grep on a PDP-11 wasn’t exactly fast. So the database solution was chosen in order to save space and to improve search performance.

So what does all of this have to do with me? Well, it’s 2016. We have hard drive space galore, and grep is fast these days thanks to modern hardware. And I have an itch to scratch. Long story short, this is something I needed to do in order to enable some further work.

Imagine you have a host that doesn’t have locate or some variant of it. Basically the moment you step out of Linux and into Solaris, HP-UX, AIX and others, you find that locate is sadly not as ubiquitous as you would expect. But because hardware isn’t a limitation, we can easily create a simple version using basic shell scripting. So that’s what I did. You can get the code here:

https://github.com/rawiriblundell/simple-locate

So how does it perform? Let’s start with find, and note that this is on an SSD:

# time find / -type d -name "bin"
...
real	0m16.184s
user	0m2.224s
sys	0m5.888s

Next, the incumbent, mlocate:

# time locate -r '.*/bin$'
...
real	0m9.778s
user	0m9.656s
sys	0m0.032s

Now, this uses the regex capability of locate to give us a more precise list. If we just use a raw check, which will return anything and everything with the characters ‘/bin’ in it:

# time locate /bin
...
real	0m1.394s
user	0m0.724s
sys	0m0.024s

But this does return a bunch of useless crud, which you would then waste time filtering out with grep -v

And finally, my version, which is essentially an over-glorified egrep and a plain-text index:

# time ./smlocate /bin$
...
real	0m0.074s
user	0m0.028s
sys	0m0.028s

My version also attempts to replicate slocate‘s separation of root-readable and user-readable indexes. It might be interesting to see how far I can stretch this, or whether it’s good enough just as it is.

One interesting lesson from testing it, however: On FreeBSD, it kept throwing an error from one of the chown commands.

chown: root: Invalid argument

This was caused by the command chown root:root. Of course, it took a few minutes for me to remember that there is no group in FreeBSD named root. The fun thing is, instead of having a solution like:

if [ "`uname`" = FreeBSD ]; then
  chown root:wheel blah
else
  chown root:root blah
fi

It’s a lot simpler than that, as chown can accept UID’s and GID’s:

chown 0:0 blah

Nicely portable.

So, if you find my hackish version of locate useful, let me know.

Categories: BSD Geeking Out Lunix Lunacy Technology

rawiri