Unleashing the power of geospatial indexing with Scala and MongoDB

Jeroen van Wilgenburg

Right now I'm following some geospatial tweets and came across an interesting one about a new option to add a geospatial index to a MongoDB. Since I've done some stuff with Scala recently I decided to insert the data into MongoDB with Scala using scamongo. Unfortunately the scamongo Scala driver for MongoDB gave me too much trouble, so I switched to the java driver.

I have no experience with MongoDB at all, it just sounds cool and people tell me it's cool. That's enough for me to do some research. I recently finished my bachelor thesis about collecting gps points and discovering running routes from those points. The main bottleneck in my application was counting the number of points within a certain radius of a point. You basically have to check all the points to see whether they're inside the radius. Let's say you have 10^6 points (and I have), then (10^6)*(10^6) steps are needed.
I advised to use a geospatial index with PostGIS in my conclusion, but that's not really new and cool, so let's give it a shot with MongoDB.

What is MongoDB?

mongodb-logo"MongoDB (from 'humongous') is a scalable, high-performance, open source, schema-free, document-oriented database."
Well that's quite a proper description I think. The idea is to insert documents (JSON objects) into a collection (comparable to a table in an RDBMS) in a database. Queries don't use SQL, they are a bit comparable with QBE.

Inserting and querying can be done with a plethora of languages. There is official support for C, C++, Java, Javascript, Perl, PHP, Python and Ruby. The community supports another bunch of languages which are listed here).
Installing MonogDB is very easy. Just extract a file and start the database with mongod. Since it is different amongst the various operating systems I have to refer to the quickstart guide (and I can't explain it any clearer anyways).
With the command mongo a CLI is started. Let's insert an object and find that object.
> db.foo.save( { a : 1 } )
> db.foo.find()

The first line of code above saves a document in the collection foo with key/value a/1. When you do a find (second line) the same object is displayed, now with a unique id (the key _id is used for this)

There is a lot to tell about MongoDB, but many people did that already, so let's focus on the geospatial indexing in combination with Scala.

How a geospatial index works in MongoDB

A geospatial index is just like a normal database index, but in two dimensions (or sometimes three when you account for height). The index in MongoDB is a so called grid index (there are other indexes like r-tree and quadtree, but that might be too much info for a blog).
The geo-index of MongoDB encodes a geohash on top of a standard MongoDB b-tree. Geohashing (not to be confused with geocaching) is a way to divide a coordinate system into hierarchical buckets of grid shape.
It basically divides an area in two for every bit and so on. An elaborate description can be found on the wikipedia page.
A drawback of geohashing is the wrapping (-180 and 179 are very close for example), this can be solved by doing a grid search after the initial scan. It's a bit comparable by how google maps loads the map images, it starts with the center of the viewport and then the surrounding areas are downloaded.

Our Laapersveld office in a geohash: u179jbzdexru

Using the java driver (in Scala)

Since the Scala driver gave me too much trouble I tried the java driver. It can be downloaded from a github repo. Just include the jar in your Scala project and you're ready to go.

To query everything from collection foo, the following code is needed:
val mongo = new Mongo( "localhost" , 27017 )
val db = mongo.getDB( "test" )
val coll = db.getCollection("foo")
val cursor = db.find()
while (cursor.hasNext) {
println(cursor.next)
}

A big warning from the API which is important enough to repeat is that toArray and length on cursor convert the cursor to an array, which can be a very expensive operation, especially on humongous databases!

Inserting the data

Let's take some arbritary data from my data set:
{ "routeId" : "9327014.tcx", "id" : "1105409", "loc" : { "x" : 5.1400218, "y" : 52.0614276 } }
As you can see I use a nested object, we need this for the index later on and it's to show you how to do nested objects (plain objects are just too easy ;) ).
val doc = new BasicDBObject()
doc.put("routeId", "9327014.tcx")
doc.put("id", "1105409")
val loc=new BasicDBObject()
loc.put("x", 5.1400218)
loc.put("y", 52.0614276)
doc.put("loc", loc)
coll.insert(doc) /* coll is a DBCollection */

That's all! No transactions to commit or connections to close.

Creating the index

Creating the index is quite easy, it can be done from the java driver or on the command line.

The objects in the database look like this:
{ "routeId" : "activity_123456.tcx", "id" : " 1105402", "loc" : { "x" : 5.1415501, "y" : 52.0616409 } }
The x,y coordinates are wrapped in a loc key, because that's the way MongoDB likes it. The first two columns in the loc object are treated as x and y coordinates.
On the command line execute db.foo.ensureIndex( { loc: "2d" } ) and your index is created.
Creating the index in Scala is also possible:
val index=new BasicDBObject()
index.put("loc","2d")
coll.ensureIndex(index) /* coll is a DBCollection */

Forgetting to create an index and doing spatial queries might work, but sometimes I got strange crashes (and bad performance). So check this before throwing MongoDB out of your window ;-)

Querying the data and comparing the speed

It's now possible to do a geospatial query:
db.foo.find( { loc : { $near : [5.14,52.06] } } )
This will display all the points near 5.14, 52.06 sorted by distance from that point. Note that by default a maximum of 100 points is returned! This can be solved by appending the query with .limit(500), for a maximum of 500 results. Note that the higher the number the slower your query gets, so be careful with increasing this number.

I wanted to count the number of points within a certain radius. For this you need to add bounds criteria to the query:
db.foo.find({"loc" : {"$within" : {"$center" : [[5.14,52.06], 0.0005]}}})
Note that the bounds criteria were added two weeks ago (version 1.3.4) and I'm not sure whether this already is ready for production. It works fine for me, but you might be a bit careful when using it for real.

In Scala building the query is a bit more verbose. You have to construct an object and query a collection for that object (note that this is the same as the first query):
val query = new BasicDBObject()
val loc = new BasicDBObject()
val near = new BasicDBList()
near.put( "0", 5.15 )
near.put( "1", 52.069 )
loc.put("near", near)
query.put("loc",loc)
/* coll is a DBCollection */
coll.find(query).limit(50)

I wanted to do a speed comparison, but I decided it isn't that important yet. Since MongoDB and especially the geo-index is improving every day it's enough to say that it's a lot faster on large data sets (for me > 400,000 points). The small sets ( < 200,000 ) have comparable speed.

Conclusion

MongoDB is cool and the people who told me that were right. Besides cool I really loved it. I got my database running within a few minutes. The documentation is very well written and updated regularly (the new features in the geo-index were updated in the documentation instantly and that's quite unique).
Scamongo was a bit of a disappointment (probably also thanks to my perseverance and wanting quick results ;) ), it looks promising so I might give it another shot within a few months.
The geo-index is really cool. When I started playing around a few weeks ago some features were missing, but within days these features were added (and I had to rewrite some parts of this blog ;) ). Kudo's to the MongoDB guys for giving me cool stuff, updating it regularly and even without breaking!

Sources

MongoDB
Geohashing
MongoDB Java Driver
MongoDB Java Driver API

Comments (3)

  1. Mike Dirolf - Reply

    March 28, 2010 at 7:56 pm

    Really enjoyed this post, thanks for the writeup. Glad you're enjoying MongoDB!

  2. Seth - Reply

    April 20, 2010 at 12:21 am

    Jeroen, I'm interested in the bachelor thesis you mentioned. Is a copy of your research or the final paper available?

  3. NOSQL Talk and References | My missives - Reply

    June 23, 2014 at 8:59 am

    […] <- Justin and Jason speak with Michael Dirolf, a lead developer at 10gen [66] http://blog.xebia.com/2010/03/28/mongodb-geospatial-index/ [67] http://dirolf.com/2010/05/27/inside-mongodb.html [68] http://en.wikipedia.org/wiki/Geohash […]

Add a Comment