4 years with MongoDB

When I met MongoDB, I still had, seared in my mind, the image of a contractor completely destroyed by the failed attempt to do something similar to what I had been asked to do.
Among a number of questionable decisions, the contractor took a road that smelled much like “tradition”, rather than innovation: MySQL.
Now let’s be clear, I’ve nothing against MySQL or relational databases in general. I still believe they’re the very best solution in most “common” situations. But the challenge we’re talking about was way beyond the reasonable capabilities of a relational database.
In a context where every data structure is almost completely fluid and can contain various levels of nested data, a classic approach made up of hundreds of tables and joins simply wasn’t the way to go, and among a variety of NoSQL databases, a document store was the most reasonable choice.

It’s not that MongoDB is a better database, but it certainly is a better choice for some specific purposes

Keep this very clear in mind: contrary to what happens in the relational world, where every piece of technology strives to be a better implementation of an established model, in the NoSQL world every database radically differentiates its capabilities to deliver a limited subset of powerful features, strictly coupled with the philosophy of the data it's going to store.

Which means, in other words: data first.

Analyze your problem, your data, the use you're gonna make of it, and then search for the technology that best matches your use case.
At that time, MongoDB was roaring among developers, especially in the web world. From the information I gathered, MongoDB looked like the panacea for everything, the Tom Hanks that turns a poor script into an acceptable movie. Formal research and, later on, hands-on experience showed this is far from the truth. But for my use case? It looked astonishingly perfect.

What it is

MongoDB is a NoSQL database (meaning that you don't use SQL to query it) that stores data quite differently from what you are probably used to. Instead of going through a declarative phase, you simply throw stuff at it, into what they call collections, which I would loosely relate to tables. The stuff you throw at it is documents of data, nested at various levels. So remove from your mind the idea of tables: this is much more like an archive of files. Contrary to key / value stores, though, indexing in MongoDB is extremely advanced, allowing multiple, non-unique, compound indexes. By using these indexes, you can search, filter and sort according to virtually any need. As one would expect, drivers for any non-meaningless programming language are available. Data structures are represented as JSON objects and JavaScript is the language of choice when scripting within the MongoDB shell.
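
To make the document idea a bit more concrete, here is a minimal sketch in plain Python, not a real MongoDB driver. The `articles` data, the field names and both helper functions are made up for illustration. The toy index only imitates the idea of a non-unique compound index (sorted keys pointing back at documents); it says nothing about how MongoDB implements it internally.

```python
from bisect import bisect_left, bisect_right

# A "collection" is just a bag of documents; no schema is declared up front.
# All names and fields here are hypothetical.
articles = [
    {"_id": 1, "author": {"name": "ann", "country": "IT"}, "year": 2014, "tags": ["db", "nosql"]},
    {"_id": 2, "author": {"name": "bob", "country": "UK"}, "year": 2013, "tags": ["web"]},
    {"_id": 3, "author": {"name": "ann", "country": "IT"}, "year": 2012, "tags": []},
    {"_id": 4, "author": {"name": "ann", "country": "IT"}, "year": 2014},  # no "tags" field at all
]


def build_compound_index(docs):
    """Toy non-unique compound index on (author.name, year).

    Returns a sorted list of ((name, year), position) entries.
    """
    entries = [((d["author"]["name"], d["year"]), pos) for pos, d in enumerate(docs)]
    entries.sort()
    return entries


def find_by_author_and_year(docs, index, name, year):
    """Look documents up through the index instead of reading every document."""
    keys = [key for key, _ in index]
    lo = bisect_left(keys, (name, year))
    hi = bisect_right(keys, (name, year))
    return [docs[pos] for _, pos in index[lo:hi]]


index = build_compound_index(articles)
matches = find_by_author_and_year(articles, index, "ann", 2014)
print([d["_id"] for d in matches])
```

Note that the nested `author` field is reachable by the index and that two documents share the same key, which is what "non-unique" means here.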

Why use it

This is probably the most sensitive question. If you're looking for a database to replace your SQL database with something "more exotic" or "thought to be more performing", well, maybe you should keep reading and possibly reconsider. Among most NoSQL databases, MongoDB could potentially replace a relational database, but I strongly discourage this approach.
MongoDB is not, by any means, a SQL / relational database replacement for two humongous reasons: it doesn’t work with SQL and it is not relational. Captain Obvious in action.
Querying MongoDB is quite a different experience compared to SQL, though it’s definitely not something I would consider hard or requiring special training.
For the relational part… let’s say this: if you think you don’t need constraints, think twice, be absolutely sure. Managing relations between data items (as you would do in MongoDB to “replace” the relational feature) can be a bit of a drama in an application where data can be this complex without constraints. Any software / person using the database would have the duty to maintain consistency… To be honest: don’t. Relations between documents (and other documents or other information units) should be simple and weak. The system should be able to keep going if they break.
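
A sketch of what "simple and weak" could look like in application code, again in plain Python with dictionaries standing in for collections. The `users` and `posts` data and the `resolve_author` helper are hypothetical; the only point is that a dangling reference degrades the result instead of breaking the application.

```python
# Dictionaries stand in for two collections (hypothetical data).
users = {
    "u1": {"_id": "u1", "name": "ann"},
}
posts = [
    {"_id": "p1", "title": "Hello", "author_id": "u1"},
    {"_id": "p2", "title": "Orphan", "author_id": "u9"},  # user was deleted: no constraint prevented it
]


def resolve_author(post, users):
    """Follow the reference by hand; there is no constraint to rely on."""
    author = users.get(post.get("author_id"))
    # A broken link is an expected case here, not an exception.
    return author["name"] if author else "unknown"


rendered = [(p["title"], resolve_author(p, users)) for p in posts]
print(rendered)
```
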
So why use it then?

  • If you are going to store a shitload of complex data where each item is almost self explanatory (documents) and you will, most of the time, query documents to digest them entirely, go ahead.
  • If you need to index your complex documents in various creative ways to retrieve them, count them etc., this works really well. To be honest, this is one of the most powerful indexing systems I've seen around for this kind of purpose. If your indexing needs are simpler, maybe you should also evaluate other NoSQL servers that might fulfill other needs better.
  • If heavy inserts / edits / deletes are not the real deal of your app, great news because MongoDB is not so good at them. If, on the contrary, they ARE the deal, then definitely look at other NoSQL servers.
  • If you're fine with a cluster of servers where a master is allowed to write and the slaves will be eventually consistent, great, go ahead. If you need all your slaves to always be in sync, then MongoDB does not fit.
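
To show what "eventually consistent" means for a reader of the last point, here is a toy model in plain Python. The `ToyCluster` class and its methods are invented for this sketch and are not MongoDB's replication protocol; they only show that a slave can answer with stale data until it has replayed the master's writes.

```python
class Node:
    """A toy database node holding key/value data."""

    def __init__(self):
        self.data = {}


class ToyCluster:
    """One master accepts writes; slaves catch up later by replaying a log."""

    def __init__(self, slave_count):
        self.master = Node()
        self.slaves = [Node() for _ in range(slave_count)]
        self.log = []                      # ordered list of writes
        self.applied = [0] * slave_count   # how far each slave has replayed

    def write(self, key, value):
        """Writes only ever go to the master."""
        self.master.data[key] = value
        self.log.append((key, value))

    def replicate(self, slave_no):
        """Replay pending writes on one slave (a real system does this asynchronously)."""
        for key, value in self.log[self.applied[slave_no]:]:
            self.slaves[slave_no].data[key] = value
        self.applied[slave_no] = len(self.log)

    def read_from_slave(self, slave_no, key):
        return self.slaves[slave_no].data.get(key)


cluster = ToyCluster(slave_count=2)
cluster.write("visits", 1)
stale = cluster.read_from_slave(0, "visits")   # the slave has not caught up yet
cluster.replicate(0)
fresh = cluster.read_from_slave(0, "visits")
print(stale, fresh)
```

If your application cannot live with the first of those two reads, that is the "does not fit" case.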

What you should be aware of

Here are some general hints coming from direct experience.

  • They tell you MongoDB is RAM / disk hungry and you reply: “dude, disk and RAM are cheap now”. Reconsider. You have no freaking idea of how hungry it is.
    RAM: it maps the database into memory, meaning it keeps as many pages of data in RAM as it can. This is great because if you need data that is currently loaded in memory, it is blazing fast; on the other hand, if you need stuff that is not in memory, it has to swap pages in from disk, which is not great at all. Disk is cheap alright, but a solid state drive will make a lot of difference in this scenario. How much RAM will it require? As much as you can provide, it's basically bottomless.
    Disk: to make things more difficult, the disk hunger grows in a gargantuan fashion too. Now be aware that any database meant to deal with big data will grow on disk like a family of gremlins in a swimming pool, and MongoDB is a total pro at this. It will preallocate and organize a lot of disk just to make the I/O ops perform better, and rest assured it will use it.
    Note: I’ve recently read the news about MongoDB 3.0 and it looks like it now employs a technology that vastly reduces this hunger. I haven’t had the time to try it, but I will keep you posted.
  • All these degrees of freedom will make you forget how relevant it is to be accurate in data design. One day, when the data reaches a significant size, one of your queries will take 5 minutes to produce acceptable results, saturate your disk I/O and potentially kill your system.
    What happened? That query is not picking up one of your indexes and it's going straight to the disk, reading the whole goddamn database. Don't underestimate this event: it could turn into a showstopper. No matter how flexible MongoDB feels, data modeling is still a thing, and no, your queries shouldn't deal with anything but indexes, unless it's impossible to do so. Of course MongoDB works great for prototyping and you might even avoid thinking about the indexing thing, but do not even dare to go live without a detailed analysis.
    Good design aside, my suggestion is to take your time to analyze every query you run with the almighty explain command, which will tell you exactly how much effort it takes MongoDB to perform a search / sort for you.
  • Evaluate data transfer. It is generally a good idea in any system, but these are big ass documents and if your query transfers tons of them from the DB to the application, see how that will impact your setup. This problem (which doesn't necessarily become a problem for you) can be very tedious, especially in a highly distributed computing environment where the queries are performed all over the world over a semi-centralized data store. There certainly is a solution to mitigate this problem, but it has a price. To conclude, if you're going to retrieve those fatty documents just to look into a field or two and determine whether they're what you're looking for or not, maybe you need to review how you're doing it.
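
The difference between a query that hits an index and one that reads everything can be felt even in a toy model. The sketch below is plain Python with made-up data and helper names, not MongoDB itself; it just counts how many documents each strategy has to look at, which is the kind of number explain reports among other things.

```python
from collections import defaultdict

# 10,000 hypothetical documents; only a handful belong to customer 42.
orders = [{"_id": i, "customer_id": i % 1000, "total": i * 1.5} for i in range(10_000)]


def find_with_scan(docs, customer_id):
    """No usable index: every document has to be read and checked."""
    examined = 0
    found = []
    for doc in docs:
        examined += 1
        if doc["customer_id"] == customer_id:
            found.append(doc)
    return found, examined


def build_index(docs, field):
    """Toy index: field value -> positions of the matching documents."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        index[doc[field]].append(pos)
    return index


def find_with_index(docs, index, customer_id):
    """With an index, only the matching documents are touched."""
    positions = index.get(customer_id, [])
    return [docs[pos] for pos in positions], len(positions)


by_customer = build_index(orders, "customer_id")
scan_hits, scan_examined = find_with_scan(orders, 42)
index_hits, index_examined = find_with_index(orders, by_customer, 42)
print(scan_examined, index_examined)
```

Both strategies return the same documents; the scan simply examines all of them to get there. On a real data set that gap is what turns into minutes of disk I/O.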

Other interesting features

We’ve said what it’s good at, and what is not. But there are a few other things that can make MongoDB ideal for your needs.

  • Indexing. Oh what now, again? Yeah, AGAIN. I told you indexing is awesome, but there’s something to add! MongoDB features full text search indexes on fields. Not too advanced, nothing compared to what you’d want to do with, say, Solr, but it pretty much covers a number of scenarios.
  • Aggregation. As we previously stated, with the numbers you might find yourself dealing with, aggregating data right in the database, before it actually hits your app, could make a huge difference. In certain scenarios, it might make the difference between making it and making it blow up. The aggregation API allows you to select fields, group, count and perform basic calculations to generate a new dataset that represents aggregated information. It might not be the fastest thing in the world, I know, but it does work, and if your docs are many and pretty big, it could save you.
  • MapReduce. Now, I hate this buzz about how "MapReduce can solve world famine". I won't get into what MapReduce is in detail, but to cut it short, it's a technique to write long running processes which analyze lots of data to produce calculated information. There are frameworks out there doing this, and they're basically the ones allowing you to search on Google, or have a simulacrum of a social life on Facebook. Just like the full text search thing, though, MongoDB comes with a simple MapReduce implementation built in! I haven't needed it so far, but it does solve a number of problems, and having it there in MongoDB is pretty awesome.
  • Replication. It's a classic model where a master replicates on a number of slaves. If the master goes down, the slaves proceed to elect a new master. Reads can be performed on all the slaves, while writes need to go to the master. What's really good is that creating a MongoDB cluster is a very simple operation, no rocket science here.
  • Sharding. MongoDB has a sharding functionality, but unfortunately I'm not an expert in it, so I suggest you look into that yourself, if that's a thing for your project.
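
For the aggregation point above, here is a small sketch of the idea. The `sales` documents are made up. The `pipeline` list shows the general shape such a request takes (a filter stage followed by a grouping stage), while `aggregate_in_app` is a hypothetical plain-Python stand-in for the work your application would do by itself if every document were shipped to it first.

```python
from collections import defaultdict

# Hypothetical documents; in real life these would live in a collection.
sales = [
    {"country": "IT", "status": "paid", "amount": 10},
    {"country": "IT", "status": "paid", "amount": 5},
    {"country": "UK", "status": "paid", "amount": 7},
    {"country": "UK", "status": "refunded", "amount": 7},
]

# Shape of an aggregation pipeline: filter first, then group and compute.
# It is only shown as data here; nothing in this sketch sends it to a server.
pipeline = [
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$country", "count": {"$sum": 1}, "total": {"$sum": "$amount"}}},
]


def aggregate_in_app(docs):
    """The same grouping done client side, after transferring every document."""
    groups = defaultdict(lambda: {"count": 0, "total": 0})
    for doc in docs:
        if doc["status"] != "paid":
            continue
        group = groups[doc["country"]]
        group["count"] += 1
        group["total"] += doc["amount"]
    return dict(groups)


summary = aggregate_in_app(sales)
print(summary)
```

The result is two tiny rows instead of four documents; with millions of big documents, letting the database do this step is what keeps the data transfer sane.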

Things to evaluate during software selection

We’re there, finally. This chapter basically summarizes what we’ve said so far.

  • Who is going to use it: ops? devOps? MongoDB is extremely easy to manage, it does not require a guru
  • What data will fit in there: remember, big documents with a need to be indexed, with weak relationships with other pieces of data
  • What I/O you will have: lots of reads, a reasonable number of writes
  • How your cluster will distribute: one master and multiple slaves, eventually consistent
  • What resources you will assign to it: lots of them. And the disk has to be fat and fast
  • How you are going to query it: brutally, but only on indexes
  • How you are going to design your data: prototyping can be loose, but you must come up with a great, solid plan before going live
  • How it will resist change: it basically doesn't give a f**k, it's all up to you

Conclusions

As with all trending technologies, you will find a lot of people out there saying it sucks, and lots of people saying it rocks. But most people whine.
MongoDB is like Bon Jovi. It’s fun, does the job, but if you expect it to be heavy metal you will probably end up throwing up. You can’t be cute and a true headbanger at the same time.
There's nothing in this world that can help you avoid compromises or the need to write ad hoc code to bypass the limitations of the tools you use (or a flaw in the requirements).
With that said, though, you should fight your desire to solve all your problems with one technology, and see what MongoDB is actually good at. And in what it’s good at, MongoDB is a beast.
Moreover, the hype this database created not only boosted adoption but also had MongoDB grow in quality and features with a determination that I consider rare and precious in this field. Lots of things happened in these four years, and the product developed features that were unexpected when I started using it.

Would I recommend MongoDB? Hell yes.
