PostgreSQL JSONB type – first time hands on!

Just when you think you know a thing or two about databases…

From relational to NoSQL, from MongoDB to Cassandra to Elastic… there’s been quite a lot going on. Or rather, you know you’re getting old when all these events are packed close together in your memory. But you’re OK with it and you understand how this new world has its own rules, the most important one being:

Each one of these is very good at some things, and very bad at other things

Then something happens and you don’t know what to think about it: your life companion PostgreSQL has a structured data column type.

It’s like when your wife says “I’m pregnant” and you’re not really sure if you’re like: “I’m gonna be a father”, laugh, cry, buy a ticket to Cuba, “it must be someone else’s”, summarize in your head the gazillion of problems you’re gonna face, hug your wife.

And if you think this comparison is excessively nerdy, you gotta spend some time with a developer’s wife/husband and hear it from her/him.

Anyway, to understand how, you’d better figure out why (I’m talking about databases now…), and to do that, I realized I should play with it first and then form an opinion.

Getting started

Well, first off, make sure you have at least PostgreSQL 9.4 installed. On my Ubuntu 14.04 LTS I had to upgrade it by adding another APT repository. Also keep in mind that when upgrading like this, you’re not actually upgrading but adding another instance on the side, and they will both run, somehow.

Next, launch your psql client. If you’re completely new to PostgreSQL (I hope not, because I’m definitely not going to help you with the basics), you might find it easier to su to the postgres user and continue from there. But again, if you’re completely new to it, I suggest you start from here.

Once you’re in, create a test database and connect to it:

CREATE DATABASE testj;
\c testj

Table definition

PostgreSQL ain’t MongoDB. We’re still in a relational database, we still have tables, we still have schemas. This JSONB thing in PostgreSQL is a column type. To better understand how we can benefit from this data type, I’d like to use the very first example that came into my mind.
When creating an application with registered users, you know exactly what a user is about (email address, full name, country, etc.), but depending on what your application is going to do, you might have a number (at first finite, then… who knows) of other properties related to the user that are going to be fluid, change over time and might even be customized by users themselves. Access privileges, for instance.
Until now, you had four options.

  1. Alter the table to add columns when needed. Meh…
  2. Create a “properties” table and join when needed. Better, but still meh…
  3. If you’re already using a document database, create a document of properties for each user. Seen and done, but boring and capable of creating lots of garbage.
  4. Store serialized structured data in one column as text. Acceptable, but not fun, as there’s no way to query the database for particular properties and you will need to deserialize it each time…
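For reference, option 2 usually ends up looking like the classic key/value sketch below; the table and column names are made up for illustration:

```sql
-- One row per (user, property); every value degraded to text
CREATE TABLE user_properties (
    user_id INTEGER NOT NULL,   -- references your users table
    name    TEXT    NOT NULL,
    value   TEXT    NOT NULL,
    PRIMARY KEY (user_id, name)
);

-- Fetching a single property means a join for every lookup
SELECT u.full_name, p.value AS privilege
  FROM myuser u
  JOIN user_properties p ON p.user_id =
 WHERE = 'privilege';
```

Workable, but every new property is a join away, and every value is a string you have to cast back yourself.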

This JSONB column seems to be doing just fine for this simple purpose.

CREATE TABLE myuser (id         SERIAL PRIMARY KEY,
                     full_name  TEXT   NOT NULL,
                     email      TEXT   NOT NULL,
                     properties JSONB  NOT NULL);

As you can see, this is a usual create table, with a twist: the JSONB column.


We know enough to do this by now:

INSERT INTO myuser (full_name, email, properties)
           VALUES ('Charlie Sheen',
                   '',  -- the original address was elided
                   '{"privilege": 5, "warehouse_access": true}');

What’s really interesting here is that the properties field is inserted as you would do with text, but the content is actually evaluated. Try to introduce a syntax error in the JSON and you will see…
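For instance, a trailing comma is enough to get the row rejected (a made-up example; the exact error message may differ between versions):

```sql
INSERT INTO myuser (full_name, email, properties)
       VALUES ('Bad Data', '', '{"privilege": 5,}');
-- ERROR:  invalid input syntax for type json
```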

Let’s query that to learn more:

SELECT * FROM myuser;

 id |   full_name   | email |                 properties
----+---------------+-------+--------------------------------------------
  1 | Charlie Sheen |       | {"privilege": 5, "warehouse_access": true}

Pretty clean, but still not a big difference from a text column.


Now let’s see why this is better than storing a piece of text.
The very first thing that worries me when it comes to storing big stuff in a database is: if I need to retrieve a subset of fields from a big number of big rows, how much junk is going to go through the wire?
You know this is not a problem with SQL servers, as the SELECT clause allows you to finely choose what you need.
In other systems, such as MongoDB, to do so you need to use a specific API (aggregation).
This is a very important topic because on heavy duty systems this can cause you a lot of trouble. Other than the traffic on the wire, you need to realize your software will have to parse the data once it makes it into the system. I have a few terrible stories where 80% of the query processing time was taken by the software trying to represent the received data.

Say I want to select the privilege level of users (I’ve added another user in the meantime):

SELECT id,properties->'privilege' AS privilege FROM myuser;

 id | privilege 
----+-----------
  1 | 5
  2 | 2

Say I want to know how many people have access to the warehouse:

SELECT COUNT(*), properties->'warehouse_access' AS warehouse_access
                  FROM myuser
                  GROUP BY properties->'warehouse_access';

 count | warehouse_access 
-------+------------------
     1 | true
     1 | false

Say I want to have the list of people having access to the warehouse:

SELECT * FROM myuser WHERE properties->>'warehouse_access'='true';

 id |   full_name   | email |                 properties
----+---------------+-------+--------------------------------------------
  1 | Charlie Sheen |       | {"privilege": 5, "warehouse_access": true}

Or you can decide to match some document properties (in a Mongoish way) by:

SELECT * FROM myuser WHERE properties @> '{"warehouse_access":true,"privilege":5}';

The syntax looks a bit upsetting, but the thing is that they decided to create multiple operators, which you can find here.
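To keep the main ones straight, here’s a quick sketch of mine: “->” extracts a jsonb value, “->>” extracts text, and “?” tests for key existence.

```sql
SELECT properties ->  'privilege' AS as_jsonb,  -- jsonb value
       properties ->> 'privilege' AS as_text,   -- text value
       properties ?   'privilege' AS has_key    -- boolean
  FROM myuser;
```

That “->” versus “->>” distinction matters when filtering: comparing against text requires “->>”, as in the warehouse_access query above.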

Transactions and atomicity

This is probably one of the most intriguing things about this feature.
Transactions, anyone? YES. As I previously said, PostgreSQL didn’t change its intimate nature, therefore transactions are part of its daily routine. And as the JSONB-typed column is “just a column”, transactions apply as usual, in other words, across the database.

Atomicity is a whole different thing. In PostgreSQL 9.4 you basically can’t change part of your JSONB, so -back to our example- if you need to revoke Charlie’s access to the warehouse, you will have to rewrite the whole document:

UPDATE myuser SET properties='{"privilege":5,"warehouse_access":false}' WHERE id=1;

To be fully honest, I see the logic underneath. As you can’t change “part of a cell”, JSONB is no exception.
With that said, though, PostgreSQL 9.5 introduced the jsonb_set function that does just that, but given the upgrade madness I previously mentioned, I have no intention of upgrading again for a while, so bear with me if I haven’t given it a try.
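For the record, on 9.5+ the partial update should look something like the following; I haven’t run it myself, for the reasons above:

```sql
UPDATE myuser
   SET properties = jsonb_set(properties, '{warehouse_access}', 'false')
 WHERE id = 1;
```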


Indexing

Ah here we are. Noblesse oblige.
We have structured data, now what? If you’ve been playing with NoSQL software long enough, you know that indexing is something you will have to deal with sooner or later, and experience plays a very relevant role in this design phase.
In relational databases, you’re mostly like “yeah right, will do”, as long as your data is not meant to be huge.

Now, if you’re going to use PostgreSQL as usual, PLUS you’re including the JSONB thing to store extra stuff, fine, go ahead. But if you’re going to query that JSONB, then I strongly suggest you read this very well done speed test, which undoubtedly points out how querying structured data without an index is slower than MongoDB (while it looks faster in all other situations).

Indexing is… kinda weird.
Back to our example, I can use the following statement:

CREATE INDEX propsindex ON myuser USING GIN (properties);

This will get the WHOLE properties column indexed, so your queries will take advantage of it.

SELECT * FROM myuser WHERE properties @> '{"warehouse_access":true}';

If you’re asking yourself “is this… going to put the whole JSON in an index?”, the answer is yes, and it might not be what you want in a vast number of situations. Also, unless you’re gonna do hardcore random searching, this is just wasted index space, and wasting index space means degrading performance.
Don’t despair though, here’s a way:

CREATE INDEX windex ON myuser USING GIN ((properties -> 'warehouse_access'));
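To actually hit an expression index like this, the query has to repeat the same expression with a GIN-supported operator; something along these lines (my sketch):

```sql
SELECT * FROM myuser
 WHERE properties -> 'warehouse_access' @> 'true';
```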

This solution also offers many other advantages that are out of scope for this hands on article.


Conclusions

First off, I’m pretty new to this feature, so what I’m going to write here are just opinions and sparse thoughts.

Let me be straightforward: I think you can create wonderful disasters with JSONB.
Normal forms are far from being the panacea for every problem, and the multitude of NoSQL, document-based databases demonstrates it in many scenarios. But normal forms are synonymous with order, symmetry, something that makes us control freaks feel comfortable. Until we bust our balls on endless queries with nested joins and ridiculous performance for no obvious reason.

Now, the following list of DON’Ts will seem a bit dramatic, especially when I say you shouldn’t do things the system is actually capable of, but this pretty much relates to two software development rules of mine:

1) prepare for the worst: how many requests did you say? 100 per minute? How will it behave if they are 100 per second? How much HDD space? How about 10 times?

2) use the best tool for the job: if you can choose something that does great, don’t choose something else because you’re a lazy ass.

So here it is:

  • If your plan is to perform aggregate operations, such as grouping or counting, on your structured data as a heavy-duty feature, please don’t: they won’t perform too well, and there are other tools very good at that.
  • If you plan to store unorganized random stuff in that column, don’t, if you’re going to search it aggressively. And it’s bad database design, period.
  • If you plan to massively index full JSONB columns, don’t, if you don’t want your indexes to grow immensely and kill performance.
  • If you plan to use PostgreSQL as a document store, don’t, it’s not one.
  • If your project requirements don’t explicitly say you should use PostgreSQL, or a database of your choice, postponing and delegating the decision, don’t: you might find yourself with a database design that isn’t compatible with other systems.

But if:

  • your rows are going to include beautifully organized columns used as indexes PLUS structured data to be used as a whole
  • … or your JSONB column is actually very well organized with properly indexed fields
  • you’re going to use the structured data as a catalog feature (an array of tags, for example), or any reasonable use that will enrich the capabilities of your table
  • you’re going to take advantage of the database-wide transactional peculiarity
  • your application is not going to constantly update them as a main operation

Then you have probably found the philosopher’s stone, and I think this feature will greatly improve your work and your database design.

And database design is the key. Having a column like JSONB doesn’t mean you can postpone key decisions on how your database is going to work, and this applies to all databases. Even the freedom of MongoDB, for instance, is only apparent. Fail to take key decisions today and you’re going to regret it tomorrow.
Of course you don’t have a crystal ball, and you certainly won’t be able to tell how the database is going to evolve, but this doesn’t mean you should go straight when you could turn left or right.

With that said, though, I see a structured data column as a powerful tool that extends my options in designing a good database, empowering me to let some features evolve without re-engineering the database or being forced into paths that are boring and frustrating, and introducing approaches that make my design stronger and more efficient.


4 years with MongoDB

When I met MongoDB, I still had, seared in my mind, the image of a contractor completely destroyed by the failed attempt to do something similar to what I had been asked to do.
Among a number of questionable decisions, the contractor took a road that smelled much like “tradition”, rather than innovation: MySQL.
Now let’s be clear, I’ve nothing against MySQL or relational databases in general. I still believe they’re the very best solution in most “common” situations. But the challenge we’re talking about was way beyond the reasonable capabilities of a relational database.
In a context where every data structure is almost completely fluid and can contain various levels of nested data, a classic approach made up of hundreds of tables and joins simply wasn’t the way to go, and among a variety of NoSQL databases, a document store was the most reasonable choice.

It’s not that MongoDB is a better database, but it certainly is a better choice for some specific purposes

Keep this very clear in mind: contrary to what happens in the relational world where every piece of technology struggles to be a better implementation of an established model, in the NoSQL world every database radically differentiates its capabilities to deliver a limited subset of powerful features, strictly coupled with the philosophy of the data they’re going to store.

Which means, in other words: data first.

Analyze your problem, your data, the use you’re gonna make of it and then search the technology that best matches your use case.
At that time, MongoDB was roaring among developers, especially in the web world. From the information I gathered, MongoDB looked like the panacea for everything, the Tom Hanks that turns a poor script into an acceptable movie. But formal research and -later on- hands-on experience showed this is far from the truth. But for my use case? It looked astonishingly perfect.

What it is

MongoDB is a NoSQL (meaning that you don’t use SQL to query it) database that stores data in a quite different way, compared to what you probably are used to. Instead of going through a declarative phase, you simply throw stuff at it in what they call collections that I would ideally relate to tables. The stuff you throw at it are documents of data, nested at various levels. So remove from your mind the idea of tables, this is much more like an archive of files. Contrary to key / value stores, though, indexing in MongoDB is extremely advanced, allowing multiple, non-unique, composed indexes. By using these indexes, you can search, filter and sort according to virtually any need. As one would expect, drivers for any non-meaningless programming language are available. Data structures are represented as JSON objects and JavaScript is the language of choice when scripting within the MongoDB shell.

Why use it

This is probably the most sensitive question. If you’re looking for a database to replace your SQL database with something “more exotic” or “supposedly more performant”, well, maybe you should keep reading and eventually reconsider. Among most NoSQL databases, MongoDB could potentially replace a relational database, but I strongly discourage this approach.
MongoDB is not, by any means, a SQL / relational database replacement for two humongous reasons: it doesn’t work with SQL and it is not relational. Captain Obvious in action.
Querying MongoDB is quite a different experience compared to SQL, though it’s definitely not something I would consider hard or requiring special training.
For the relational part… let’s say this: if you think you don’t need constraints, think twice, and be absolutely sure. Managing relations between data items (as you would do in MongoDB to “replace” the relational feature) can be a bit of a drama in an application where data is this complex, without constraints. Any software / person using the database would have the duty of maintaining consistency… To be honest: don’t. Relations between documents (and other documents or other information units) should be simple and weak. The system should be able to keep going if they break.
So why use it then?

  • If you are going to store a shitload of complex data where each item is almost self-explanatory (documents) and you will, most of the time, query documents to digest them entirely, go ahead.
  • If you need to index your complex documents in various creative ways to retrieve them, count them etc., this works really well. To be honest, this is one of the most powerful indexing systems I’ve seen around for this kind of purpose. If your indexing needs are simpler, maybe you should also evaluate other NoSQL servers that might fulfill other needs better.
  • If heavy inserts / edits / deletes are not the real deal of your app, great news, because MongoDB is not so good at them. If, on the contrary, they ARE the deal, then definitely look at other NoSQL servers.
  • If you’re fine with a cluster of servers where a master is allowed to write and the slaves will be eventually consistent, great, go ahead. If you need all your slaves to always be in sync, then MongoDB does not fit.

What you should be aware of

Here are some general hints coming from direct experience.

  • They tell you MongoDB is RAM / disk hungry and you reply: “dude, disk and RAM are cheap now”. Reconsider. You have no freaking idea how hungry it is.
    RAM: it maps the database files into memory, meaning it keeps pages of data in RAM, as much as it can. This is great because if you need data that is currently loaded in memory, it is blazing fast; on the other hand, if you need stuff that is not in memory, it will have to swap pages in, which is not great at all. Disk is cheap all right, but a solid state drive will make a lot of difference in this scenario. How much RAM will it require? As much as you can provide; it’s basically bottomless.
    Disk: to make things more difficult, even the disk hunger grows in a gargantuan fashion. Now, be aware that any database meant to deal with big data will grow on disk like a family of gremlins in a swimming pool, and MongoDB is a total pro at this. It will preallocate and organize a lot of disk just to make I/O ops perform better, and rest assured it will use it.
    Note: I’ve recently read the news about MongoDB 3.0 and it looks like it now employs a technology that vastly reduces this hunger. I haven’t had the time to try it, but I will keep you posted.
  • All these degrees of freedom will make you forget how important it is to be accurate in data design. One day, when the data reaches a significant size, one of your queries will eventually take 5 minutes to produce acceptable results, saturate your disk I/O and potentially kill your system.
    What happened? That query is not picking up one of your indexes and is going straight to disk, reading the whole goddamn database. Don’t underestimate this event: it could turn into a showstopper. No matter how flexible MongoDB feels, data modeling is still a thing, and no, your queries shouldn’t deal with anything but indexes, unless it’s impossible to do otherwise. Of course MongoDB works great for prototyping, and you might even avoid thinking about the indexing thing, but do not even dare to go live without a detailed analysis.
    Good design aside, my suggestion is to take your time and analyze every query you run with the almighty explain command, which will tell you exactly how much effort it takes MongoDB to perform a search / sort for you.
  • Evaluate data transfer. It is generally a good idea in any system, but these are big-ass documents, and if your query will transfer tons of them from the DB to the application, see how that will impact your setup. This problem (which doesn’t necessarily have to become a problem for you) can be very tedious, especially in a highly distributed computing environment where queries are performed all over the world against a semi-centralized data store. There certainly is a solution to mitigate this problem, but it has a price. To conclude: if you’re going to retrieve those fatty documents just to look into a field or two and determine whether they’re what you’re looking for, maybe you need to review how you’re doing it.

Other interesting features

We’ve said what it’s good at, and what is not. But there are a few other things that can make MongoDB ideal for your needs.

  • Indexing. Oh what now, again? Yeah, AGAIN. I told you indexing is awesome, but there’s something to add! MongoDB features full text search indexes on fields. Not too advanced, nothing compared to what you’d want to do with, say, Solr, but it pretty much covers a number of scenarios.
  • Aggregation. As we previously stated, with the numbers you might find yourself dealing with, aggregating data right in the database, before the data actually hits your app, could make a huge difference. In certain scenarios, it might make the difference between making it, or making it blow up. The aggregation API allows you to select fields, group, count and perform basic calculations to generate a new dataset representing aggregated information. It might not be the fastest thing in the world, I know, but it does work, and if your docs are numerous and pretty big, it could save you.
  • MapReduce. Now, I hate this buzz about how “MapReduce can solve world famine”. I won’t get into what MapReduce is in detail, but to cut it short, it’s a technique to write long running processes which analyze lots of data to produce calculated information. There are frameworks out there doing this, and they’re basically the ones allowing you to search on Google, or have a simulacrum of a social life on Facebook. Just like the full text search thing, though, MongoDB comes with a simple MapReduce implementation built in! I never needed it so far, but it does solve a number of problems, and having it there in MongoDB is pretty awesome.
  • Replication. It’s a classic model where a master replicates to a number of slaves. If the master goes down, the slaves hold an election for a new master. Reads can be performed on all the slaves, while writes need to go to the master. What’s really good is that the creation of a MongoDB cluster is a very simple operation, no rocket science here.
  • Sharding. MongoDB has a sharding functionality, but unfortunately I’m not an expert in it, so I suggest you look into it yourself, if that’s a thing for your project.

Things to evaluate during software selection

We’re there, finally. This chapter basically summarizes what we’ve said so far.

  • Who is going to use it: ops? DevOps? MongoDB is extremely easy to manage, it does not require a guru
  • What data will fit in there: remember, big documents with a need to be indexed, with weak relationships with other pieces of data
  • What I/O you will have: lots of reads, a reasonable number of writes
  • How your cluster will distribute: one master and multiple slaves, eventually consistent
  • What resources you will assign to it: lots of them. And the disk has to be fat and fast
  • How you are going to query it: brutally, but only on indexes
  • How you are going to design your data: prototyping can be loose, but you must come up with a great, solid plan before going live
  • How it will resist to change: it basically doesn’t give a f**k, it’s all up to you


As with all trending technologies, you will find a lot of people out there saying it sucks, and lots of people saying it rocks. But most people just whine.
MongoDB is like Bon Jovi. It’s fun, does the job, but if you expect it to be heavy metal you will probably end up throwing up. You can’t be cute and a true headbanger at the same time.
There’s nothing in this world that can help you avoid compromises or the need to write ad hoc code to bypass the limitations of the tools you use (or a flaw in the requirements).
With that said, though, you should fight your desire to solve all your problems with one technology, and see what MongoDB is actually good at. And in what it’s good at, MongoDB is a beast.
Moreover, the hype this database created not only boosted adoption but made MongoDB grow in quality and features with a determination that I consider rare and precious in this field. Lots of things have happened in these four years, and the product developed features that were unexpected when I started using it.

Would I recommend MongoDB? Hell yes.