PostgreSQL JSONB type – first time hands on!

Just when you thought you knew something about databases…

From relational to NoSQL, from MongoDB to Cassandra to Elastic… there’s been quite a lot going on. Or rather, you know you’re getting old when all these events are packed close together in your memory. But you’re OK with it and understand how this new world has its own rules, where the most important one is:

Each one of these is very good at some things, and very bad at other things

Then something happens and you don’t know what to think about it: your life companion PostgreSQL has a structured data column type.

It’s like when your wife says “I’m pregnant” and you’re not really sure if you’re like: “I’m gonna be a father”, laugh, cry, buy a ticket to Cuba, “it must be someone else’s”, summarize in your head the gazillion of problems you’re gonna face, hug your wife.

And if you think this comparison is excessively nerdy, you gotta spend some time with a developer’s wife/husband and hear it from her/him.

Anyway, to understand how, you’d better figure out why (I’m talking about databases now…), and to do that, I realized I had to play with it before forming an opinion.

Getting started

Well, first off, make sure you have at least PostgreSQL 9.4 installed. On my Ubuntu 14.04 LTS I had to upgrade it by adding another APT repository. Also keep in mind that when upgrading like this, you’re not actually upgrading but adding another instance on the side, and they will both run, somehow.

Next, launch your psql client. If you’re completely new to PostgreSQL (I hope not, because I’m definitely not going to help you with the basics), you might find it easier to su to the postgres user and continue from there. But again, if you’re completely new to it, I suggest you start from here: http://www.postgresql.org/docs/9.4/static/tutorial-start.html .

Once you’re in, create a test database and connect to it:

CREATE DATABASE testj;
\c testj

Table definition

PostgreSQL ain’t MongoDB. We’re still in a relational database, we still have tables, we still have schemas. This JSONB thing in PostgreSQL is a column type. To better understand how we can benefit from this data type, I’d like to use the very first example that came into my mind.
When creating an application with registered users, you know exactly what a user is about (email address, full name, country etc.), but depending on what your application is going to do, you might have a number (at first finite, then… who knows) of other user-related properties that are going to be fluid, change over time, and might even be customized by users themselves. Access privileges, for instance.
Until now, you had four options.

  1. Alter the table to add columns when needed. Meh…
  2. Create a “properties” table and join when needed. Better but still meh…
  3. If you’re already using a document database, create a document of properties for each user. Seen and done, but boring and capable of creating lots of garbage.
  4. Store serialized structured data in one column as text. Acceptable, but no fun, as there’s no way to query the database for particular properties and you will need to deserialize it each time…

This JSONB column seems to do just fine for this simple purpose.

CREATE TABLE myuser (
                     id         SERIAL PRIMARY KEY,
                     full_name  TEXT   NOT NULL,
                     email      TEXT   NOT NULL,
                     properties JSONB  NOT NULL);

As you can see, this is a usual create table, with a twist: the JSONB column.
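One small tweak I’d consider (the DEFAULT clause is my addition, not something the rest of this article relies on): giving the column a default empty object, so your application never has to deal with a missing properties document.

CREATE TABLE myuser (
                     id         SERIAL PRIMARY KEY,
                     full_name  TEXT   NOT NULL,
                     email      TEXT   NOT NULL,
                     properties JSONB  NOT NULL DEFAULT '{}');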

Inserting

By now, we know enough to do this:

INSERT INTO myuser
           (full_name,email,properties)
           VALUES ('Charlie Sheen',
                   'charlie@thisisatest.com',
                   '{"privilege":5,"warehouse_access":true}');

What’s really interesting here is that the properties field is inserted as you would insert text, but the content is actually validated. Try to introduce a syntax error in the JSON and you will see…
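For example, leaving out the closing brace (the exact error message may differ slightly between versions) gets you something like:

INSERT INTO myuser
           (full_name,email,properties)
           VALUES ('Broken Json',
                   'broken@thisisatest.com',
                   '{"privilege":5');

ERROR:  invalid input syntax for type json

The row is simply rejected, which is exactly the kind of safety a plain text column can’t give you.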

Let’s query that to learn more:

SELECT * FROM myuser;

 id |   full_name   |          email          |                 properties                 
----+---------------+-------------------------+--------------------------------------------
  1 | Charlie Sheen | charlie@thisisatest.com | {"privilege": 5, "warehouse_access": true}

Pretty clean, but still not a big difference from a text column.

Querying

Now let’s see why this is better than storing a piece of text.
The very first thing that worries me when it comes to storing big stuff in a database is: if I need to retrieve a subset of fields from a big number of big rows, how much junk is going to go through the wire?
You know this is not a problem with SQL servers, as the SELECT clause allows you to finely choose what you need.
In other systems, such as MongoDB, you need a specific API to do so (projections, or the aggregation framework).
This is a very important topic because on heavy duty systems this can cause you a lot of trouble. Other than the traffic on the wire, you need to realize your software will have to parse the data once it makes it into the system. I have a few terrible stories where 80% of the query processing time was taken by the software trying to represent the received data.

Say I want to select the privilege level of users (I’ve added another user in the meantime):

SELECT id,properties->'privilege' AS privilege FROM myuser;

 id | privilege 
----+----------
  1 | 5
  2 | 2

Say I want to know how many people have access to the warehouse:

SELECT COUNT(*), properties->'warehouse_access' AS warehouse_access
                  FROM myuser
                  GROUP BY properties->'warehouse_access';

 count | warehouse_access 
-------+------------------
     1 | true
     1 | false

Say I want to have the list of people having access to the warehouse:

SELECT * FROM myuser WHERE properties->>'warehouse_access'='true';

 id | full_name     | email                   | properties 
----+---------------+-------------------------+--------------------------------------------
 1  | Charlie Sheen | charlie@thisisatest.com | {"privilege": 5, "warehouse_access": true}
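A detail that bit me at first, and the reason the WHERE clause above compares against the string 'true': the -> operator returns jsonb, while ->> returns the value as text. A quick way to see it (the single row with id=1 is just for illustration):

SELECT properties->'privilege'  AS as_jsonb,
       properties->>'privilege' AS as_text
       FROM myuser WHERE id=1;

They print the same in psql, but the types differ, and mixing them up in comparisons is a classic source of confusion.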

Or you can decide to match some document properties (in a Mongoish way) by:

SELECT * FROM myuser WHERE properties @> '{"warehouse_access":true,"privilege":5}';

The syntax looks a bit upsetting at first, but the thing is they decided to create multiple operators, which you can look up in the PostgreSQL documentation.

Transactions and atomicity

This is probably one of the most intriguing things about this feature.
Transactions anyone? YES. As I previously said, PostgreSQL didn’t change its intimate nature, therefore transactions are part of its daily routine. And as a JSONB-typed column is “just a column”, transactions apply as usual, in other words, across the database.

Atomicity is a whole different thing. In PostgreSQL 9.4 you basically can’t change part of your JSONB, so (back to our example) if you need to revoke Charlie’s access to the warehouse, you will have to:

UPDATE myuser SET properties='{"privilege":5,"warehouse_access":false}' WHERE id=1;

To be fully honest, I see the logic underneath. As you can’t change “part of a cell” for any other column type, JSONB is no exception.
With that said, though, PostgreSQL 9.5 introduced the jsonb_set function that does just that, but given the upgrade madness I previously mentioned, I have no intention of upgrading again for a while, so bear with me if I haven’t given it a try.

Indexing

Ah here we are. Noblesse oblige.
We have structured data, now what? If you’ve been playing with NoSQL software long enough, you know that indexing is something you will have to deal with sooner or later, and experience plays a very relevant role in this design phase.
In relational databases you are mostly like “yeah right, will do”, as long as your data is not meant to be huge.

Now, if you’re going to use PostgreSQL as usual, PLUS you’re including the JSONB thing to store extra stuff, fine, go ahead. But if you’re going to query that JSONB, then I strongly suggest you read this very well done speed test, which undoubtedly points out how querying structured data without an index is slower than MongoDB (while it looks faster in all other situations).

Indexing is… kinda weird.
Back to our example, I can use the following statement:

CREATE INDEX propsindex ON myuser USING GIN (properties);

This will get the WHOLE properties column indexed, so queries like the following will take advantage of it.

SELECT * FROM myuser WHERE properties @> '{"warehouse_access":true}';

If you’re asking yourself “is this… going to put the whole JSON in an index?”, the answer is yes, and that might not be what you want in a vast number of situations. Also, unless you’re gonna do hardcore random searching, this is just wasted index space, and wasting index space means degrading performance.
Don’t despair though, here’s a way:

CREATE INDEX windex ON myuser USING GIN (properties -> 'warehouse_access');
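To actually benefit from this targeted index, the query has to use the same expression. Something like this (I’m using the containment operator, since that’s what GIN supports):

SELECT * FROM myuser
       WHERE properties -> 'warehouse_access' @> 'true';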

This solution also offers many other advantages that are out of scope for this hands on article.
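One more option worth mentioning, halfway between the two: the jsonb_path_ops operator class still indexes the whole column, but produces a considerably smaller index, at the price of supporting only the @> operator. The index name here is just my pick for the example:

CREATE INDEX propsindex2 ON myuser USING GIN (properties jsonb_path_ops);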

Conclusion

First off, I’m pretty new to this feature, so what I’m going to write here are just opinions and sparse thoughts.

Let me be straight forward: I think you can create wonderful disasters with JSONB.
Normal forms are far from being the panacea for every problem, and the multitude of NoSQL, document-based databases demonstrates it in many scenarios. But normal forms are synonymous with order, symmetry, something that makes us control freaks feel comfortable. Until we bust our balls on endless queries with nested joins and ridiculous performance for no obvious reason.

Now, the following list of DON’Ts will seem a bit dramatic, especially when I say you shouldn’t do things the system is actually capable of, but it pretty much relates to two software development rules of mine:

1) prepare for the worst: how many requests did you say? 100 per minute? How will it behave if they are 100 per second? How much HDD space? How about 10 times?

2) use the best tool for the job: if you can choose something that does great, don’t choose something else because you’re a lazy ass.

So here it is:

  • If your plan is to make aggregated operations on your structured data, such as grouping or counting, as a heavy-duty feature, please don’t: they won’t perform too well, and there are other tools very good at that.
  • If you plan to store unorganized random stuff in that column, don’t, if you’re then going to search it aggressively. It’s bad database design, period.
  • If you plan to massively index full JSONB columns, don’t if you don’t want your indexes to grow immensely and kill performance.
  • If you plan to use PostgreSQL as a document store, don’t: it’s not one.
  • If your project requirements don’t explicitly name PostgreSQL (or a database of your choice), thus postponing and delegating the decision, don’t: you might find yourself with a database design that isn’t compatible with other systems.

But if:

  • your rows are going to include beautifully organized columns used as indexes PLUS structured data to be used as a whole
  • … or your JSONB column is actually very well organized with properly indexed fields
  • you’re going to use the structured data as a catalog feature (an array of tags, for example) or for any reasonable use that will enrich the capabilities of your table
  • you’re going to take advantage of the database-wide transactional peculiarity
  • your application is not going to constantly update them as a main operation

Then you’ve probably met the philosopher’s stone, and I think this feature will greatly improve your work and your database design.

And database design is the key. Having a column type like JSONB doesn’t mean you can postpone key decisions on how your database is going to work, and this applies to all databases. Even the freedom of MongoDB, for instance, is only apparent. Fail to take key decisions today and you’re going to regret it tomorrow.
Of course you don’t have a crystal ball, and you certainly won’t be able to tell how the database is going to evolve, but this doesn’t mean you’re allowed to go straight when you’re only supposed to turn left or right.

With that said, though, I see a structured data column as a powerful tool that extends my options in designing a good database, empowering me to let some features evolve without re-engineering the database or being forced into paths that are boring and frustrating, introducing approaches that make my design stronger and more efficient.