Wednesday, September 30, 2015

Mirror, Mirror on the Wall, Where's the Data? In My Vol

When I first started working with Docker last year, there was already a clear pattern out there: the Docker image itself consists only of the application binary (and, depending on the philosophy, the entire set of OS libraries it requires), and all application data goes in a volume.

The concept called a "Data Container" also seemed to be somewhat popular at that time. Not everyone bought into that philosophy, and various other patterns were emerging for how people used volumes with their Docker containers.

One of the emerging patterns was (and still is) "initialize the data if it does not exist" during container startup.
Let's face it: when we first start a Docker container holding, say, a PostgreSQL 9.4 database, the volume is an empty file system. We then run initdb and set up a database so that it is ready to serve.
The simplest way is to check whether the data directory has data in it; if it does not, run initdb, apply the most common database best practices, and serve it up.

Where's the simplest place to do this? In the entrypoint script of the Docker container, of course.

I made the same mistake in my jkshah/postgres:9.4 image too. In fact, I still see the same pattern in the official postgres Docker image, where it looks for PG_VERSION and, if it does not exist, runs initdb.

if [ ! -s "$PGDATA/PG_VERSION" ]; then
    # No PG_VERSION file means an empty data directory: initialize a new cluster.
    gosu postgres initdb

    ...
fi
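
Fleshed out, the whole pattern usually looks something like the sketch below. This is only an illustration of the idea, not the official entrypoint; the pg_hba.conf tweak and the final exec are my assumptions about what the elided setup steps might contain.

#!/bin/bash
set -e

if [ ! -s "$PGDATA/PG_VERSION" ]; then
    # Empty data directory: bootstrap a brand-new cluster on the fly.
    gosu postgres initdb

    # Hypothetical example of "best practice" setup baked into the image,
    # e.g. allowing password-authenticated remote connections.
    echo "host all all 0.0.0.0/0 md5" >> "$PGDATA/pg_hba.conf"
fi

# Hand off to the server either way - freshly initialized or pre-existing.
exec gosu postgres postgres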

This certainly has advantages:
1. The script is very simple to code.
2. Great out-of-the-box experience: you start the container, it sets itself up, and it is ready to use.

Let's look at what happens next in real-life enterprise usage.

We ran into scenarios where the applications using such databases were running fine but had lost all the data in them. Hmm, what's going wrong here? The application is working, the database is working, but all the data looks freshly deployed, not like a system that had been running well for 3-5 months.

Let's look at the various activities an enterprise will typically perform on such a data volume, i.e. the file system on the host where the PostgreSQL containers are running:
1. The host location of the volume will itself be a mounted file system coming off a SAN or some other storage device.
2. The enterprise will back up that file system at periodic intervals.
3. In some cases they will restore that file system when required.
4. Sometimes the backend storage may have hiccups. (No! That never happens :-) )

In any of the above cases, if a mount fails, the wrong file system gets mounted, or a restore fails, you can end up with an empty file system at the volume path. (Not everyone had checks for this.)
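
A cheap guard against that failure mode, sketched below under the assumption that the data directory is always supposed to be a dedicated mount point, is to refuse to start at all when the volume path is not actually mounted. The mountpoint utility check and the $PGDATA path here are illustrative, not taken from any official image.

# Refuse to start if the volume path is not a mounted file system.
# Assumes $PGDATA is always meant to be a dedicated mount point.
if ! mountpoint -q "$PGDATA"; then
    echo "FATAL: $PGDATA is not mounted; refusing to start" >&2
    exit 1
fi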

Now when you start the PostgreSQL Docker container on such a volume, you get a brand-new, fully initialized database. Most of the automation I have seen works such that, in those cases, even the application will fully initialize the database with its own schema and initial data, and then moves on as if nothing is wrong.
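
Much application-side automation does something like the following hypothetical bootstrap check (the table name, seed files, and connection details are made up for illustration), which is exactly what quietly repopulates a freshly re-initialized database:

# Hypothetical application bootstrap: if the schema is missing,
# create it and load seed data, then carry on as if nothing happened.
if ! psql -h db -U app -tAc \
    "SELECT 1 FROM information_schema.tables WHERE table_name='users'" | grep -q 1; then
    psql -h db -U app -f /app/schema.sql
    psql -h db -U app -f /app/seed-data.sql
fi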

In the above case the application may appear healthy to all probes, until a customer tries to log in to the setup and finds that they do not exist in the system.

For DBAs the cardinal rule is that a "no data" error is better than wrong or lost data served out of a database (especially for PostgreSQL users). For this reason, this particular pattern of database initialization is, in my view, becoming an ANTI-pattern, especially for Docker containers. A better approach is to have one entrypoint command that performs setup (initialization) knowingly, and to have all subsequent starts use another entrypoint command that specifically fails if it does not find the data.
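
A minimal sketch of that split, assuming a hypothetical entrypoint that takes an explicit command argument (the command names and layout are mine, not from any official image):

#!/bin/bash
set -e

case "${1:-run}" in
    init)
        # Deliberate, one-time initialization; refuse to clobber existing data.
        if [ -s "$PGDATA/PG_VERSION" ]; then
            echo "FATAL: $PGDATA already contains a database; refusing to re-init" >&2
            exit 1
        fi
        gosu postgres initdb
        ;;
    run)
        # Normal start: fail loudly instead of silently re-initializing.
        if [ ! -s "$PGDATA/PG_VERSION" ]; then
            echo "FATAL: no database found in $PGDATA; run 'init' first" >&2
            exit 1
        fi
        exec gosu postgres postgres
        ;;
    *)
        echo "usage: $0 {init|run}" >&2
        exit 1
        ;;
esac

With this split, an empty volume caused by a failed mount or restore makes every regular start fail loudly, and re-initialization can only happen when an operator asks for it explicitly.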

Of course, again, this is a philosophical view on how it should be handled. I would love to hear what people have to say about this.

