NoSQL and CMS – Comparing NoSQL with CPS Requirements

Now that we have both our CPS requirements mapped out, and a reasonable understanding of the NoSQL Universe, we can finally get to the fun stuff: mapping requirements to technologies.

For this kind of exercise I’m quite partial to ye olde comparison matrix – a table with the alternatives listed horizontally (in our case that’s the different classes of NoSQL technology) and with comparison criteria listed vertically (in our case our CPS requirements). You then play a “fill in the blanks” game for all of the cells in the table.

While simple and effective, comparison matrices have a tendency to be rather information-dense – for that reason I’ve elected to use boolean yes / no values in each cell, rather than more granular scoring systems (such as numerical scales). If we were doing a more thorough comparison (say, if we were actually implementing a CPS on top of a NoSQL technology) I would want to use a variety of more granular scoring systems.

Let’s dive in:

Key / Value BigTable-like Document Graph Relational (for comparison)
1 Richly structured content types 1 1
2 Unstructured binary objects 2 3
3 Relationships / references / associations
4 Schema evolution4 5
5 Branch / merge
6 Snapshot-based versioning
7 ACID transactions6 7
8 Scalability to large content sets 8
9 Geographic distribution 9
REQUIREMENTS MET 4 4 5 5 4

One of the nice things about comparison matrices is that they immediately highlight the differentiating criteria for any given comparison, allowing one to focus the discussion on those areas where compromises will be necessary.

Looking at the matrix above, the eye is immediately drawn to rows 1, 2 and 3, which is where the variability between the different classes of NoSQL is found. In fact what this immediately tells us is that there’s no “perfect” class of NoSQL for our CPS requirements – none of them satisfy all three of the following requirements:

  • Richly structured content types
  • Unstructured binary objects
  • Relationships / references / associations

What’s worse are rows 5 and 6, and (to a lesser extent) row 9. These rows indicate that there are entire requirements that no extant NoSQL technology satisfies.

In looking at these 3 requirements in isolation, there is an obvious class of systems that do satisfy them – Source Code Management (SCM) systems, especially Distributed SCMs (Mercurial, Bazaar, git and the like). It is important to stress that DSCMs don’t meet many of the other CPS requirements however (for one thing the “data model” they expose is a filesystem, which is completely unsuited for CPS content modeling), so they’re by no means an appropriate CPS repository in and of themselves.

The conclusion I’ve drawn is that the ideal CPS repository would be a three-way love child between 2 classes of NoSQL solution (document and graph databases) and DSCM. Specifically:

  • Document database style data model (satisfies requirements #1, #2, #4 and #7) plus Graph database style relationships (satisfies requirement #3)
  • SCM style versioning and branch/merge (satisfies requirements #5 and #6)
  • DSCM or CouchDB style multi-master replication and conflict detection (satisfies requirement #9)

I think of this hybrid as a “Document-Relational10 Version Control System” (DRVCS for the acronym junkies).

All three kinds of system meet requirement #8 when used in isolation, and a cursory analysis doesn’t reveal any obvious reasons why the DRVCS couldn’t retain this characteristic (although the devil is often in the details when it comes to questions of scale).

The bigger questions here are whether anyone is considering implementing such a repository, and if not whether it’s possible to emulate the missing requirements on top of any of the extant NoSQL technologies.

To the first point, I’m not aware of any efforts aimed at implementing a DRVCS, although I’m by no means familiar with all of the repository development going on out there. I’m also sure I’ll receive comments that such-and-such a technology already meets all of these requirements. I’m skeptical that such a beast actually exists, but would love to be proven wrong – what would convince me is the addition of the technology to the comparison matrix above, with a brief description in each of the new column’s cells describing how the technology satisfies each requirement.

I’ve also spent a bit of time thinking about whether it would be possible to build out the missing features on top of an extant NoSQL technology, and at this stage my gut tells me that the branch / merge and snapshot based versioning requirements would be difficult (and probably impractical) to build on top of any of these technologies. These features are technically complex enough that they deserve first class support in the repository, rather than being implemented as an “emulation” layer on top of something else.

Next Up

Next up I’ll be reviewing the repository-level requirements for Presentation Management CMSes, and we’ll repeat this comparison process to see how NoSQL technologies stack up against that use case. Fingers crossed for a more positive outcome!



1 Standard practice in graph and relational databases is to decompose complex objects into separate vertexes / rows with connecting edges / foreign keys. I don’t particularly like this approach since it confuses true inter-object relationships (“links” between otherwise independent content items) with the internally rich data structures of those content items. With a relational databases you can fake it by using 1-M foreign key constraints with CASCADE DELETE, although I see that as a weak substitute for “real” richly structured content types. Both relational and graph databases also impose “reconstitution costs” (joins) when an object is retrieved in its entirety – document databases don’t have this issue (this issue is nicely described in The Future of Databases).
2 While it is possible to store large binary objects in most graph databases, some (notably Neo4J) don’t recommend doing so.
3 While it is possible to store large binary objects in most relational databases (via LOB data types), this is generally frowned upon as relational databases typically aren’t very efficient at handling them and there are no standard ways to perform certain types of I/O within a single binary object (e.g. pulling out particular byte ranges, random I/O, etc.). In fact I was very close to putting a cross in this cell in the matrix – there are very good reasons why most CMSes that use a relational database choose to store binaries outside in a “real” filesystem.
4 Virtually all of the NoSQL classes mentioned here support Schema Evolution but not directly – instead they are “schemaless”, which means they don’t impose or require that the data model is declared to the repository up front. This is in fact an ideal way to support schema evolution, by allowing higher level CMS-specific logic to decide what a “schema” is, and how schema evolution will be supported, without the underlying persistence mechanism getting in the way. Compare this to the equivalent operation in a relational database (while recalling the good old days of multi-day “ALTER TABLE” marathons!) and you’ll quickly appreciate how liberating a schemaless repository can be.
5 Typical BigTable-like systems aren’t fully schemaless, in that some form of up front “data modeling” is required, and once instances of that schema exist the options for modifying those models are limited.
6 Most NoSQL solutions only support ACIDity in predefined ways – they don’t typically allow external logic to programmatically demarcate arbitrary transactional boundaries. For example document databases typically only support ACIDity for a single document at a time, but not for batches of documents (note that the graph database Neo4J is a notable exception to this general trend). In my opinion this is perfectly adequate for most CMS uses cases, including the CPS use case.
7 MongoDB’s default behaviour is not what is traditionally thought of as ACID – in particular consistency and durability are relaxed in order to improve performance. That said MongoDB provides mechanisms for the implementer to increase both consistency and durability (at the cost of performance) – these facilities are described in this great guide.
8 Many in the NoSQL movement would tout this as the primary limitation of relational databases, however recall that for the Content Production use case, the scalability to large traffic volumes is not a requirement (an editorial team cannot generate anything approaching an internet scale traffic load), and the data volumes are not necessarily beyond what a relational database can handle either (a typical web site has 10s to 100s of thousands of discrete content items, and when stored in a relational data model that would typically equate to one or two orders of magnitude more rows). Some might argue I’m being somewhat lenient on relational databases in this analysis, but there are reasons they’ve enjoyed such widespread adoption for so long, beyond the momentum / familiarity arguments.
9 CouchDB stands out in this regard, with exceptional support for geographical distribution. It achieves this by offering a replication mechanism that can be used to create multi-master topologies with automatic conflict detection (so that concurrent updates in different places are detected, and can be resolved by higher level logic – in our case that would be done by the hypothetical CPS logic that’s sitting on top of CouchDB). CouchDB’s replication functionality is, in my opinion, a perfect fit for this requirement, and we’ll be hearing more about it in my upcoming post on NoSQL in the context of Document Management CMSes.
10 “Relational” isn’t the most accurate term to describe the “relationship / reference / association” part of this hypothetical system, but I haven’t come up with a better alternative.

Share via del.icio.us Share via Digg Share via Facebook Bookmark in Google Share via MySpace Share via Reddit Share via StumbleUpon Favourite in Technorati Share via Twitter

About these ads

The URI to TrackBack this entry is: http://contentcurmudgeon.wordpress.com/2010/07/12/nosql-and-cms-comparing-nosql-with-cps-requirements/trackback/

RSS feed for comments on this post.

5 CommentsLeave a comment

  1. I dont think combining say a graph database with a key value store is very hard, to make a combination that covers 1,2 and 3 well – you could just run both in theory, although managing it would be a bit messy and you do want your reliability infrastructure shared.

    Also DVCS systems are not *very* filesystem oriented – underneath git versions a set of objects, with some hooks to support filesystem objects such as directories, but it does not have much about filesystems built in.

    So testing with combining existing systems might be a useful experiment to see what kind of API really works.

    • Yeah the “DRVCS” idea I’ve laid out here is really the holy grail of CPS repositories, as I see it. If / when you’re doing this “for real” you’re entering practical realities land – if you were using any of the above technologies as they currently exist, compromise enters the picture and at that point integrations of some of the above technologies become worthy of consideration.

      The obvious bug bear to any kind of integration is whether you can give up on ACIDity for all of those write operations that span multiple technologies (since most of the NoSQL technologies don’t support arbitrary ACID transactions, let alone distributed transactions that span other repositories). Having lived through this problem a number of times it’s not necessarily a trivial problem to mitigate – I would want to identify all of the primitive operations that must be ACID and see whether they can be limited to a single NoSQL system to at least bring it into the realms of feasibility. Alternatively you might just accept basic inconsistencies as part of the NoSQL world view, and code defensively for it in the CPS logic.

      As a side note many people are doing these kinds of integrations (albeit for other use cases) – there’s lots of chatter about things like CouchDB / Lucene integrations, enhancing Lucene to use Cassandra as a storage engine, even emulating CouchDB using git and bash (unfortunately the original article appears to have been deleted, but I recall reading it at the time and thinking “that’s a clever hack!”).

      I’m intrigued to dive under the covers and learn more about git’s internal storage model, particularly given the emulated CouchDB mentioned above (which proves that other “NoSQL”-like attributes are possible on top of it). That said I’m a little cautious about git given Google’s analysis when they were considering it for inclusion in their Google Code service. I recall reading another presentation (which I believe was from Google I/O 2009 – anyone recall something like this?) where they basically pooh-poohed git’s internal architecture as being difficult to extend / enhance. Google’s specific requirement was to be able to modify git to use BigTable for storage rather than a filesystem, and the gist I got from that presentation was that it proved to be too difficult, so they gave up and switched to Mercurial (which was characterised as having a far cleaner design and implementation). I’m working from vague recollection here, so please jump in if you know more about it (particularly if I’ve mischaracterised Google’s decision!).

  2. What would have been interesting is including a column for SQL databases to provide a baseline for discussion among those less-versed in NOSQL.

    -Pie

    • That’s a great idea! I’ll add it now, for comparison.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: