For this kind of exercise I’m quite partial to ye olde comparison matrix – a table with the alternatives listed horizontally (in our case that’s the different classes of NoSQL technology) and with comparison criteria listed vertically (in our case our CPS requirements). You then play a “fill in the blanks” game for all of the cells in the table.
While simple and effective, comparison matrices have a tendency to be rather information-dense – for that reason I’ve elected to use boolean yes / no values in each cell, rather than more granular scoring systems (such as numerical scales). If we were doing a more thorough comparison (say, if we were actually implementing a CPS on top of a NoSQL technology) I would want to use a variety of more granular scoring systems.
Let’s dive in:
| # | Requirement | Key / Value | BigTable-like | Document | Graph | Relational (for comparison) |
|---|---|---|---|---|---|---|
| 1 | Richly structured content types | | | | [1] | [1] |
| 2 | Unstructured binary objects | | | | [2] | [3] |
| 3 | Relationships / references / associations | | | | | |
| 5 | Branch / merge | | | | | |
| 8 | Scalability to large content sets | | | | | [8] |
One of the nice things about comparison matrices is that they immediately highlight the differentiating criteria for any given comparison, allowing one to focus the discussion on those areas where compromises will be necessary.
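As a rough illustration of how a boolean matrix surfaces those differentiating criteria, here is a minimal Python sketch. The option and criterion names (`option_x`, `criterion_a`, and so on) and their yes / no values are placeholders invented for the example; they are not the cells of the matrix in this post, and the helper functions are hypothetical.

```python
from typing import Dict, List

# Hypothetical boolean comparison matrix: criterion -> {option: satisfied?}.
# Names and values are placeholders, not the cells of the matrix in this post.
Matrix = Dict[str, Dict[str, bool]]


def differentiating_criteria(matrix: Matrix) -> List[str]:
    """Criteria on which the options disagree (some yes, some no)."""
    return [c for c, row in matrix.items() if len(set(row.values())) > 1]


def unmet_criteria(matrix: Matrix) -> List[str]:
    """Criteria that no option satisfies at all."""
    return [c for c, row in matrix.items() if not any(row.values())]


def options_meeting_all(matrix: Matrix, criteria: List[str]) -> List[str]:
    """Options that satisfy every one of the given criteria."""
    options = next(iter(matrix.values())).keys()
    return [o for o in options if all(matrix[c][o] for c in criteria)]


example: Matrix = {
    "criterion_a": {"option_x": True,  "option_y": False},
    "criterion_b": {"option_x": False, "option_y": True},
    "criterion_c": {"option_x": True,  "option_y": True},
    "criterion_d": {"option_x": False, "option_y": False},
}

print(differentiating_criteria(example))  # ['criterion_a', 'criterion_b']
print(unmet_criteria(example))            # ['criterion_d']
# No single option covers both differentiating criteria:
print(options_meeting_all(example, differentiating_criteria(example)))  # []
```

Rows where every option agrees drop out of the discussion, which is exactly why the boolean form is convenient for a first pass; a more granular scoring system would need a threshold or ranking in place of the simple set comparison.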
Looking at the matrix above, the eye is immediately drawn to rows 1, 2 and 3, which is where the variability between the different classes of NoSQL is found. In fact what this immediately tells us is that there’s no “perfect” class of NoSQL for our CPS requirements – none of them satisfy all three of the following requirements:
- Richly structured content types
- Unstructured binary objects
- Relationships / references / associations
What’s worse are rows 5 and 6, and (to a lesser extent) row 9. These rows indicate that there are entire requirements that no extant NoSQL technology satisfies.
In looking at these 3 requirements in isolation, there is an obvious class of systems that do satisfy them – Source Code Management (SCM) systems, especially Distributed SCMs (Mercurial, Bazaar, git and the like). It is important to stress, however, that DSCMs don’t meet many of the other CPS requirements (for one thing, the “data model” they expose is a filesystem, which is completely unsuited for CPS content modeling), so they’re by no means an appropriate CPS repository in and of themselves.
The conclusion I’ve drawn is that the ideal CPS repository would be a three-way love child between 2 classes of NoSQL solution (document and graph databases) and DSCM. Specifically:
- Document database style data model (satisfies requirements #1, #2, #4 and #7) plus Graph database style relationships (satisfies requirement #3)
- SCM style versioning and branch/merge (satisfies requirements #5 and #6)
- DSCM or CouchDB style multi-master replication and conflict detection (satisfies requirement #9)
I think of this hybrid as a “Document-Relational[10] Version Control System” (DRVCS for the acronym junkies).
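To make the shape of such a hybrid a little more concrete, here is a minimal in-memory sketch in Python. It is only an illustration under my own assumptions, not a design for a real repository: the `ToyRepository` and `Commit` names, the `linked_from` helper and the `illustrated_by` relationship type are all hypothetical. It models nested, schemaless documents, typed links between them, and immutable snapshots organised into branches. Merging, binary storage and replication are deliberately left out.

```python
import copy
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Link = Tuple[str, str, str]  # (source doc id, relationship type, target doc id)


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot of the whole repository (SCM-style versioning)."""
    id: str
    parent: Optional[str]
    documents: Dict[str, dict]  # doc id -> nested, schemaless document
    links: Tuple[Link, ...]     # graph-style relationships between documents


class ToyRepository:
    """Hypothetical in-memory store; documents must be JSON-serialisable."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.branches: Dict[str, Optional[str]] = {"main": None}

    def commit(self, branch: str, documents: Dict[str, dict],
               links: List[Link]) -> str:
        """Record a new snapshot on a branch and return its content-derived id."""
        parent = self.branches[branch]
        payload = json.dumps([parent, documents, sorted(links)], sort_keys=True)
        commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
        self.commits[commit_id] = Commit(
            commit_id, parent, copy.deepcopy(documents), tuple(sorted(links)))
        self.branches[branch] = commit_id
        return commit_id

    def branch(self, new_branch: str, from_branch: str) -> None:
        """Create a branch pointing at the current head of another branch."""
        self.branches[new_branch] = self.branches[from_branch]

    def checkout(self, branch: str) -> Commit:
        """Return the head snapshot of a branch (which must have a commit)."""
        return self.commits[self.branches[branch]]

    def linked_from(self, branch: str, doc_id: str, rel: str) -> List[str]:
        """Follow one relationship type outwards from a document."""
        head = self.checkout(branch)
        return [t for (s, r, t) in head.links if s == doc_id and r == rel]


repo = ToyRepository()
article = {"title": "Hello", "body": {"paragraphs": ["First draft."]}}
image = {"filename": "logo.png", "mime_type": "image/png"}
links = [("article-1", "illustrated_by", "image-1")]
repo.commit("main", {"article-1": article, "image-1": image}, links)

# Work on a branch without disturbing main.
repo.branch("redesign", "main")
article_v2 = copy.deepcopy(article)
article_v2["title"] = "Hello, world"
repo.commit("redesign", {"article-1": article_v2, "image-1": image}, links)

print(repo.checkout("main").documents["article-1"]["title"])        # Hello
print(repo.checkout("redesign").documents["article-1"]["title"])    # Hello, world
print(repo.linked_from("redesign", "article-1", "illustrated_by"))  # ['image-1']
```

Snapshotting the entire document set on every commit is obviously wasteful; it is done here only to keep the sketch short, and a real system would presumably share unchanged content between snapshots the way DSCMs do.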
All three kinds of system meet requirement #8 when used in isolation, and a cursory analysis doesn’t reveal any obvious reasons why the DRVCS couldn’t retain this characteristic (although the devil is often in the details when it comes to questions of scale).
The bigger questions here are whether anyone is considering implementing such a repository, and if not whether it’s possible to emulate the missing requirements on top of any of the extant NoSQL technologies.
To the first point, I’m not aware of any efforts aimed at implementing a DRVCS, although I’m by no means familiar with all of the repository development going on out there. I’m also sure I’ll receive comments that such-and-such a technology already meets all of these requirements. I’m skeptical that such a beast actually exists, but would love to be proven wrong – what would convince me is the addition of the technology to the comparison matrix above, with a brief description in each of the new column’s cells describing how the technology satisfies each requirement.
I’ve also spent a bit of time thinking about whether it would be possible to build out the missing features on top of an extant NoSQL technology, and at this stage my gut tells me that the branch / merge and snapshot based versioning requirements would be difficult (and probably impractical) to build on top of any of these technologies. These features are technically complex enough that they deserve first class support in the repository, rather than being implemented as an “emulation” layer on top of something else.
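To give a feel for the complexity involved, here is a small Python sketch of a three-way merge for a single nested document. It is an illustration only, under simplifying assumptions of my own: documents are plain dicts, lists are treated as atomic values, and the `three_way_merge` function is hypothetical rather than taken from any existing NoSQL product or SCM.

```python
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel for "field absent in this version"


def three_way_merge(base: Dict[str, Any], ours: Dict[str, Any],
                    theirs: Dict[str, Any],
                    path: str = "") -> Tuple[Dict[str, Any], List[str]]:
    """Merge two edited versions of a document against their common ancestor.

    Returns the merged document plus the paths of conflicting fields.
    Conflicting fields keep "our" value as a placeholder pending resolution.
    """
    merged: Dict[str, Any] = {}
    conflicts: List[str] = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        here = f"{path}/{key}"
        if o == t:
            result = o                      # both sides agree (or both deleted)
        elif o == b:
            result = t                      # only their side changed
        elif t == b:
            result = o                      # only our side changed
        elif all(isinstance(v, dict) for v in (b, o, t)):
            result, nested = three_way_merge(b, o, t, here)
            conflicts.extend(nested)        # recurse into nested structures
        else:
            conflicts.append(here)          # both sides changed it differently
            result = o
        if result is not _MISSING:
            merged[key] = result
    return merged, conflicts


base = {"title": "Hello", "body": {"intro": "Hi", "outro": "Bye"}, "tags": ["a"]}
ours = {"title": "Hello!", "body": {"intro": "Hi there", "outro": "Bye"},
        "tags": ["a"]}
theirs = {"title": "Hello?", "body": {"intro": "Hi", "outro": "Goodbye"},
          "tags": ["a", "b"]}

merged, conflicts = three_way_merge(base, ours, theirs)
print(conflicts)       # ['/title']
print(merged["body"])  # {'intro': 'Hi there', 'outro': 'Goodbye'}
print(merged["tags"])  # ['a', 'b']
```

Even this toy needs the common ancestor of both versions, which means the repository has to track history as a graph of snapshots. A real implementation would also have to merge relationships and binary objects and cope with multiple common ancestors, which is a large part of why I think these features belong in the repository itself.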
Next up I’ll be reviewing the repository-level requirements for Presentation Management CMSes, and we’ll repeat this comparison process to see how NoSQL technologies stack up against that use case. Fingers crossed for a more positive outcome!
1 Standard practice in graph and relational databases is to decompose complex objects into separate vertexes / rows with connecting edges / foreign keys. I don’t particularly like this approach since it confuses true inter-object relationships (“links” between otherwise independent content items) with the internally rich data structures of those content items. With a relational database you can fake it by using 1-M foreign key constraints with CASCADE DELETE, although I see that as a weak substitute for “real” richly structured content types. Both relational and graph databases also impose “reconstitution costs” (joins) when an object is retrieved in its entirety – document databases don’t have this issue (which is nicely described in The Future of Databases).
2 While it is possible to store large binary objects in most graph databases, some (notably Neo4j) don’t recommend doing so.
3 While it is possible to store large binary objects in most relational databases (via LOB data types), this is generally frowned upon as relational databases typically aren’t very efficient at handling them and there are no standard ways to perform certain types of I/O within a single binary object (e.g. pulling out particular byte ranges, random I/O, etc.). In fact I was very close to putting a cross in this cell in the matrix – there are very good reasons why most CMSes that use a relational database choose to store binaries outside in a “real” filesystem.
4 Virtually all of the NoSQL classes mentioned here support Schema Evolution but not directly – instead they are “schemaless”, which means they don’t impose or require that the data model is declared to the repository up front. This is in fact an ideal way to support schema evolution, by allowing higher level CMS-specific logic to decide what a “schema” is, and how schema evolution will be supported, without the underlying persistence mechanism getting in the way. Compare this to the equivalent operation in a relational database (while recalling the good old days of multi-day “ALTER TABLE” marathons!) and you’ll quickly appreciate how liberating a schemaless repository can be.
5 Typical BigTable-like systems aren’t fully schemaless, in that some form of up front “data modeling” is required, and once instances of that schema exist the options for modifying those models are limited.
6 Most NoSQL solutions only support ACIDity in predefined ways – they don’t typically allow external logic to programmatically demarcate arbitrary transactional boundaries. For example, document databases typically only support ACIDity for a single document at a time, but not for batches of documents (the graph database Neo4j is a notable exception to this general trend). In my opinion this is perfectly adequate for most CMS use cases, including the CPS use case.
7 MongoDB’s default behaviour is not what is traditionally thought of as ACID – in particular consistency and durability are relaxed in order to improve performance. That said MongoDB provides mechanisms for the implementer to increase both consistency and durability (at the cost of performance) – these facilities are described in this great guide.
8 Many in the NoSQL movement would tout this as the primary limitation of relational databases, however recall that for the Content Production use case, the scalability to large traffic volumes is not a requirement (an editorial team cannot generate anything approaching an internet scale traffic load), and the data volumes are not necessarily beyond what a relational database can handle either (a typical web site has 10s to 100s of thousands of discrete content items, and when stored in a relational data model that would typically equate to one or two orders of magnitude more rows). Some might argue I’m being somewhat lenient on relational databases in this analysis, but there are reasons they’ve enjoyed such widespread adoption for so long, beyond the momentum / familiarity arguments.
9 CouchDB stands out in this regard, with exceptional support for geographical distribution. It achieves this by offering a replication mechanism that can be used to create multi-master topologies with automatic conflict detection (so that concurrent updates in different places are detected, and can be resolved by higher level logic – in our case that would be done by the hypothetical CPS logic that’s sitting on top of CouchDB). CouchDB’s replication functionality is, in my opinion, a perfect fit for this requirement, and we’ll be hearing more about it in my upcoming post on NoSQL in the context of Document Management CMSes.
10 “Relational” isn’t the most accurate term to describe the “relationship / reference / association” part of this hypothetical system, but I haven’t come up with a better alternative.