BigTable | The CMS Curmudgeon

NoSQL and CMS – Comparing NoSQL with CPS Requirements

Now that we have our CPS requirements mapped out and a reasonable understanding of the NoSQL universe, we can finally get to the fun stuff: mapping requirements to technologies.

For this kind of exercise I’m quite partial to ye olde comparison matrix – a table with the alternatives listed horizontally (in our case that’s the different classes of NoSQL technology) and with comparison criteria listed vertically (in our case our CPS requirements). You then play a “fill in the blanks” game for all of the cells in the table.

While simple and effective, comparison matrices have a tendency to be rather information-dense – for that reason I’ve elected to use boolean yes / no values in each cell, rather than more granular scoring systems (such as numerical scales). If we were doing a more thorough comparison (say, if we were actually implementing a CPS on top of a NoSQL technology) I would want to use a variety of more granular scoring systems.

Let’s dive in:

# | Requirement | Key / Value | BigTable-like | Document | Graph | Relational (for comparison)
1 | Richly structured content types | | | | ¹ | ¹
2 | Unstructured binary objects | | | | ² | ³
3 | Relationships / references / associations | | | | |
4 | Schema evolution⁴ | | ⁵ | | |
5 | Branch / merge | | | | |
6 | Snapshot-based versioning | | | | |
7 | ACID transactions⁶ | | | ⁷ | |
8 | Scalability to large content sets | | | | | ⁸
9 | Geographic distribution | | | ⁹ | |
  | REQUIREMENTS MET | 4 | 4 | 5 | 5 | 4

One of the nice things about comparison matrices is that they immediately highlight the differentiating criteria for any given comparison, allowing one to focus the discussion on those areas where compromises will be necessary.

Looking at the matrix above, the eye is immediately drawn to rows 1, 2 and 3, which is where the different classes of NoSQL vary. What this tells us is that there’s no “perfect” class of NoSQL for our CPS requirements – none of them satisfies all three of the following requirements:

Richly structured content types
Unstructured binary objects
Relationships / references / associations

What’s worse are rows 5 and 6, and (to a lesser extent) row 9. These rows indicate that there are entire requirements that no extant NoSQL technology satisfies.

In looking at these 3 requirements in isolation, there is an obvious class of systems that do satisfy them – Source Code Management (SCM) systems, especially Distributed SCMs (Mercurial, Bazaar, git and the like). It is important to stress that DSCMs don’t meet many of the other CPS requirements however (for one thing the “data model” they expose is a filesystem, which is completely unsuited for CPS content modeling), so they’re by no means an appropriate CPS repository in and of themselves.

The conclusion I’ve drawn is that the ideal CPS repository would be a three-way love child between 2 classes of NoSQL solution (document and graph databases) and DSCM. Specifically:

Document database style data model (satisfies requirements #1, #2, #4 and #7) plus Graph database style relationships (satisfies requirement #3)
SCM style versioning and branch/merge (satisfies requirements #5 and #6)
DSCM or CouchDB style multi-master replication and conflict detection (satisfies requirement #9)

I think of this hybrid as a “Document-Relational¹⁰ Version Control System” (DRVCS for the acronym junkies).
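To make the idea a little more concrete, here is a rough in-memory sketch of how the three ingredients might fit together. Everything in it is hypothetical: the `Repo` class, its method names and the (id, label, id) link triples are my own illustration, not an existing product or API. Documents are nested dicts, links are kept separate from document structure, every commit is a whole-repository snapshot, and the merge is a deliberately naive three-way merge at whole-document granularity that needs the common ancestor passed in and ignores deletions.

```python
import copy
import itertools


class Repo:
    """Toy 'DRVCS': documents plus links, versioned as whole-repo snapshots."""

    def __init__(self):
        self._ids = itertools.count(1)
        # commit id -> (parent commit id, documents, links)
        self.commits = {0: (None, {}, frozenset())}
        self.branches = {"main": 0}

    def commit(self, branch, docs=None, links=()):
        """Snapshot the branch head plus the given changes; return the commit id."""
        parent = self.branches[branch]
        _, old_docs, old_links = self.commits[parent]
        new_docs = copy.deepcopy(old_docs)
        new_docs.update(copy.deepcopy(docs or {}))
        commit_id = next(self._ids)
        self.commits[commit_id] = (parent, new_docs, old_links | frozenset(links))
        self.branches[branch] = commit_id
        return commit_id

    def branch(self, name, source="main"):
        """A branch is just a named pointer to a commit."""
        self.branches[name] = self.branches[source]

    def snapshot(self, commit_id):
        """Return copies of the documents and links as of a given commit."""
        _, docs, links = self.commits[commit_id]
        return copy.deepcopy(docs), set(links)

    def merge(self, target, source, base):
        """Merge source into target; return conflicting document ids (if any)."""
        base_docs, _ = self.snapshot(base)
        ours, our_links = self.snapshot(self.branches[target])
        theirs, their_links = self.snapshot(self.branches[source])
        merged, conflicts = dict(ours), []
        for doc_id, their_doc in theirs.items():
            base_doc, our_doc = base_docs.get(doc_id), ours.get(doc_id)
            if their_doc == base_doc or their_doc == our_doc:
                continue  # unchanged on source, or both sides agree
            if our_doc == base_doc:
                merged[doc_id] = their_doc  # only the source branch changed it
            else:
                conflicts.append(doc_id)  # both sides changed it differently
        if not conflicts:
            self.commit(target, merged, our_links | their_links)
        return conflicts


repo = Repo()
base = repo.commit(
    "main",
    {"article-1": {"title": "Hello", "body": {"paras": ["Hi"]}},
     "author-1": {"name": "Ann"}},
    links=[("article-1", "written_by", "author-1")],
)
repo.branch("redesign")
repo.commit("redesign", {"article-1": {"title": "Hello, world", "body": {"paras": ["Hi"]}}})
repo.commit("main", {"author-1": {"name": "Ann Lee"}})
conflicts = repo.merge("main", "redesign", base)  # edits touch different documents
```

Even this toy version hints at where the hard parts are: finding the common ancestor automatically, merging inside a document rather than replacing it wholesale, and handling deletions and links that point at documents which no longer exist.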

All three kinds of system meet requirement #8 when used in isolation, and a cursory analysis doesn’t reveal any obvious reasons why the DRVCS couldn’t retain this characteristic (although the devil is often in the details when it comes to questions of scale).

The bigger questions here are whether anyone is considering implementing such a repository, and if not whether it’s possible to emulate the missing requirements on top of any of the extant NoSQL technologies.

To the first point, I’m not aware of any efforts aimed at implementing a DRVCS, although I’m by no means familiar with all of the repository development going on out there. I’m also sure I’ll receive comments that such-and-such a technology already meets all of these requirements. I’m skeptical that such a beast actually exists, but would love to be proven wrong – what would convince me is the addition of the technology to the comparison matrix above, with a brief description in each of the new column’s cells describing how the technology satisfies each requirement.

I’ve also spent a bit of time thinking about whether it would be possible to build out the missing features on top of an extant NoSQL technology, and at this stage my gut tells me that the branch / merge and snapshot-based versioning requirements would be difficult (and probably impractical) to build on top of any of these technologies. These features are technically complex enough that they deserve first class support in the repository, rather than being implemented as an “emulation” layer on top of something else.

Next Up

Next up I’ll be reviewing the repository-level requirements for Presentation Management CMSes, and we’ll repeat this comparison process to see how NoSQL technologies stack up against that use case. Fingers crossed for a more positive outcome!

¹ Standard practice in graph and relational databases is to decompose complex objects into separate vertexes / rows with connecting edges / foreign keys. I don’t particularly like this approach since it confuses true inter-object relationships (“links” between otherwise independent content items) with the internally rich data structures of those content items. With a relational database you can fake it by using 1-M foreign key constraints with ON DELETE CASCADE, although I see that as a weak substitute for “real” richly structured content types. Both relational and graph databases also impose “reconstitution costs” (joins) when an object is retrieved in its entirety – document databases don’t have this issue (it is nicely described in The Future of Databases).
² While it is possible to store large binary objects in most graph databases, some (notably Neo4J) don’t recommend doing so.
³ While it is possible to store large binary objects in most relational databases (via LOB data types), this is generally frowned upon as relational databases typically aren’t very efficient at handling them and there are no standard ways to perform certain types of I/O within a single binary object (e.g. pulling out particular byte ranges, random I/O, etc.). In fact I was very close to putting a cross in this cell in the matrix – there are very good reasons why most CMSes that use a relational database choose to store binaries outside in a “real” filesystem.
⁴ Virtually all of the NoSQL classes mentioned here support Schema Evolution but not directly – instead they are “schemaless”, which means they don’t impose or require that the data model is declared to the repository up front. This is in fact an ideal way to support schema evolution, by allowing higher level CMS-specific logic to decide what a “schema” is, and how schema evolution will be supported, without the underlying persistence mechanism getting in the way. Compare this to the equivalent operation in a relational database (while recalling the good old days of multi-day “ALTER TABLE” marathons!) and you’ll quickly appreciate how liberating a schemaless repository can be.
⁵ Typical BigTable-like systems aren’t fully schemaless, in that some form of up front “data modeling” is required, and once instances of that schema exist the options for modifying those models are limited.
⁶ Most NoSQL solutions only support ACIDity in predefined ways – they don’t typically allow external logic to programmatically demarcate arbitrary transactional boundaries. For example document databases typically only support ACIDity for a single document at a time, but not for batches of documents (note that the graph database Neo4J is a notable exception to this general trend). In my opinion this is perfectly adequate for most CMS use cases, including the CPS use case.
⁷ MongoDB’s default behaviour is not what is traditionally thought of as ACID – in particular consistency and durability are relaxed in order to improve performance. That said MongoDB provides mechanisms for the implementer to increase both consistency and durability (at the cost of performance) – these facilities are described in this great guide.
⁸ Many in the NoSQL movement would tout this as the primary limitation of relational databases, however recall that for the Content Production use case, the scalability to large traffic volumes is not a requirement (an editorial team cannot generate anything approaching an internet scale traffic load), and the data volumes are not necessarily beyond what a relational database can handle either (a typical web site has 10s to 100s of thousands of discrete content items, and when stored in a relational data model that would typically equate to one or two orders of magnitude more rows). Some might argue I’m being somewhat lenient on relational databases in this analysis, but there are reasons they’ve enjoyed such widespread adoption for so long, beyond the momentum / familiarity arguments.
⁹ CouchDB stands out in this regard, with exceptional support for geographical distribution. It achieves this by offering a replication mechanism that can be used to create multi-master topologies with automatic conflict detection (so that concurrent updates in different places are detected, and can be resolved by higher level logic – in our case that would be done by the hypothetical CPS logic that’s sitting on top of CouchDB). CouchDB’s replication functionality is, in my opinion, a perfect fit for this requirement, and we’ll be hearing more about it in my upcoming post on NoSQL in the context of Document Management CMSes.
¹⁰ “Relational” isn’t the most accurate term to describe the “relationship / reference / association” part of this hypothetical system, but I haven’t come up with a better alternative.

Published in: nosql, wcm on 2010-07-12 at 8:06 am
Tags: bazaar, BigTable, branch, cassandra, comparison, couchdb, cps, document, drvcs, dscm, git, graph, key-value, mercurial, merge, mongodb, neo4j, nosql, scm, snapshot, version, wcm

NoSQL and CMS – A Brief Overview of the “nosql” Universe

Before we compare our list of CPS requirements against the universe of NoSQL, it helps to have an understanding of what NoSQL actually is. As it turns out, NoSQL doesn’t refer to a single technology but rather a grab bag of otherwise unrelated technologies that are only loosely “related” by virtue of not being based on relational (specifically SQL) technologies.

And you thought the NoSQL moniker was simply a clever marketing ploy to incite rage amongst relational aficionados! 🙂

While Wikipedia lists no fewer than 7 major types of NoSQL solution, the “Big 4” that are discussed most frequently are:

Key / Value
BigTable-like
Document
Graph

Let’s look at each of them in turn.

Key / Value

These solutions are typically little more than sophisticated distributed hash tables, often adding direct support for some combination of persistence (durability of data across restarts), replication (physical duplication of data, typically across multiple servers) and sharding (partitioning of data into discrete subsets, each of which is stored on separate sets of servers). They often restrict what data can be used for keys, while allowing values to be any kind of data that can be serialised. Typically querying can only be done by key.
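As a rough illustration of the shape of such a store, here is a minimal single-process sketch. The `ShardedKVStore` class and its methods are hypothetical names of my own, and the “shards” are just in-memory dicts standing in for separate servers; persistence and replication are left out entirely.

```python
import hashlib
import pickle


class ShardedKVStore:
    """Toy key / value store: string keys, opaque serialised values, key-only lookup."""

    def __init__(self, shard_count=4):
        # Each dict stands in for a separate server holding one partition of the data.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Hash the key to decide which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def put(self, key, value):
        if not isinstance(key, str):
            raise TypeError("keys are restricted to strings in this sketch")
        # The value is stored as an opaque blob; the store never looks inside it.
        self._shard_for(key)[key] = pickle.dumps(value)

    def get(self, key):
        return pickle.loads(self._shard_for(key)[key])


store = ShardedKVStore()
store.put("article:42", {"title": "Hello", "tags": ["nosql", "cms"]})
article = store.get("article:42")
```

Note what is missing: because values are opaque to the store, there is no way to ask for “all articles tagged nosql” short of fetching every key and inspecting the values yourself.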

Some examples:

BigTable-like

These solutions are based on Google’s BigTable system (used internally by Google, as well as being the primary data storage facility available to Google App Engine developers). They can be thought of as being one step above a Key / Value store in that the values are not simply unstructured binary blobs that are opaque to the store, but are instead structured data elements that can be used for additional non-key based queries.

In some respects these solutions are quite similar to a relational database, minus explicit foreign keys, and with support for different “rows” in the same “table” having different “columns” (this is somewhat of an over-simplification, but from a data/content modeling perspective is reasonably accurate).
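The same over-simplification can be written down in a few lines. The `WideTable` class below is a hypothetical illustration of the data model only; real BigTable-like systems add a good deal more (column families, for one) that this sketch ignores.

```python
class WideTable:
    """Toy BigTable-like table: each row is keyed, and rows may have different columns."""

    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def put(self, row_key, **columns):
        # Add or overwrite columns on a row; other rows are unaffected.
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        return dict(self.rows[row_key])

    def where(self, column, value):
        """Non-key query: sorted row keys whose `column` equals `value`."""
        return sorted(key for key, cols in self.rows.items() if cols.get(column) == value)


table = WideTable()
table.put("article:1", title="Hello", status="draft")
table.put("article:2", title="World", status="live", reviewer="ann")  # extra column
table.put("image:1", mime_type="image/png", status="live")            # different columns
live = table.where("status", "live")
```

Unlike the key / value case, the store can see inside the values, so a query on `status` is possible without the caller unpacking every row. There are still no foreign keys, so nothing ties `reviewer="ann"` to any other row.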

Some examples:

Document

Document databases store “flat” collections of structured documents – in this case “document” does not (as the name might suggest) mean binary documents (Word, PDF etc.), but rather structured data objects with potentially rich internal structures.

The current generation of document databases have, for the most part, standardised on JSON as the underlying data format for documents, however I consider XML databases to fall under this umbrella as well (albeit many of those have additional facilities for slicing and dicing XML in various weird and unnatural ways that are illegal in some states).

Interestingly, query facilities vary widely between the extant document databases, with some offering query facilities on par with relational databases, while others don’t provide anything resembling a traditional query capability†.
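For a feel of what “rich internal structure” means in practice, here is a small sketch of a flat collection of JSON documents. The `Collection` class and its dotted-path `find` method are hypothetical, made up for this illustration; they are not modelled on the query language of any particular product.

```python
import json


class Collection:
    """Toy document collection: a flat set of JSON documents addressed by id."""

    def __init__(self):
        self._docs = {}  # document id -> JSON text

    def save(self, doc_id, document):
        self._docs[doc_id] = json.dumps(document)

    def load(self, doc_id):
        # The whole nested document comes back in one read; nothing to join.
        return json.loads(self._docs[doc_id])

    def find(self, path, value):
        """Sorted ids of documents whose dotted `path` resolves to `value`."""
        hits = []
        for doc_id, text in self._docs.items():
            node = json.loads(text)
            for part in path.split("."):
                node = node.get(part) if isinstance(node, dict) else None
            if node == value:
                hits.append(doc_id)
        return sorted(hits)


articles = Collection()
articles.save("a1", {
    "title": "Hello",
    "meta": {"lang": "en", "tags": ["nosql", "cms"]},
    "body": {"sections": [{"heading": "Intro", "paras": ["Hi there."]}]},
})
articles.save("a2", {"title": "Bonjour", "meta": {"lang": "fr"}})
english = articles.find("meta.lang", "en")
```

The sections, paragraphs and tags all live inside the one document rather than in separate rows or vertexes, which is exactly the property the CPS “richly structured content types” requirement is after.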

Some examples:

Graph

Graph databases are unquestionably the oldest NoSQL solution, with at least one example predating the relational model by several years!

In these databases, data is stored as a series of discrete objects (“vertexes”) connected by zero or more relationships (“edges”) to one another. Typically the objects are simple hash table data structures and cannot have rich internal data structures (in contrast to a document in a Document database).
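A minimal sketch of that model, assuming nothing beyond what is described above: the `Graph` class and its methods are hypothetical names, vertexes carry only flat properties, and edges are labelled and directed.

```python
from collections import defaultdict


class Graph:
    """Toy graph store: flat-property vertexes connected by labelled, directed edges."""

    def __init__(self):
        self.vertexes = {}              # vertex id -> flat property dict
        self.edges = defaultdict(list)  # source vertex id -> [(label, target vertex id)]

    def add_vertex(self, vertex_id, **properties):
        self.vertexes[vertex_id] = properties

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbours(self, vertex_id, label):
        """Targets reachable from `vertex_id` over edges with the given label."""
        return [target for (edge_label, target) in self.edges.get(vertex_id, [])
                if edge_label == label]


g = Graph()
g.add_vertex("article-1", title="Hello")
g.add_vertex("author-1", name="Ann")
g.add_vertex("section-1", heading="Intro")
g.add_edge("article-1", "written_by", "author-1")   # a true inter-object relationship
g.add_edge("article-1", "has_section", "section-1")  # internal structure, also an edge
authors = g.neighbours("article-1", "written_by")
```

Because a vertex cannot hold nested structure, the article’s section has to become its own vertex, so the link to the author and the link to the section end up looking like the same kind of thing.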

Some examples:

Summary

Borrowing a nice diagram from the Neo4J folks, one way of comparing the different classes of NoSQL solution is as follows:

Note: when looking at this graph I mentally replace the “Complexity” label with “Sophistication of data modeling” – the diagram is equally accurate with that substitution and to my mind that’s a more interesting picture (not to mention more relevant to the discussion of CMS and NoSQL).

In the next post we’ll look at how these classes of NoSQL solution compare to the CPS requirements we previously identified.

Addendum

There are any number of good NoSQL primers available on the interwebitubes, and I’d encourage you to read them if you’re new to the topic. I particularly like:

Slides 11 through 17 of “A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)” by Neo4J’s Emil Eifrem.
NoSQL – the Shift to a Non-Relational World
NoSQL – Death to Relational Databases(?)
Ricky Ho’s NOSQL Patterns (if you’re after a little more “meat”)

† Before I get flamed to a burnt crisp by the CouchDB fanbois, yes I’m quite familiar with map/reduce “materialised” views – I simply don’t consider that to be a “real” query mechanism. This feature also runs afoul of my “avoid crystal ball gazing at all costs” principle, but that’s a topic for another day.

Published in: nosql on 2010-07-08 at 6:36 pm
Tags: allegrograph, BigTable, cassandra, couchdb, document, dynamo, exist, graph, hbase, infogrid, key-value, marklogic, memcache, mongodb, neo4j, nosql, voldemort
